Eight recurring patterns. Six fully evidenced from Phoenix operations. Not isolated incidents — operational classes.
Individual failures look random. Failure families are structural. The distinction matters because it changes the response: a random incident calls for a one-time fix; a structural family calls for a standing standard.
These eight families emerged from repeated observation across different agents, different tasks, and different harness configurations. They recur because the underlying causes — probabilistic execution, underspecified environments, imperfect human-built runtimes — are not eliminated by model upgrades alone.
Families marked "evidenced" have documented Phoenix incidents with two evidence layers. Those marked "taxonomy" are covered by their adjacent evidenced families.
The system moves away from the intended domain surface, benchmark frame, or allowed tool set — without fully losing the original task. The work continues but on the wrong track.
Drift is the broad class. Tool-surface drift (family 2) is its most common, most recoverable subtype in Phoenix work. The general drift pattern covers cases where the frame itself shifts rather than just the tool selection.
Adequately covered by Family 2 evidence in the current record.
The system reaches for invented or generic tools instead of the real deterministic substrate. User intent is preserved; execution posture is wrong.
The V4 ski-chalet evidence shows the exact pattern: in all three recorded rows, the local model understood the question but immediately proposed non-Phoenix execution paths — a Wikipedia-style lookup, invented tool surfaces (TennisStatsPro, TennisRankingsAPI), generic directional answers in place of deterministic tool execution.
One structured redirect was enough to recover in each row. The failure is tractable — a control-layer problem, not a model-quality problem — but it recurs without explicit harness pressure.
Standard: Hold the candidate-tool list at session start. Make the allowed execution surface explicit in the harness prompt rather than expecting the local model to infer it autonomously.
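A minimal sketch of what this standard can look like at the harness layer. All names here (ALLOWED_TOOLS, the phoenix.* tool identifiers) are hypothetical, not the actual Phoenix surface; the point is that the allowed tool list is pinned at session start and enforced before execution rather than inferred by the model.

```python
# Minimal sketch (hypothetical names): pin the allowed tool surface at session
# start and reject any call that falls outside it.

ALLOWED_TOOLS = {"phoenix.query", "phoenix.lookup_stats", "phoenix.report"}

def build_harness_prompt(task: str) -> str:
    """Embed the explicit execution surface in the prompt the worker sees."""
    tool_list = "\n".join(f"- {name}" for name in sorted(ALLOWED_TOOLS))
    return (
        f"Task: {task}\n"
        "You may ONLY call the following tools. Do not invent or substitute tools:\n"
        f"{tool_list}\n"
        "If none of these tools can answer, say so instead of improvising."
    )

def check_tool_call(tool_name: str) -> None:
    """Control-layer guard: the structured redirect fires before execution."""
    if tool_name not in ALLOWED_TOOLS:
        raise ValueError(
            f"Tool '{tool_name}' is outside the pinned execution surface; "
            "redirect the worker to the allowed tool list."
        )
```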
The project continues climbing after the useful signal has already been obtained. Scope expands, comparability weakens, runtime confounds grow.
The signal was already sufficient to answer the construction question. But the benchmark work drifted into larger task packs, broader domain expansion, and heavy-query runtime confounding anyway — because momentum filled the gap left by absent stopping discipline.
Summit fever is not a model-only problem. It is a project-control problem. The clean V1–V3 matrix was defined specifically in response to this episode.
Standard: Define stop conditions before the experiment starts. Scope corrections mid-run require explicit reclassification of work already done — not re-labeling.
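A minimal sketch of pre-declared stop conditions, assuming a simple row-based benchmark loop. The field names and thresholds are illustrative, not the actual V1–V3 matrix definitions; the point is that the stopping rule exists before the first row is gathered and is checked before every new one.

```python
# Minimal sketch (hypothetical structure): stop conditions are declared before
# the run and checked before every new row, so momentum cannot silently expand
# scope once the signal is sufficient.

from dataclasses import dataclass

@dataclass(frozen=True)
class StopConditions:
    max_rows: int          # hard ceiling on rows gathered
    target_question: str   # the one question this run is allowed to answer
    signal_threshold: int  # rows of consistent signal needed to stop early

def should_stop(rows_gathered: int, consistent_signal_rows: int,
                conditions: StopConditions) -> bool:
    """Return True once the pre-declared stopping rule is met."""
    if rows_gathered >= conditions.max_rows:
        return True
    if consistent_signal_rows >= conditions.signal_threshold:
        return True
    return False
```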
The wrong files, wrong artifacts, or wrong parts of the task are given priority. The run proceeds and may be internally coherent — but it is structurally wrong from the start because the routing decision was made against the wrong context.
The ShowcaseAgent routing evidence shows two structurally coherent failures where correct query intent was mapped to the wrong domain. These did not look like random failures — they looked like domain-boundary ambiguity at a specific router seam.
Bad context selection is not always visible as a failure. The run completes; the output is internally coherent. A trace layer that preserves selection reasoning is the earliest detection point.
Standard: Build a structured trace layer that preserves selection reasoning. Isolated benchmark misses may be a miss family in disguise — the useful unit is the router seam, not the individual row.
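A minimal sketch of a selection trace record, assuming the router seam can emit the candidates it considered and its stated reasoning. Field names are illustrative; the point is that the reasoning is preserved at decision time, not reconstructed after a miss.

```python
# Minimal sketch (hypothetical fields): every routing decision is logged with
# the candidates considered and the stated reasoning, so a coherent-but-wrong
# run can be diagnosed at the router seam after the fact.

import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class SelectionTrace:
    query: str
    candidates: list          # everything the router considered
    selected: str             # what it actually routed to
    reasoning: str            # the selection reasoning, preserved verbatim
    timestamp: float = field(default_factory=time.time)

def record_trace(trace: SelectionTrace, path: str = "selection_trace.jsonl") -> None:
    """Append one selection decision to a JSONL trace file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(trace)) + "\n")
```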
A result appears successful because the wrong validator, wrong file, or wrong artifact path was used. The trusted execution path failed; the surface check passed.
Two Phoenix instances: initial V4 rows gathered against a misframed question (generic usability, not the ski-chalet pattern) appeared to confirm progress but were testing the wrong thing; and individual runs that passed the surface validator while lacking full use-check artifact preservation were classified INVALID_CLAIM rather than allowed to count as results.
The SOFT_PASS label was a direct operational response to this hazard — preventing unqualified pass status when autonomous stability was not demonstrated.
Standard: Freeze the experimental frame before gathering rows. A row cannot be promoted retroactively into evidence for a question it was not designed to test.
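A minimal sketch of frame freezing, assuming the experimental frame (question, task set, validator) can be serialized. The hashing scheme is illustrative; the point is that each row is bound to the frame it was gathered under and cannot be promoted into evidence for a different question.

```python
# Minimal sketch (hypothetical scheme): the experimental frame is hashed and
# frozen before any rows are gathered; a row carries the hash of the frame it
# was gathered under and only counts as evidence for that frame.

import hashlib
import json

def freeze_frame(frame: dict) -> str:
    """Hash the frame definition (question, task set, validator) at start."""
    canonical = json.dumps(frame, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def row_counts_as_evidence(row_frame_hash: str, current_frame_hash: str) -> bool:
    """A row gathered under a different frame is not evidence for this one."""
    return row_frame_hash == current_frame_hash
```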
The system keeps cycling through a bad local pattern without a meaningful state correction. Repeated reformulation, repeated near-miss repair, or repeated return to the same bad operating posture — with no visible state gain.
The Phoenix record includes a formal escalation structure (DOOM_LOOP_ESCALATION_RECORD.md) built specifically because this pattern recurred. Representative logged escalation: "I am retrying the same fix path and not learning anything new from the reruns."
The doom loop threshold is not every hesitation or retry. It is no visible state gain across meaningful action steps.
Standard: Preserve current artifacts first, then change strategy. The worker's self-summary is the key diagnostic: if it cannot describe what changed, the loop is confirmed.
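A minimal sketch of the loop check, using identical consecutive self-summaries as a crude proxy for "no visible state gain." The window size and the comparison are illustrative assumptions, not the actual escalation-record thresholds.

```python
# Minimal sketch (hypothetical thresholds): compare the worker's self-summary
# of what changed across consecutive action steps; if nothing new appears for
# several steps in a row, treat it as a confirmed doom loop and escalate.

LOOP_WINDOW = 3  # consecutive steps with no described state gain

def is_doom_loop(recent_summaries: list[str], window: int = LOOP_WINDOW) -> bool:
    """A loop is confirmed when the last `window` summaries describe no change."""
    if len(recent_summaries) < window:
        return False
    tail = recent_summaries[-window:]
    # If the worker cannot describe anything new, the summaries converge.
    return len(set(tail)) == 1

def on_loop_detected(preserve_artifacts, change_strategy) -> None:
    """Preserve current artifacts first, then change strategy."""
    preserve_artifacts()
    change_strategy()
```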
The system declares a result finished before it has been confirmed trustworthy. A partial result passes a surface check; the full trusted execution path has not completed; stopping discipline is absent.
Phoenix evidence includes: one INVALID_CLAIM run that passed the surface validator but lacked use-check artifact preservation; two PARTIAL runs with real progress that were not trustworthy as results; and a V3 spec directive that blocked execution until an objective PARTIAL threshold was defined.
Standard: Surface validator exit code is not completion. A result is complete only when the full trusted execution path — including artifact preservation through the final phase — has been confirmed. The PARTIAL threshold must be objective, not discretionary.
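A minimal sketch of a completion gate expressed with the labels used in the Phoenix record (PASS, SOFT_PASS, PARTIAL, INVALID_CLAIM). The promotion rules below are illustrative assumptions, not the project's actual thresholds; the point is that the validator exit code alone can never produce PASS.

```python
# Minimal sketch (hypothetical rules): the surface validator alone never
# promotes a run; artifact preservation and final-phase confirmation are
# required, and autonomous stability distinguishes PASS from SOFT_PASS.

from enum import Enum

class RunStatus(Enum):
    PASS = "PASS"
    SOFT_PASS = "SOFT_PASS"          # passed, but autonomous stability not shown
    PARTIAL = "PARTIAL"              # real progress, not trustworthy as a result
    INVALID_CLAIM = "INVALID_CLAIM"  # surface check passed, trusted path did not

def classify_run(validator_passed: bool, artifacts_preserved: bool,
                 final_phase_confirmed: bool, autonomous: bool) -> RunStatus:
    if validator_passed and not (artifacts_preserved and final_phase_confirmed):
        return RunStatus.INVALID_CLAIM
    if validator_passed and artifacts_preserved and final_phase_confirmed:
        return RunStatus.PASS if autonomous else RunStatus.SOFT_PASS
    return RunStatus.PARTIAL
```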
A specific subtype of false success where the validator path itself is structurally disconnected from the trusted execution path. The validator runs; it passes; the thing it validated was not the result.
This family is treated as a standalone taxon because the failure mechanism — validator disconnection — is distinct from the experimental-frame error in Family 5. In the current Phoenix record, both instances are covered by the False Success evidence, so a separate standalone case is not required.
Adequately covered by Family 5 evidence in the current record.
These families recur because their root causes are structural, not accidental: the same probabilistic execution, underspecified environments, and imperfect human-built runtimes named at the top of this section.
This explains why waiting for more raw model power is not a complete answer. Better models may reduce some mistakes. But drift, false success, bad context selection, and doom loops are operational hazards inside complex systems — not pure intelligence deficits.