Why Strong Agentic Systems Still Collapse In Familiar Ways
Successful outputs vary widely: probabilistic systems can explore many useful paths.
Failures recur in recognizable families: the hazard space is structural and far less open-ended than the success space.
Waiting for more raw model power is not a complete answer.
Recent work with Claude, Codex, local agents, and supervised local workflows surfaces an important asymmetry: agentic coding successes vary widely, but failures recur in recognizable families.
The success side is diverse because these systems are probabilistic and path-dependent. Strong coding agents can produce materially different but still excellent outcomes — different sequencing choices, different abstractions, different code shapes.
The failure side is less diverse. Drift, summit fever, bad context selection, false success, validator-path mistakes, and doom-loop behavior appear again and again. These are not quirks of one model. They are recurring hazards of human-built probabilistic systems operating in difficult, underspecified environments.
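Naming these families precisely makes them something a process can check for, not merely recognize in hindsight. Below is a minimal sketch in Python; `FailureFamily` and `RunRecord` are illustrative names invented here, not part of any agent framework. The key property the sketch captures is that the failure label set is small and closed, while the success space is not.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class FailureFamily(Enum):
    """Closed set of recurring failure families observed across runs."""
    DRIFT = auto()                   # work wanders from the stated objective
    SUMMIT_FEVER = auto()            # pushing to "finish" past a stop condition
    BAD_CONTEXT_SELECTION = auto()   # wrong files or docs pulled into context
    FALSE_SUCCESS = auto()           # success claimed without supporting artifacts
    VALIDATOR_PATH_MISTAKE = auto()  # validator checked the wrong target
    DOOM_LOOP = auto()               # repeated retries with no new information


@dataclass
class RunRecord:
    """One agent run, tagged after the fact by a supervisor."""
    run_id: str
    failures: list[FailureFamily] = field(default_factory=list)

    def tag(self, family: FailureFamily) -> None:
        """Record a failure family once, keeping the record deduplicated."""
        if family not in self.failures:
            self.failures.append(family)


# Usage: tag a run that claimed success with no artifacts, then looped.
record = RunRecord(run_id="example-run")
record.tag(FailureFamily.FALSE_SUCCESS)
record.tag(FailureFamily.DOOM_LOOP)
print(record.failures)
```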
The practical response is not to expect scaling alone to remove these hazards. It is to build better operating standards, better supervision, and a better record of lessons learned.
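Operating standards bite hardest when they are machine-checkable rather than prose. A minimal sketch of frozen stop conditions follows, assuming a hypothetical harness that calls `should_stop` before each agent step; all names and thresholds here are illustrative, not drawn from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopConditions:
    """Frozen before a run starts; changing these mid-run invalidates the run."""
    max_steps: int                  # hard cap on agent actions per run
    max_failed_validations: int     # consecutive validation failures before stop
    allowed_paths: tuple[str, ...]  # scope freeze: edits outside these stop the run


def should_stop(conditions: StopConditions, steps: int,
                failed_validations: int, touched_path: str) -> str | None:
    """Return a stop reason, or None if the run may continue."""
    if steps >= conditions.max_steps:
        return "step budget exhausted"
    if failed_validations >= conditions.max_failed_validations:
        return "doom-loop guard: repeated validation failures"
    if not touched_path.startswith(conditions.allowed_paths):
        return f"scope expansion: {touched_path} is outside the frozen scope"
    return None


# Usage: a run that edits a file outside its frozen scope is stopped midstream.
limits = StopConditions(max_steps=200, max_failed_validations=3,
                        allowed_paths=("src/parser/",))
print(should_stop(limits, steps=41, failed_validations=0,
                  touched_path="src/runtime/scheduler.py"))
```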
The common public question hides the more important one.
The public asks: "Which coding model is best?"
The better question is: "Why do different strong coding agents produce very different successful outputs, yet still fail in the same recurring ways?"
The most important differences are not model differences alone. They arise from the interaction of probabilistic models, imperfect harnesses, limited context, and human operating discipline.
This paper emerges from Project Phoenix operations, not impressions. That work generated more than code changes: it also produced experiment runbooks, stop conditions, invalid-claim reclassification rules, supervisor interventions, communication logs kept outside git, and lessons learned from actual drift and recovery.
Git history records file changes and final states. It does not reliably capture when drift was first noticed, why a run was invalidated, or how false-success conditions were detected. Supervisor and implementation communication logs act as a second historical layer — the record this paper is built from.
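If the communication logs are a second historical layer, the two layers can be read together as one timeline. A sketch follows, assuming the supervisor log is newline-delimited JSON with `ts` and `note` fields (a hypothetical format invented for illustration, not a standard schema).

```python
import json
import subprocess
from pathlib import Path


def git_events(repo: str) -> list[tuple[str, str]]:
    """(iso_timestamp, commit subject) pairs from git history."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=format:%aI\t%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    events = []
    for line in out.splitlines():
        ts, subject = line.split("\t", 1)
        events.append((ts, subject))
    return events


def supervisor_events(log_path: str) -> list[tuple[str, str]]:
    """(iso_timestamp, note) pairs from a JSON-lines supervisor log."""
    events = []
    for line in Path(log_path).read_text().splitlines():
        entry = json.loads(line)
        events.append((entry["ts"], f"[supervisor] {entry['note']}"))
    return events


def merged_timeline(repo: str, log_path: str) -> list[tuple[str, str]]:
    """Interleave both layers so drift and invalidation appear next to commits."""
    return sorted(git_events(repo) + supervisor_events(log_path))


# Usage (paths are illustrative):
# for ts, event in merged_timeline(".", "logs/supervisor.jsonl"):
#     print(ts, event)
```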
The implementation agent makes the run happen.
The supervisor makes the result believable.
Documented supervisory interventions included: refusing to count artifact-free success claims, invalidating runs when the validator checked the wrong file, separating characterization from comparison, freezing stop conditions, and stopping unjustified scope expansion midstream.
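Two of these interventions are mechanical enough to sketch. The gate below is hypothetical, not Project Phoenix's actual tooling: it refuses a success claim that arrives without artifacts, and it invalidates a run whose validator checked a different path than the one the run changed.

```python
from dataclasses import dataclass


@dataclass
class SuccessClaim:
    """An implementation agent's claim that a run succeeded."""
    run_id: str
    artifacts: list[str]   # paths to logs, diffs, test output
    changed_path: str      # file the run says it modified
    validated_path: str    # file the validator actually checked


def review(claim: SuccessClaim) -> tuple[bool, str]:
    """Supervisor gate: accept, or reject with a recorded reason."""
    if not claim.artifacts:
        return False, "rejected: artifact-free success claim does not count"
    if claim.validated_path != claim.changed_path:
        return False, (f"invalidated: validator checked {claim.validated_path}, "
                       f"but the run changed {claim.changed_path}")
    return True, "accepted"


# Usage: a validator-path mistake is caught before the claim is counted.
claim = SuccessClaim(run_id="example-run",
                     artifacts=["artifacts/test_output.txt"],
                     changed_path="src/parser/lexer.py",
                     validated_path="src/parser/lexer_old.py")
print(review(claim))
```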
Supervision did not merely improve answers. It improved trust.