Why Strong Agentic Systems Still Collapse In Familiar Ways
Successful outputs vary widely: probabilistic systems can explore many useful paths.
Failures recur in recognizable families: the hazard space is structural and far less open-ended than the success space.
Waiting for more raw model power is not a complete answer.
Recent work with Claude, Codex, local agents, and supervised local workflows surfaces an important asymmetry: agentic coding successes vary widely, but failures recur in recognizable families.
The success side is diverse because these systems are probabilistic and path-dependent. Strong coding agents can produce materially different but still excellent outcomes — different sequencing choices, different abstractions, different code shapes.
The failure side is less diverse. Drift, summit fever, bad context selection, false success, validator-path mistakes, and doom-loop behavior appear again and again. These are not quirks of one model. They are recurring hazards of human-built probabilistic systems operating in difficult, underspecified environments.
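Naming these families precisely makes them something a process can check for, not merely recognize in hindsight. Below is a minimal sketch in Python; `FailureFamily` and `RunRecord` are illustrative names invented here, not part of any agent framework. The key property the sketch captures is that the failure label set is small and closed, while the success space is not.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class FailureFamily(Enum):
    """Closed set of recurring failure families observed across runs."""
    DRIFT = auto()                   # work wanders from the stated objective
    SUMMIT_FEVER = auto()            # pushing to "finish" past a stop condition
    BAD_CONTEXT_SELECTION = auto()   # wrong files or docs pulled into context
    FALSE_SUCCESS = auto()           # success claimed without supporting artifacts
    VALIDATOR_PATH_MISTAKE = auto()  # validator checked the wrong target
    DOOM_LOOP = auto()               # repeated retries with no new information


@dataclass
class RunRecord:
    """One agent run, tagged after the fact by a supervisor."""
    run_id: str
    failures: list[FailureFamily] = field(default_factory=list)

    def tag(self, family: FailureFamily) -> None:
        """Record a failure family once, keeping the record deduplicated."""
        if family not in self.failures:
            self.failures.append(family)


# Usage: tag a run that claimed success with no artifacts, then looped.
record = RunRecord(run_id="example-run")
record.tag(FailureFamily.FALSE_SUCCESS)
record.tag(FailureFamily.DOOM_LOOP)
print(record.failures)
```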
The practical response is not to expect scaling alone to remove these hazards. It is to build better operating standards, better supervision, and a better record of lessons learned.
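Operating standards bite hardest when they are machine-checkable rather than prose. A minimal sketch of frozen stop conditions follows, assuming a hypothetical harness that calls `should_stop` before each agent step; all names and thresholds here are illustrative, not drawn from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopConditions:
    """Frozen before a run starts; changing these mid-run invalidates the run."""
    max_steps: int                  # hard cap on agent actions per run
    max_failed_validations: int     # consecutive validation failures before stop
    allowed_paths: tuple[str, ...]  # scope freeze: edits outside these stop the run


def should_stop(conditions: StopConditions, steps: int,
                failed_validations: int, touched_path: str) -> str | None:
    """Return a stop reason, or None if the run may continue."""
    if steps >= conditions.max_steps:
        return "step budget exhausted"
    if failed_validations >= conditions.max_failed_validations:
        return "doom-loop guard: repeated validation failures"
    if not touched_path.startswith(conditions.allowed_paths):
        return f"scope expansion: {touched_path} is outside the frozen scope"
    return None


# Usage: a run that edits a file outside its frozen scope is stopped midstream.
limits = StopConditions(max_steps=200, max_failed_validations=3,
                        allowed_paths=("src/parser/",))
print(should_stop(limits, steps=41, failed_validations=0,
                  touched_path="src/runtime/scheduler.py"))
```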
The common public question hides the more important one.
The public asks: "Which coding model is best?"
The better question is: "Why do different strong coding agents produce very different successful outputs, yet still fail in the same recurring ways?"
The most important differences are not model differences alone. They arise from the interaction of probabilistic models, imperfect harnesses, limited context, and human operating discipline.
This paper emerges from Project Phoenix operations, not impressions. That work generated more than code changes: it also produced experiment runbooks, stop conditions, invalid-claim reclassification rules, supervisor interventions, communication logs kept outside git, and lessons learned from actual drift and recovery.
Git history records file changes and final states. It does not reliably capture when drift was first noticed, why a run was invalidated, or how false-success conditions were detected. Supervisor and implementation communication logs act as a second historical layer — the record this paper is built from.
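If the communication logs are a second historical layer, the two layers can be read together as one timeline. A sketch follows, assuming the supervisor log is newline-delimited JSON with `ts` and `note` fields (a hypothetical format invented for illustration, not a standard schema).

```python
import json
import subprocess
from pathlib import Path


def git_events(repo: str) -> list[tuple[str, str]]:
    """(iso_timestamp, commit subject) pairs from git history."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=format:%aI\t%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    events = []
    for line in out.splitlines():
        ts, subject = line.split("\t", 1)
        events.append((ts, subject))
    return events


def supervisor_events(log_path: str) -> list[tuple[str, str]]:
    """(iso_timestamp, note) pairs from a JSON-lines supervisor log."""
    events = []
    for line in Path(log_path).read_text().splitlines():
        entry = json.loads(line)
        events.append((entry["ts"], f"[supervisor] {entry['note']}"))
    return events


def merged_timeline(repo: str, log_path: str) -> list[tuple[str, str]]:
    """Interleave both layers so drift and invalidation appear next to commits."""
    return sorted(git_events(repo) + supervisor_events(log_path))


# Usage (paths are illustrative):
# for ts, event in merged_timeline(".", "logs/supervisor.jsonl"):
#     print(ts, event)
```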
The implementation agent makes the run happen.
The supervisor makes the result believable.
Documented supervisory interventions included: refusing to count artifact-free success claims, invalidating runs when the validator checked the wrong file, separating characterization from comparison, freezing stop conditions, and stopping unjustified scope expansion midstream.
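Two of these interventions are mechanical enough to sketch. The gate below is hypothetical, not Project Phoenix's actual tooling: it refuses a success claim that arrives without artifacts, and it invalidates a run whose validator checked a different path than the one the run changed.

```python
from dataclasses import dataclass


@dataclass
class SuccessClaim:
    """An implementation agent's claim that a run succeeded."""
    run_id: str
    artifacts: list[str]   # paths to logs, diffs, test output
    changed_path: str      # file the run says it modified
    validated_path: str    # file the validator actually checked


def review(claim: SuccessClaim) -> tuple[bool, str]:
    """Supervisor gate: accept, or reject with a recorded reason."""
    if not claim.artifacts:
        return False, "rejected: artifact-free success claim does not count"
    if claim.validated_path != claim.changed_path:
        return False, (f"invalidated: validator checked {claim.validated_path}, "
                       f"but the run changed {claim.changed_path}")
    return True, "accepted"


# Usage: a validator-path mistake is caught before the claim is counted.
claim = SuccessClaim(run_id="example-run",
                     artifacts=["artifacts/test_output.txt"],
                     changed_path="src/parser/lexer.py",
                     validated_path="src/parser/lexer_old.py")
print(review(claim))
```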
Supervision did not merely improve answers. It improved trust.