Six failure families. Each pairs a documented Phoenix incident with a second, independent evidence layer: supervisor/implementation operational logs kept outside git.
Git history records file changes and final states. It does not reliably capture when drift was first noticed, why a run was invalidated, or how false-success conditions were detected. The operational logs preserve that reasoning.
**Tool-surface drift.** Source: V4_SKI_CHALET_ASSESSMENT_REPORT.md
In all three recorded V4 ski-chalet rows, the local model preserved the user's intent but immediately proposed non-Phoenix execution paths on the first pass:
- proposed a Wikipedia-style lookup in place of the deterministic PlayerAgent tool
- proposed TennisStatsPro; gave a directional answer without grounded tool execution
- proposed TennisRankingsAPI; described expected output generically instead of executing get_career_trajectory

The failure pattern in each row was identical: correct intent, wrong execution posture, immediate reach for generic or invented tools.
Supervisor notes confirm that the redirect in each row was explicit and bounded — candidate domain plus candidate tool set — and that the local model did not require a second correction after the redirect. Final answers were grounded in executed deterministic tool outputs, not confabulated summaries.
**Supervisor response:** one structured redirect per row (do not use web or invented tools; use only deterministic Project Phoenix tools; candidate domain X; candidate tools [A, B, C]). No semantic takeover was required.
**Preventive standard:** hold the candidate-tool list at session start. Make the allowed execution surface explicit in the harness prompt rather than expecting the local model to infer it autonomously.
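A minimal sketch of this standard is shown below, assuming a hypothetical harness helper (`build_session_prompt`) and illustrative tool names; only get_career_trajectory appears in the source material, and the real Phoenix harness interface is not documented here.

```python
# Hypothetical sketch: declare the allowed execution surface at session start and
# render it into the harness prompt, rather than letting the model infer it.
ALLOWED_TOOLS = {
    # Illustrative tool names; only get_career_trajectory appears in the source record.
    "tennis": ["get_player_profile", "get_career_trajectory", "get_match_stats"],
}

def build_session_prompt(domain: str) -> str:
    """Embed the candidate domain and candidate tool list in the system prompt."""
    tools = ", ".join(ALLOWED_TOOLS[domain])
    return (
        "Use only deterministic Project Phoenix tools. Do not use web or invented tools.\n"
        f"Candidate domain: {domain}\n"
        f"Candidate tools: {tools}"
    )

def assert_on_surface(domain: str, proposed_tool: str) -> None:
    """Reject any proposed tool call that drifts off the declared surface."""
    if proposed_tool not in ALLOWED_TOOLS[domain]:
        raise ValueError(f"off-surface tool proposed: {proposed_tool!r}")

print(build_session_prompt("tennis"))
assert_on_surface("tennis", "get_career_trajectory")  # passes; "TennisStatsPro" would not
```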
**Summit fever.** Source: WHITE_PAPER_LOCAL_VS_CLAUDE_TENNIS_DOMAINS.md, section 2.1
Before the clean V1–V3 construction matrix was defined, the benchmark work drifted into larger task packs, broader domain expansion, and heavy-query runtime confounding. The signal was already sufficient to answer the construction question — but the project kept climbing anyway. The paper explicitly names this: "That failure mode was named summit fever: the project kept climbing after the signal was already sufficient."
Comparability weakened, runtime confounds grew, and the earlier characterization of supervisor overhead could no longer be read cleanly against the clean-matrix results.
The clean matrix was defined in response to this failure. V3 was explicitly separated from the baseline and normalized into a bounded 12-task pack rather than folded into an expanding experiment. The stop conditions introduced for the clean matrix are the artifact the summit fever episode left behind.
**Supervisor response:** introduced explicit stop conditions before the clean matrix began; separated V3 as a distinct bounded escalation experiment with fixed pack sizes; froze comparison scope before execution.
**Preventive standard:** define stop conditions before the experiment starts. Fix scope at the outset; scope corrections mid-run require reclassification of the work already done, not re-labeling.
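A minimal sketch of "stop conditions defined before the run," assuming a hypothetical frozen `ExperimentSpec`; the field names and the run ceiling are illustrative, not the actual V3 spec.

```python
# Hypothetical sketch: scope and stop conditions are frozen before any row is run.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentSpec:
    name: str
    task_pack_size: int   # bounded pack, e.g. a fixed 12-task escalation pack
    max_runs: int         # hard ceiling; exceeding it is a stop, not a judgment call
    question: str         # the single question this experiment is allowed to answer

    def should_stop(self, runs_done: int, question_answered: bool) -> bool:
        """Objective stop check: no 'keep climbing' once either trigger fires."""
        return question_answered or runs_done >= self.max_runs

spec = ExperimentSpec(
    name="v3_bounded_escalation",
    task_pack_size=12,
    max_runs=12,
    question="construction comparison on the clean matrix",
)
assert spec.should_stop(runs_done=3, question_answered=True)
```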
**False success / validator-path mismatch.** Source: WHITE_PAPER_LOCAL_VS_CLAUDE_TENNIS_DOMAINS.md, sections 3.4 and 6
The initial V4 rows were gathered against a misframed question — generic domain usability — rather than the intended ski-chalet pattern. The rows appeared to produce useful V4 evidence, but they were testing the wrong thing. A system that passes a usability check when a strong assistant operates directly is not the same as a system that passes when the strong assistant must work through the local model.
When the framing error was caught, the misframed rows were reclassified to control-only status. A second instance: rows were labeled SOFT_PASS rather than PASS precisely to prevent the same error — the local model recovered correctly, but it did not stay on the Phoenix tool surface autonomously, so an unqualified pass label would have overstated the result.
Supervisor notes confirm the explicit reclassification decision and the reasoning: the direct-control rows could not stand as ski-chalet evidence because the experimental question was different. The SOFT_PASS label was an explicit supervisor decision, recorded to prevent the result from being read as autonomous stability when it was not.
**Supervisor response:** reclassified misframed rows as control-only evidence; corrected the experimental frame; re-ran rows explicitly inside the ski-chalet setup; applied SOFT_PASS rather than PASS where autonomous stability was not demonstrated.
**Preventive standard:** freeze the experimental frame before gathering rows. Framing corrections require explicit reclassification, not re-labeling; a row cannot be promoted retroactively into evidence for a question it was not designed to test.
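One way to make this standard mechanical, sketched under assumptions: PASS, SOFT_PASS, and control-only status come from the record, but the `RowStatus` enum and the `label_row` decision rule below are hypothetical, not the recorded supervisor procedure.

```python
# Hypothetical sketch: row status is an explicit label, and reclassification is
# one-way: a control-only row is never promoted back into evidence.
from enum import Enum

class RowStatus(Enum):
    PASS = "pass"                  # grounded result, autonomous tool-surface stability
    SOFT_PASS = "soft_pass"        # recovered after a redirect; not autonomous
    CONTROL_ONLY = "control_only"  # gathered under a different frame; not ski-chalet evidence

def label_row(frame_matches: bool, stayed_on_surface: bool, needed_redirect: bool) -> RowStatus:
    """Illustrative decision rule, not the recorded supervisor procedure."""
    if not frame_matches:
        return RowStatus.CONTROL_ONLY
    if stayed_on_surface and not needed_redirect:
        return RowStatus.PASS
    return RowStatus.SOFT_PASS

assert label_row(frame_matches=True, stayed_on_surface=True, needed_redirect=True) is RowStatus.SOFT_PASS
```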
**Bad context selection.** Source: docs/RUN_TRACE_PROOF_OF_USE_001.md; ShowcaseAgent routing benchmark
During forced-LLM routing evaluation of ShowcaseAgent, the trace layer surfaced two structurally coherent routing failures — correct query intent, wrong domain selected:
"explain lasso vs ridge regression" → routed to WQ; expected Stan"show teaching complexity score" → routed to Stan; expected ParableAgentThe system was not confused or incoherent. It selected a plausible domain, committed to it, and the run proceeded. The trace note records: "These do not look like random failures. They look like domain-boundary ambiguity."
Without the trace layer, these would have remained as isolated row failures in the benchmark CSV. The trace grouped them under a single failure code (routing_error) and exposed them as a replayable miss family. The useful unit was not "fix two rows independently" but "inspect the router seam between Stan vs WQ and ParableAgent vs Stan."
The run trace summary artifact preserved the exact fields needed for replay: run_id, scenario_id, input_query, selected_route, raw_failure_reason. The operational log is the second evidence layer, confirming that the miss family was systematic rather than random.
**Supervisor response:** the trace layer was built specifically to surface this class of failure. Rather than fixing individual rows, the router seam was identified as the repair target, and no re-run was attempted until the seam boundary was understood.
**Preventive standard:** a structured trace layer can convert isolated benchmark misses into replayable miss families. Bad context selection is not always visible as a failure (the run completes and the output is internally coherent), so a trace layer that preserves selection reasoning is the earliest detection point.
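A minimal sketch of how the preserved trace fields can be grouped into miss families. The `TraceRecord` type, the identifier values, and the `expected_route` field name are assumptions; the other field names, the routing_error code, and the two example queries come from the record above, and the grouping logic is illustrative rather than the actual trace implementation.

```python
# Hypothetical sketch: group preserved trace records into replayable miss families.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TraceRecord:
    run_id: str
    scenario_id: str
    input_query: str
    selected_route: str
    expected_route: str       # assumed field name; the record does track the expected domain
    raw_failure_reason: str
    failure_code: str         # e.g. "routing_error"

def miss_families(records: list[TraceRecord]) -> dict[str, list[TraceRecord]]:
    """One family per failure code, so the repair target is a seam, not a row."""
    families: dict[str, list[TraceRecord]] = defaultdict(list)
    for record in records:
        families[record.failure_code].append(record)
    return families

rows = [
    TraceRecord("run_001", "s_17", "explain lasso vs ridge regression",
                "WQ", "Stan", "expected Stan, selected WQ", "routing_error"),
    TraceRecord("run_001", "s_23", "show teaching complexity score",
                "Stan", "ParableAgent", "expected ParableAgent, selected Stan", "routing_error"),
]
# Two isolated row failures collapse into one replayable family pointing at the router seam.
assert len(miss_families(rows)["routing_error"]) == 2
```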
**Doom loops.** Source: docs/DOOM_LOOP_ESCALATION_RECORD.md; experiments/tennis/AGENT_COMMUNICATION_LOG.jsonl
DOOM_LOOP_ESCALATION_RECORD.md documents a formal escalation structure built in response to an observed pattern of an agent repeating work without meaningful state gain, and it preserves representative logged escalations.
The tennis communication log records a related pattern at a lower intensity. A validator-interface rerun was needed after repeated --task_id flags only exercised the last task — the same command re-executed with the same flag repeatedly without correction, producing the same partial result each time. The supervisor subsequently corrected manual_interventions from 0 to 1 to reflect the required intervention.
A second instance across ski-chalet run logs: every use-check query required local-only schema correction before the plan stayed on the deterministic MatchAgent tool surface — a near-loop where each query required the same correction before the agent could proceed.
The doom loop escalation format itself is the operational artifact — it was created specifically because the pattern was recurring enough to need a durable escalation record. Communication log timestamps are the second layer confirming real instances.
**Supervisor response:** dejavu flags the suspected loop; the supervisor takes control; implementation becomes the explicit execution lane. Artifacts are preserved before the strategy changes. The escalation record structure ensures the reasoning is recorded outside git.
**Preventive standard:** do not record every hesitation or retry as a doom loop; the threshold is no visible state gain across meaningful action steps. When that threshold is crossed, preserve current artifacts first, then change strategy. The worker's self-summary is the key diagnostic: if it cannot describe what changed, the loop is confirmed.
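A minimal sketch of the "no visible state gain" threshold as an objective check, assuming a hypothetical artifact fingerprint and a three-step window; the real dejavu detection logic is not documented here.

```python
# Hypothetical sketch: "no visible state gain across N meaningful steps" as an
# objective trigger, so ordinary retries and hesitations are not flagged.
import hashlib

def state_fingerprint(artifacts: dict[str, str]) -> str:
    """Fingerprint of observable state (e.g. artifact contents); an unchanged hash means no gain."""
    blob = "\n".join(f"{name}:{content}" for name, content in sorted(artifacts.items()))
    return hashlib.sha256(blob.encode()).hexdigest()

def doom_loop_suspected(fingerprints: list[str], window: int = 3) -> bool:
    """True when the last `window` meaningful action steps left state unchanged."""
    recent = fingerprints[-window:]
    return len(recent) == window and len(set(recent)) == 1

history = [state_fingerprint({"plan.md": "draft 1"})]
history += [state_fingerprint({"plan.md": "draft 2"})] * 3   # three steps, no state gain
assert doom_loop_suspected(history)  # escalate: preserve artifacts first, then change strategy
```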
**Premature closure.** Source: experiments/tennis/AGENT_COMMUNICATION_LOG.jsonl; V3 bounded execution spec directive
On 2026-03-12, implementation logged run 20260312T013903Z as INVALID_CLAIM:
The run passed the surface validator check. The trusted execution path — full artifact preservation through the use-check phase — did not complete. The result looked done; it was not trustworthy.
Two runs were subsequently recorded as PARTIAL rather than allowed to carry any completion status:
- 20260312T033426Z: MatchAgent V1 and V2 both passed staged-artifact validation, but the use check stopped before any complete tool_results/final_answer artifacts were preserved. Validation passed; the use phase did not finish.
- 20260312T042815Z: progress reached only V1 raw artifact generation before the run stopped. No staged validation, V2 generation, or use-check execution occurred.

Both runs had real progress. Neither was trustworthy as a result. The supervisor named them PARTIAL rather than accepting surface-level indicators as completion.
Before V3 runs were executed, the supervisor issued a blocking directive: "runtime-based PARTIAL must require an objective trigger threshold rather than discretionary declaration" and "claimed completed prefix must be revalidated on the final artifact version for the row." The spec was tightened before the pattern could corrupt V3 data.
The communication log timestamps are the operational record. The INVALID_CLAIM classification, the PARTIAL labels, and the V3 spec directive are all explicit supervisor judgments recorded outside git.
**Supervisor response:** any run that passed a surface check but lacked full artifact preservation was reclassified, never promoted to completion status. INVALID_CLAIM was used when the false-positive path needed to be preserved as a runbook lesson; PARTIAL was used when real progress existed but the result was not trustworthy.
**Preventive standard:** a surface validator exit code is not sufficient for completion. A result is complete only when the full trusted execution path, including artifact preservation through the final phase, has been confirmed. Stopping discipline must be objective: a PARTIAL threshold requires a defined trigger, not a discretionary call.
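A minimal sketch of completion classification under this standard. The artifact filenames, the `classify_run` signature, the trigger parameter, and the COMPLETE label are assumptions layered on the documented PARTIAL and INVALID_CLAIM distinctions; the record names tool_results and final_answer artifacts but not these exact paths.

```python
# Hypothetical sketch: completion is an artifact-backed claim, not a validator exit code.
from enum import Enum
from pathlib import Path

class RunStatus(Enum):
    COMPLETE = "complete"            # assumed label for the fully trusted path
    PARTIAL = "partial"              # real progress, objective trigger, result not trustworthy
    INVALID_CLAIM = "invalid_claim"  # surface check passed but required artifacts are missing

# Illustrative filenames; the record names tool_results and final_answer artifacts.
REQUIRED_ARTIFACTS = ["tool_results.json", "final_answer.json"]

def classify_run(run_dir: Path, validator_passed: bool, partial_trigger_hit: bool) -> RunStatus:
    """partial_trigger_hit must come from a defined trigger, never a discretionary call."""
    artifacts_present = all((run_dir / name).exists() for name in REQUIRED_ARTIFACTS)
    if validator_passed and artifacts_present:
        return RunStatus.COMPLETE
    if validator_passed and not artifacts_present:
        return RunStatus.INVALID_CLAIM      # looked done; the trusted path did not finish
    if partial_trigger_hit:
        return RunStatus.PARTIAL
    raise ValueError("not a completion-adjacent outcome; record it as a plain failure")
```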
| Failure Family | Observed Incident | Supervisor Response | Preventive Standard |
|---|---|---|---|
| Tool-surface drift | V4 ski-chalet: local model proposed Wikipedia / invented tools in all 3 rows despite correct user intent | One structured redirect per row: candidate domain + candidate tool set; no semantic takeover | Hold candidate-tool list at session start; make execution surface explicit in harness prompt |
| Summit fever | Pre-clean-matrix benchmark expanded into larger packs and runtime confounds after construction signal was already sufficient | Introduced explicit stop conditions; separated V3 as a bounded 12-task escalation with fixed scope | Define stop conditions before the experiment starts; scope corrections mid-run require reclassification |
| False success / validator-path mismatch | Initial V4 rows gathered against misframed question (generic usability vs ski-chalet pattern); appeared to confirm progress but tested the wrong thing | Reclassified misframed rows as control-only; re-ran true ski-chalet rows; applied SOFT_PASS where autonomous stability was not demonstrated | Freeze the experimental frame before gathering rows; a row cannot be promoted retroactively into evidence for a different question |
| Bad context selection | ShowcaseAgent routing: "lasso vs ridge" → WQ (expected Stan); "teaching complexity score" → Stan (expected ParableAgent) — intent correct, domain wrong | Trace layer surfaced both as a routing_error family; repair target identified as the router seam, not individual rows | Build a trace layer that preserves selection reasoning; isolated row failures may be a miss family in disguise |
| Doom loops | Same patch retried repeatedly with no state gain; validator-interface repeated with same flag returning same partial result; schema correction required on every use-check query | dejavu flags the loop; supervisor takes control; artifacts preserved before strategy changes | Threshold is no visible state gain across meaningful actions; preserve artifacts first, then change strategy — do not retry the same path |
| Premature closure | Run 20260312T013903Z passed surface validator but lacked use-check artifact preservation → INVALID_CLAIM; two PARTIAL runs had V1/V2 pass but use phase incomplete | INVALID_CLAIM preserved the false-positive path as a runbook lesson; PARTIAL labels prevented incomplete runs from carrying completion status | Surface validator exit code is not completion — full artifact preservation through the final phase is required; PARTIAL threshold must be objective, not discretionary |
The Phoenix material points toward a specific operating discipline: declare the execution surface before the session starts, fix scope and stop conditions before the experiment starts, freeze the experimental frame before gathering evidence, preserve selection reasoning in a trace layer, escalate loops on an objective threshold, and treat completion as a claim that must be backed by preserved artifacts.
These practices matter because accountability is human. The team that selects the model, builds the harness, defines the process, and accepts the output owns the operational result.
Lessons learned and standards are not optional. They are the primary path to trustworthy use.