Six failure families. Each pairs a documented Phoenix incident with a second, independent evidence layer: supervisor/implementation operational logs kept outside git.
Git history records file changes and final states. It does not reliably capture when drift was first noticed, why a run was invalidated, or how false-success conditions were detected. The operational logs preserve that reasoning.
**Tool-surface drift.** Source: V4_SKI_CHALET_ASSESSMENT_REPORT.md
In all three recorded V4 ski-chalet rows, the local model preserved the user's intent but immediately proposed non-Phoenix execution paths on the first pass:
- proposed a Wikipedia-style lookup in place of the deterministic PlayerAgent tool
- proposed TennisStatsPro; gave a directional answer without grounded tool execution
- proposed TennisRankingsAPI; described expected output generically instead of executing get_career_trajectory

The failure pattern in each row was identical: correct intent, wrong execution posture, immediate reach for generic or invented tools.
Supervisor notes confirm that the redirect in each row was explicit and bounded — candidate domain plus candidate tool set — and that the local model did not require a second correction after the redirect. Final answers were grounded in executed deterministic tool outputs, not confabulated summaries.
**Supervisor response:** one structured redirect per row (do not use web or invented tools; use only deterministic Project Phoenix tools; candidate domain X; candidate tools [A, B, C]). No semantic takeover was required.
**Preventive standard:** hold the candidate-tool list at session start. Make the allowed execution surface explicit in the harness prompt rather than expecting the local model to infer it autonomously.
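A minimal sketch of this standard is shown below, assuming a hypothetical harness helper (`build_session_prompt`) and illustrative tool names; only get_career_trajectory appears in the source material, and the real Phoenix harness interface is not documented here.

```python
# Hypothetical sketch: declare the allowed execution surface at session start and
# render it into the harness prompt, rather than letting the model infer it.
ALLOWED_TOOLS = {
    # Illustrative tool names; only get_career_trajectory appears in the source record.
    "tennis": ["get_player_profile", "get_career_trajectory", "get_match_stats"],
}

def build_session_prompt(domain: str) -> str:
    """Embed the candidate domain and candidate tool list in the system prompt."""
    tools = ", ".join(ALLOWED_TOOLS[domain])
    return (
        "Use only deterministic Project Phoenix tools. Do not use web or invented tools.\n"
        f"Candidate domain: {domain}\n"
        f"Candidate tools: {tools}"
    )

def assert_on_surface(domain: str, proposed_tool: str) -> None:
    """Reject any proposed tool call that drifts off the declared surface."""
    if proposed_tool not in ALLOWED_TOOLS[domain]:
        raise ValueError(f"off-surface tool proposed: {proposed_tool!r}")

print(build_session_prompt("tennis"))
assert_on_surface("tennis", "get_career_trajectory")  # passes; "TennisStatsPro" would not
```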
**Summit fever.** Source: WHITE_PAPER_LOCAL_VS_CLAUDE_TENNIS_DOMAINS.md, section 2.1
Before the clean V1–V3 construction matrix was defined, the benchmark work drifted into larger task packs, broader domain expansion, and heavy-query runtime confounding. The signal was already sufficient to answer the construction question — but the project kept climbing anyway. The paper explicitly names this: "That failure mode was named summit fever: the project kept climbing after the signal was already sufficient."
Comparability weakened, runtime confounds grew, and the earlier characterization of supervisor overhead could no longer be read cleanly against the clean-matrix results.
The clean matrix was defined in response to this failure. V3 was explicitly separated from the baseline and normalized into a bounded 12-task pack rather than folded into an expanding experiment. The stop conditions introduced for the clean matrix are the artifact the summit fever episode left behind.
**Supervisor response:** introduced explicit stop conditions before the clean matrix began; separated V3 as a distinct bounded escalation experiment with fixed pack sizes; froze comparison scope before execution.
**Preventive standard:** define stop conditions before the experiment starts. Fix scope at the outset; scope corrections mid-run require reclassification of the work already done, not re-labeling.
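A minimal sketch of "stop conditions defined before the run," assuming a hypothetical frozen `ExperimentSpec`; the field names and the run ceiling are illustrative, not the actual V3 spec.

```python
# Hypothetical sketch: scope and stop conditions are frozen before any row is run.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentSpec:
    name: str
    task_pack_size: int   # bounded pack, e.g. a fixed 12-task escalation pack
    max_runs: int         # hard ceiling; exceeding it is a stop, not a judgment call
    question: str         # the single question this experiment is allowed to answer

    def should_stop(self, runs_done: int, question_answered: bool) -> bool:
        """Objective stop check: no 'keep climbing' once either trigger fires."""
        return question_answered or runs_done >= self.max_runs

spec = ExperimentSpec(
    name="v3_bounded_escalation",
    task_pack_size=12,
    max_runs=12,
    question="construction comparison on the clean matrix",
)
assert spec.should_stop(runs_done=3, question_answered=True)
```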
**False success / validator-path mismatch.** Source: WHITE_PAPER_LOCAL_VS_CLAUDE_TENNIS_DOMAINS.md, sections 3.4 and 6
The initial V4 rows were gathered against a misframed question — generic domain usability — rather than the intended ski-chalet pattern. The rows appeared to produce useful V4 evidence, but they were testing the wrong thing. A system that passes a usability check when a strong assistant operates directly is not the same as a system that passes when the strong assistant must work through the local model.
When the framing error was caught, the misframed rows were reclassified to control-only status. A second instance: rows were labeled SOFT_PASS rather than PASS precisely to prevent the same error — the local model recovered correctly, but it did not stay on the Phoenix tool surface autonomously, so an unqualified pass label would have overstated the result.
Supervisor notes confirm the explicit reclassification decision and the reasoning: the direct-control rows could not stand as ski-chalet evidence because the experimental question was different. The SOFT_PASS label was an explicit supervisor decision, recorded to prevent the result from being read as autonomous stability when it was not.
**Supervisor response:** reclassified misframed rows as control-only evidence; corrected the experimental frame; re-ran rows explicitly inside the ski-chalet setup; applied SOFT_PASS rather than PASS where autonomous stability was not demonstrated.
**Preventive standard:** freeze the experimental frame before gathering rows. Framing corrections require explicit reclassification, not re-labeling; a row cannot be promoted retroactively into evidence for a question it was not designed to test.
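One way to make this standard mechanical, sketched under assumptions: PASS, SOFT_PASS, and control-only status come from the record, but the `RowStatus` enum and the `label_row` decision rule below are hypothetical, not the recorded supervisor procedure.

```python
# Hypothetical sketch: row status is an explicit label, and reclassification is
# one-way: a control-only row is never promoted back into evidence.
from enum import Enum

class RowStatus(Enum):
    PASS = "pass"                  # grounded result, autonomous tool-surface stability
    SOFT_PASS = "soft_pass"        # recovered after a redirect; not autonomous
    CONTROL_ONLY = "control_only"  # gathered under a different frame; not ski-chalet evidence

def label_row(frame_matches: bool, stayed_on_surface: bool, needed_redirect: bool) -> RowStatus:
    """Illustrative decision rule, not the recorded supervisor procedure."""
    if not frame_matches:
        return RowStatus.CONTROL_ONLY
    if stayed_on_surface and not needed_redirect:
        return RowStatus.PASS
    return RowStatus.SOFT_PASS

assert label_row(frame_matches=True, stayed_on_surface=True, needed_redirect=True) is RowStatus.SOFT_PASS
```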
**Bad context selection.** Source: docs/RUN_TRACE_PROOF_OF_USE_001.md; ShowcaseAgent routing benchmark
During forced-LLM routing evaluation of ShowcaseAgent, the trace layer surfaced two structurally coherent routing failures — correct query intent, wrong domain selected:
"explain lasso vs ridge regression" → routed to WQ; expected Stan"show teaching complexity score" → routed to Stan; expected ParableAgentThe system was not confused or incoherent. It selected a plausible domain, committed to it, and the run proceeded. The trace note records: "These do not look like random failures. They look like domain-boundary ambiguity."
Without the trace layer, these would have remained as isolated row failures in the benchmark CSV. The trace grouped them under a single failure code (routing_error) and exposed them as a replayable miss family. The useful unit was not "fix two rows independently" but "inspect the router seam between Stan vs WQ and ParableAgent vs Stan."
The run trace summary artifact preserved the exact fields needed for replay: run_id, scenario_id, input_query, selected_route, raw_failure_reason. The operational log is the second evidence layer, confirming that the miss family was systematic rather than random.
**Supervisor response:** the trace layer was built specifically to surface this class of failure. Rather than fixing individual rows, the router seam was identified as the repair target, and no re-run was attempted until the seam boundary was understood.
**Preventive standard:** a structured trace layer can convert isolated benchmark misses into replayable miss families. Bad context selection is not always visible as a failure (the run completes and the output is internally coherent), so a trace layer that preserves selection reasoning is the earliest detection point.
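A minimal sketch of how the preserved trace fields can be grouped into miss families. The `TraceRecord` type, the identifier values, and the `expected_route` field name are assumptions; the other field names, the routing_error code, and the two example queries come from the record above, and the grouping logic is illustrative rather than the actual trace implementation.

```python
# Hypothetical sketch: group preserved trace records into replayable miss families.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TraceRecord:
    run_id: str
    scenario_id: str
    input_query: str
    selected_route: str
    expected_route: str       # assumed field name; the record does track the expected domain
    raw_failure_reason: str
    failure_code: str         # e.g. "routing_error"

def miss_families(records: list[TraceRecord]) -> dict[str, list[TraceRecord]]:
    """One family per failure code, so the repair target is a seam, not a row."""
    families: dict[str, list[TraceRecord]] = defaultdict(list)
    for record in records:
        families[record.failure_code].append(record)
    return families

rows = [
    TraceRecord("run_001", "s_17", "explain lasso vs ridge regression",
                "WQ", "Stan", "expected Stan, selected WQ", "routing_error"),
    TraceRecord("run_001", "s_23", "show teaching complexity score",
                "Stan", "ParableAgent", "expected ParableAgent, selected Stan", "routing_error"),
]
# Two isolated row failures collapse into one replayable family pointing at the router seam.
assert len(miss_families(rows)["routing_error"]) == 2
```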
**Doom loops.** Source: docs/DOOM_LOOP_ESCALATION_RECORD.md; experiments/tennis/AGENT_COMMUNICATION_LOG.jsonl
DOOM_LOOP_ESCALATION_RECORD.md documents a formal escalation structure built in response to an observed pattern of an agent repeating work without meaningful state gain, and it preserves representative logged escalations.
The tennis communication log records a related pattern at a lower intensity. A validator-interface rerun was needed after repeated --task_id flags only exercised the last task — the same command re-executed with the same flag repeatedly without correction, producing the same partial result each time. The supervisor subsequently corrected manual_interventions from 0 to 1 to reflect the required intervention.
A second instance across ski-chalet run logs: every use-check query required local-only schema correction before the plan stayed on the deterministic MatchAgent tool surface — a near-loop where each query required the same correction before the agent could proceed.
The doom loop escalation format itself is the operational artifact — it was created specifically because the pattern was recurring enough to need a durable escalation record. Communication log timestamps are the second layer confirming real instances.
**Supervisor response:** dejavu flags the suspected loop; the supervisor takes control; implementation becomes the explicit execution lane. Artifacts are preserved before the strategy changes. The escalation record structure ensures the reasoning is recorded outside git.
**Preventive standard:** do not record every hesitation or retry as a doom loop; the threshold is no visible state gain across meaningful action steps. When that threshold is crossed, preserve current artifacts first, then change strategy. The worker's self-summary is the key diagnostic: if it cannot describe what changed, the loop is confirmed.
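A minimal sketch of the "no visible state gain" threshold as an objective check, assuming a hypothetical artifact fingerprint and a three-step window; the real dejavu detection logic is not documented here.

```python
# Hypothetical sketch: "no visible state gain across N meaningful steps" as an
# objective trigger, so ordinary retries and hesitations are not flagged.
import hashlib

def state_fingerprint(artifacts: dict[str, str]) -> str:
    """Fingerprint of observable state (e.g. artifact contents); an unchanged hash means no gain."""
    blob = "\n".join(f"{name}:{content}" for name, content in sorted(artifacts.items()))
    return hashlib.sha256(blob.encode()).hexdigest()

def doom_loop_suspected(fingerprints: list[str], window: int = 3) -> bool:
    """True when the last `window` meaningful action steps left state unchanged."""
    recent = fingerprints[-window:]
    return len(recent) == window and len(set(recent)) == 1

history = [state_fingerprint({"plan.md": "draft 1"})]
history += [state_fingerprint({"plan.md": "draft 2"})] * 3   # three steps, no state gain
assert doom_loop_suspected(history)  # escalate: preserve artifacts first, then change strategy
```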
**Premature closure.** Source: experiments/tennis/AGENT_COMMUNICATION_LOG.jsonl; V3 bounded execution spec directive
On 2026-03-12, implementation logged run 20260312T013903Z as INVALID_CLAIM:
The run passed the surface validator check. The trusted execution path — full artifact preservation through the use-check phase — did not complete. The result looked done; it was not trustworthy.
Two runs were subsequently recorded as PARTIAL rather than allowed to carry any completion status:
- 20260312T033426Z: MatchAgent V1 and V2 both passed staged-artifact validation, but the use check stopped before any complete tool_results/final_answer artifacts were preserved. Validation passed; the use phase did not finish.
- 20260312T042815Z: progress reached only V1 raw artifact generation before the run stopped. No staged validation, V2 generation, or use-check execution occurred.

Both runs had real progress. Neither was trustworthy as a result. The supervisor named them PARTIAL rather than accepting surface-level indicators as completion.
Before V3 runs were executed, the supervisor issued a blocking directive: "runtime-based PARTIAL must require an objective trigger threshold rather than discretionary declaration" and "claimed completed prefix must be revalidated on the final artifact version for the row." The spec was tightened before the pattern could corrupt V3 data.
The communication log timestamps are the operational record. The INVALID_CLAIM classification, the PARTIAL labels, and the V3 spec directive are all explicit supervisor judgments recorded outside git.
**Supervisor response:** any run that passed a surface check but lacked full artifact preservation was reclassified, never promoted to completion status. INVALID_CLAIM was used when the false-positive path needed to be preserved as a runbook lesson; PARTIAL was used when real progress existed but the result was not trustworthy.
**Preventive standard:** a surface validator exit code is not sufficient for completion. A result is complete only when the full trusted execution path, including artifact preservation through the final phase, has been confirmed. Stopping discipline must be objective: a PARTIAL threshold requires a defined trigger, not a discretionary call.
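A minimal sketch of completion classification under this standard. The artifact filenames, the `classify_run` signature, the trigger parameter, and the COMPLETE label are assumptions layered on the documented PARTIAL and INVALID_CLAIM distinctions; the record names tool_results and final_answer artifacts but not these exact paths.

```python
# Hypothetical sketch: completion is an artifact-backed claim, not a validator exit code.
from enum import Enum
from pathlib import Path

class RunStatus(Enum):
    COMPLETE = "complete"            # assumed label for the fully trusted path
    PARTIAL = "partial"              # real progress, objective trigger, result not trustworthy
    INVALID_CLAIM = "invalid_claim"  # surface check passed but required artifacts are missing

# Illustrative filenames; the record names tool_results and final_answer artifacts.
REQUIRED_ARTIFACTS = ["tool_results.json", "final_answer.json"]

def classify_run(run_dir: Path, validator_passed: bool, partial_trigger_hit: bool) -> RunStatus:
    """partial_trigger_hit must come from a defined trigger, never a discretionary call."""
    artifacts_present = all((run_dir / name).exists() for name in REQUIRED_ARTIFACTS)
    if validator_passed and artifacts_present:
        return RunStatus.COMPLETE
    if validator_passed and not artifacts_present:
        return RunStatus.INVALID_CLAIM      # looked done; the trusted path did not finish
    if partial_trigger_hit:
        return RunStatus.PARTIAL
    raise ValueError("not a completion-adjacent outcome; record it as a plain failure")
```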
| Failure Family | Observed Incident | Supervisor Response | Preventive Standard |
|---|---|---|---|
| Tool-surface drift | V4 ski-chalet: local model proposed Wikipedia / invented tools in all 3 rows despite correct user intent | One structured redirect per row: candidate domain + candidate tool set; no semantic takeover | Hold candidate-tool list at session start; make execution surface explicit in harness prompt |
| Summit fever | Pre-clean-matrix benchmark expanded into larger packs and runtime confounds after construction signal was already sufficient | Introduced explicit stop conditions; separated V3 as a bounded 12-task escalation with fixed scope | Define stop conditions before the experiment starts; scope corrections mid-run require reclassification |
| False success / validator-path mismatch | Initial V4 rows gathered against misframed question (generic usability vs ski-chalet pattern); appeared to confirm progress but tested the wrong thing | Reclassified misframed rows as control-only; re-ran true ski-chalet rows; applied SOFT_PASS where autonomous stability was not demonstrated | Freeze the experimental frame before gathering rows; a row cannot be promoted retroactively into evidence for a different question |
| Bad context selection | ShowcaseAgent routing: "lasso vs ridge" → WQ (expected Stan); "teaching complexity score" → Stan (expected ParableAgent) — intent correct, domain wrong | Trace layer surfaced both as a routing_error family; repair target identified as the router seam, not individual rows | Build a trace layer that preserves selection reasoning; isolated row failures may be a miss family in disguise |
| Doom loops | Same patch retried repeatedly with no state gain; validator-interface repeated with same flag returning same partial result; schema correction required on every use-check query | dejavu flags the loop; supervisor takes control; artifacts preserved before strategy changes | Threshold is no visible state gain across meaningful actions; preserve artifacts first, then change strategy — do not retry the same path |
| Premature closure | Run 20260312T013903Z passed surface validator but lacked use-check artifact preservation → INVALID_CLAIM; two PARTIAL runs had V1/V2 pass but use phase incomplete | INVALID_CLAIM preserved the false-positive path as a runbook lesson; PARTIAL labels prevented incomplete runs from carrying completion status | Surface validator exit code is not completion — full artifact preservation through the final phase is required; PARTIAL threshold must be objective, not discretionary |
The Phoenix material points toward a specific operating discipline: declare the execution surface before the session starts, fix scope and stop conditions before the experiment starts, freeze the experimental frame before gathering evidence, preserve selection reasoning in a trace layer, escalate loops on an objective threshold, and treat completion as a claim that must be backed by preserved artifacts.
These practices matter because accountability is human. The team that selects the model, builds the harness, defines the process, and accepts the output owns the operational result.
Lessons learned and standards are not optional. They are the primary path to trustworthy use.