Project Phoenix · Evidence

The Three Addendum Results

TourAgent, ShowcaseAgent, and Role Suitability — read together, not in isolation.

This is a synthesis paper, not a new benchmark program. These results existed before this paper. The contribution is placing them beside each other and reading what they add up to.

TourAgent — Harness Before Exactness

Investigation: Tested whether grounding a local model against a deterministic domain substrate removes outright failure before it produces artifact-tight fidelity. Two models evaluated raw and grounded against the same domain task set.

The question it answers: Does a grounding layer change the answer surface in a meaningful way, even if it does not immediately produce exact outputs?

Wrong-or-Missing Answer Rate: Raw vs. Grounded

Model             | Raw (wrong or missing / 10) | Grounded (wrong or missing / 10) | Change
gemma3:27b        | 4 / 10                      | 0 / 10                           | −4 wrong-or-missing
qwen2.5-coder:32b | 3 / 10                      | 0 / 10                           | −3 wrong-or-missing

Grounding changed the answer surface before exactness improved. Reliability improved before artifact-tight fidelity appeared. The harness did the work, not a model upgrade.
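The grounded condition above can be sketched as a small harness that fails closed: the model may only phrase facts retrieved from a deterministic substrate, and a missing fact becomes an explicit refusal rather than a guess. This is a minimal illustrative sketch, not TourAgent's actual implementation; the fact store, keys, and function names are all hypothetical.

```python
# Hypothetical grounding harness: answer only from a deterministic
# substrate; fail closed when the substrate has no matching fact.
FACT_STORE = {
    "capital_of_region_a": "Northport",   # illustrative domain facts
    "opening_hours": "09:00-17:00",
}

def grounded_answer(query_key, model_answer_fn):
    """Return a grounded answer, or an explicit refusal if ungrounded."""
    fact = FACT_STORE.get(query_key)
    if fact is None:
        # A missing fact is a refusal, not a guess — this is what
        # removes the wrong-or-missing failures before exactness improves.
        return {"status": "no_grounding", "answer": None}
    # The model is constrained to phrasing the retrieved fact.
    return {"status": "grounded", "answer": model_answer_fn(fact)}

# Stand-in for a local model call: it just phrases the grounded fact.
result = grounded_answer("opening_hours", lambda fact: f"The hours are {fact}.")
```

The point of the sketch is the shape, not the lookup: the failure floor is set by the harness's refusal path, independent of which model sits behind `model_answer_fn`.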

What This Tells the Regime Map

In the grounded single-domain use regime, the dominant factor is grounding and deterministic substrate quality — not raw model power. Both models moved to zero failures. What still separates stronger models is higher exactness and less answer softening — but the failure floor was set by the harness, not the model tier.

ShowcaseAgent — Compression Before Broad Execution

Investigation: Tested whether routing and domain compression across a multi-domain portfolio could become useful before every direct-tool execution path became uniform. Four models evaluated against deterministic and model-driven routing paths.

The question it answers: Does the system become navigationally useful before execution symmetry exists across all direct-tool paths?

Routing Accuracy: Deterministic vs. Model-Driven

Routing path  | Model             | Correct / 41 | Accuracy
Deterministic | n/a               | 41 / 41      | 100%
Model-driven  | gemma3:27b        | 39 / 41      | 95%
Model-driven  | gpt-oss:20b       | 39 / 41      | 95%
Model-driven  | llama3.1:8b       | 39 / 41      | 95%
Model-driven  | qwen2.5-coder:32b | 41 / 41      | 100%

Three models of substantially different sizes reached 39/41. The two missed items are semantic boundary cases shared across those models — a function of the query semantics, not of raw model size. The system was navigationally useful long before execution symmetry existed across all direct-tool paths.
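The two routing paths compared above can be sketched as a layered router: a deterministic table is consulted first, and only unmatched queries fall through to a model-driven router. This is an illustrative sketch under assumed names — the keyword table, domain labels, and `route` function are hypothetical, not ShowcaseAgent's actual routing logic.

```python
# Hypothetical two-path router: deterministic table first,
# model-driven fallback only for queries the table cannot place.
ROUTING_TABLE = {
    "invoice": "billing",        # illustrative keyword → domain pairs
    "refund": "billing",
    "deploy": "infrastructure",
    "tour": "showcase",
}

def route(query, model_router=None):
    """Return (domain, path) for a query; path names which layer decided."""
    for keyword, domain in ROUTING_TABLE.items():
        if keyword in query.lower():
            return domain, "deterministic"
    if model_router is not None:
        # Only semantic boundary cases reach the model at all.
        return model_router(query), "model"
    return None, "unrouted"

domain, path = route("Where is my refund?")
```

Under this shape, the deterministic layer's 100% accuracy is structural, and model quality only matters on the residue that falls through — which is why an 8B and a 32B model land on the same score for core routing.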

What This Tells the Regime Map

In the routing and compression regime, the dominant factor is routing structure and compression surface — not which model is strongest. What still separates models is family-specific routing quality on the shared semantic boundary cases — not any gap between an 8B and a 32B model on core routing tasks.

Local Model Role Suitability — Policy Changes the Boundary

Investigation: Tested how model-fit shifts when roles are separated (routing, grounded domain use, exactness-sensitive handoff) and when repair policy is made explicit. Five models across three policy levels and a laptop-threshold lane.

The question it answers: Does role and policy selection change the result before raw model power does?

Laptop Lane — Strict vs. Safe Repair Policy

The laptop lane used lower-threshold hardware, testing whether usefulness survives after the GPU tier drops.

Model       | Strict (/ 6) | Safe repair (/ 6) | Gap closed by repair
qwen2.5:14b | 4 / 6        | 5 / 6             | +1
llama3.1:8b | 1 / 6        | 5 / 6             | +4
gpt-oss:20b | 0 / 6        | 0 / 6             | no change

A tiny trusted repair layer moved llama3.1:8b from 1/6 to 5/6 — the same score as qwen2.5:14b under safe repair. Repair policy narrowed the practical inter-model gap more than adding raw model headroom did. gpt-oss:20b failed to benefit because its errors fell outside the safe-repair boundary.
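A "tiny trusted repair layer" of this kind can be sketched as a small set of mechanical, bounded fixes applied before strict parsing: anything the fixes cannot reach still fails. The specific repairs below (stripping a wrapping code fence, removing trailing commas before a closing brace) are illustrative assumptions, not the actual policy evaluated in the Role Suitability runs.

```python
# Hypothetical safe-repair layer: bounded mechanical fixes applied to
# near-valid JSON output before strict parsing. Errors outside the
# repair boundary still fail — which is why gpt-oss:20b saw no change.
import json
import re

def safe_repair(raw):
    """Apply trusted, mechanical repairs; never invent content."""
    text = raw.strip()
    # Strip a wrapping markdown code fence, if present.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Remove trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return text

def parse(raw, policy="strict"):
    """Parse model output under an explicit policy level."""
    text = raw if policy == "strict" else safe_repair(raw)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # outside the repair boundary: still a failure

# Near-valid output: fails under strict, passes under safe repair.
messy = '```json\n{"score": 5,}\n```'
```

The design point is the boundary, not the regex: repairs are trusted precisely because they are mechanical and content-free, so the policy narrows the inter-model gap only for errors of form, never errors of substance.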

What This Tells the Regime Map

Two regimes in the map are directly supported here. In the repair-assisted pipeline regime, explicit repair policy is the dominant factor — not the model tier. In the laptop-threshold operation regime, role fit and threshold-appropriate model choice are the dominant factors. Hardware is a threshold condition. Once crossed, organization matters more than further headroom.

What the Three Results Add Up To

Each result, read alone, is a useful data point. Read together, they converge on the same lesson from three different angles:

Harness → failure floor

TourAgent: grounding set the failure floor. Both models moved to zero failures regardless of their size gap.

Structure → routing quality

ShowcaseAgent: routing structure drove accuracy. An 8B model matched a 32B model on core routing — both hit the same semantic boundary.

Policy → inter-model gap

Role Suitability: a repair layer closed the gap between models more than a model upgrade would have, for the tasks inside its scope.

The synthesis claim follows: in these regimes, system shape settled the result before raw model power did. The dominant factor was not the model — it was the organization around the model.

See the full regime map →