Five operating regimes. For each: what settled the result, and what still separates stronger models.
The regime map is the paper's core analytical artifact. It does not invent new evidence — it reads the existing evidence in a structured way, asking for each regime: was it raw model power, or system shape, that settled the result?
Each regime is defined by a single main question that the system must answer under real operating constraints.
| Regime | Main success question | Dominant factor so far | What still separates stronger models |
|---|---|---|---|
| Routing and compression | Can the system choose the right domain cleanly? | Routing structure and compression surface | Shared semantic boundary cases and family-specific routing quality |
| Grounded single-domain use | Can the system remove outright failure? | Grounding and deterministic substrate quality | Higher exactness and less answer softening |
| Exactness-sensitive handoff | Can the answer survive a machine-facing contract without repair? | Model fidelity under protocol pressure | Strict schema adherence and lower reject rate |
| Repair-assisted pipeline | How much gap disappears once tiny trusted repair is allowed? | Explicit repair policy | How much semantic drift remains after safe repair is exhausted |
| Laptop-threshold operation | How much usefulness survives after hardware drops? | Role fit and threshold-appropriate model choice | Better low-repair performance and wider stable operating envelope |
This is a synthesis map, not a universal ranking. The "dominant factor" column describes what the Phoenix evidence shows settled the result in each regime. It does not claim these regimes are exhaustive or that the mapping transfers unchanged to other domains.
ShowcaseAgent showed that routing and domain compression across a multi-domain portfolio became useful before every direct-tool execution path became uniform. Three models of substantially different sizes reached 39/41 on the same routing set. The two misses were shared semantic boundary cases — not a function of raw model size. The system was navigationally useful before execution symmetry existed.
The dominant factor was routing structure and compression surface. Stronger models are still differentiated by how they handle shared semantic boundary cases, but the baseline routing task was already settled by structure.
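To make "routing structure and compression surface" concrete, here is a minimal sketch of structure-first routing, assuming domains are compressed to short lexical surfaces that a router can match before any model runs. The domain names and keywords below are hypothetical placeholders; Phoenix's actual routing surface is not reproduced here.

```python
# Hypothetical sketch of structure-first routing: each domain is compressed
# to a small keyword surface, and the router picks a domain by lexical
# overlap before any model is invoked. All names and keywords are invented.

DOMAIN_SURFACE = {
    "tours":    {"tour", "itinerary", "museum", "opening", "hours"},
    "billing":  {"invoice", "refund", "charge", "payment"},
    "showcase": {"demo", "portfolio", "project", "showcase"},
}

def route(request: str) -> str:
    """Pick the domain whose compressed surface overlaps the request most."""
    tokens = set(request.lower().split())
    scores = {d: len(tokens & kws) for d, kws in DOMAIN_SURFACE.items()}
    best = max(scores, key=scores.get)
    # Shared semantic boundary cases score poorly or identically across
    # domains; those are the misses that structure alone does not settle.
    if scores[best] == 0:
        return "fallback"
    return best

print(route("what are the museum opening hours on the tour"))  # -> tours
```

The point of the sketch is that the routing decision is settled by the organization of the surface, not by which model sits behind it.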
TourAgent showed that grounding local models against a deterministic domain substrate moved both tested models from non-zero wrong-or-missing rates to zero, regardless of the size gap between them. gemma3:27b went from 4/10 to 0/10; qwen2.5-coder:32b went from 3/10 to 0/10.
The dominant factor was grounding and substrate quality. What still separates stronger models is higher exactness and less answer softening above the failure floor — but the floor itself was set by the harness.
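A minimal sketch of what grounding against a deterministic substrate means in practice, with a plain dict standing in for TourAgent's domain substrate. The entities, fields, and values are invented for illustration; the shape of the mechanism is what matters.

```python
# Hypothetical grounding sketch: the substrate is the ground truth. The
# model may phrase the answer, but the fact itself is looked up, never
# generated. A missing entry yields a refusal, not an invention, which is
# how the harness sets the failure floor before model power matters.

SUBSTRATE = {
    ("louvre", "opening_hours"): "09:00-18:00, closed Tuesdays",
    ("louvre", "ticket_price"):  "22 EUR",
}

def grounded_answer(entity: str, field: str) -> str:
    fact = SUBSTRATE.get((entity, field))
    if fact is None:
        # Refusing beats inventing: the floor is "missing", never "wrong".
        return f"No substrate entry for {entity}/{field}."
    # The model's only job is phrasing around a fact it cannot alter.
    return f"{field.replace('_', ' ')} for {entity}: {fact}"

print(grounded_answer("louvre", "opening_hours"))
```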
The exactness-sensitive handoff regime is the one where raw model quality still binds most directly. When the downstream consumer is a machine-facing contract with no repair budget, schema adherence and reject rate matter — and these vary by model in ways that the other regimes do not expose.
This is the regime where "upgrade the model" is the most defensible prescription. Orchestration helps — escalation rules and routing can reduce how often the system hits this path — but once inside it, model fidelity under protocol pressure is the dominant factor.
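A minimal sketch of such a contract, assuming a hypothetical JSON schema with zero repair budget. Any deviation, including a type mismatch, is a hard reject; this is the pressure under which model fidelity becomes the binding constraint.

```python
# Hypothetical exactness-sensitive handoff: the downstream consumer is a
# machine-facing contract with no repair budget, so any deviation is a
# hard reject. The schema is invented for illustration.
import json

REQUIRED = {"order_id": str, "quantity": int}

def accept(raw: str) -> dict:
    """Parse model output against a strict contract; reject on any deviation."""
    obj = json.loads(raw)                      # malformed JSON -> exception
    if set(obj) != set(REQUIRED):              # no extra or missing keys
        raise ValueError(f"key mismatch: {sorted(obj)}")
    for key, typ in REQUIRED.items():
        if type(obj[key]) is not typ:          # exact types, no coercion
            raise ValueError(f"{key} must be {typ.__name__}")
    return obj

print(accept('{"order_id": "A-17", "quantity": 3}'))    # passes
# accept('{"order_id": "A-17", "quantity": "3"}')       # rejected: str, not int
```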
The Role Suitability laptop lane showed that a tiny trusted repair layer moved llama3.1:8b from 1/6 to 5/6 under safe repair — the same score as qwen2.5:14b under the same policy. For tasks inside the repair boundary, the practical inter-model gap narrowed more from the repair layer than it would have from a model upgrade.
What still separates stronger models is semantic drift that falls outside the safe-repair scope. gpt-oss:20b failed to benefit at all because its errors were not safe-repairable. That is a structural limit of repair policy, not a reason to abandon it.
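A minimal sketch of what a tiny trusted repair layer could look like: only deterministic, meaning-preserving fixes are applied, so semantic drift passes through untouched and still fails. The repair rules below are illustrative, not the laptop lane's actual policy.

```python
# Hypothetical safe-repair layer. Only repairs that cannot change what the
# model meant are allowed (stripping code fences, dropping trailing commas).
# Semantic errors pass through untouched, which is the structural limit the
# gpt-oss:20b result illustrates.
import json
import re

def safe_repair(raw: str) -> str:
    """Apply only repairs that cannot change what the model meant."""
    text = raw.strip()
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)  # strip code fences
    text = re.sub(r",\s*([}\]])", r"\1", text)            # drop trailing commas
    return text

def parse_with_repair(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(safe_repair(raw))  # may still raise: that is the point

print(parse_with_repair('```json\n{"city": "Paris",}\n```'))  # repaired fine
# parse_with_repair('{"city": "Pairs"}') parses cleanly, but the semantic
# drift ("Pairs") is outside any safe-repair scope.
```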
The laptop lane also showed that usefulness survives hardware reduction when role and model are matched to the threshold. The hardware is a threshold condition. Once crossed, the question becomes which role and which model fit — not whether to add more GPU.
The RTX 3090 was available throughout Phoenix experiments, but the results did not depend on it being the decisive edge. Lower-threshold hardware with the right role assignment already showed real usefulness.
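A minimal sketch of threshold-gated role assignment, treating hardware as a pass/fail condition per role rather than a score to maximize. The VRAM thresholds and role-to-model pairings below are hypothetical placeholders, not Phoenix measurements.

```python
# Hypothetical threshold table: each role has a minimum hardware threshold
# and a model suited to that role at that threshold. Numbers are invented.

ROLE_TABLE = [
    # (role, min_vram_gb, model suited to that role at that threshold)
    ("exactness-handoff", 20, "qwen2.5:14b"),
    ("repair-assisted",    8, "llama3.1:8b"),
    ("routing-only",       6, "llama3.1:8b"),
]

def assign_roles(vram_gb: int) -> list[tuple[str, str]]:
    """Return the (role, model) pairs this machine clears the threshold for."""
    return [(role, model) for role, need, model in ROLE_TABLE if vram_gb >= need]

# A laptop-class GPU still fields useful roles; it just fields fewer of them.
print(assign_roles(8))   # repair-assisted and routing-only survive
print(assign_roles(24))  # all three roles clear the threshold
```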
The three results converge on the same lesson: the dominant factor changes by regime.
In routing and compression, the system became useful when the routing structure was organized — not when the strongest model was selected. In grounded single-domain use, the harness changed the answer surface before the model became exact. In repair-assisted pipelines, explicit policy narrowed the inter-model gap more than adding headroom did. In laptop-threshold operation, role fit and model selection against threshold mattered more than the GPU tier.
The hardware conclusion follows from the same logic. A high-end local GPU can be enough without being the decisive edge forever. Once a competence threshold is crossed, organization often matters more than further raw headroom.
A recent RTX 3090 comparison between a PyTorch path and a fused Numba CUDA kernel produced a result that looked backward on paper: the fused custom kernel lost badly, and an unhealthy CUDA environment had to be discarded before the comparison was even valid. That is not Phoenix evidence — it is a systems reminder that a theoretically stronger primitive can still lose to a better-organized stack.
The interpretive discipline is the same: trust the whole operating result, not just the elegant primitive.
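One way to practice that discipline is to gate any benchmark on an environment sanity check before trusting its numbers. A minimal sketch, assuming PyTorch is installed; the probe is illustrative, not the check used in the comparison above.

```python
# Hypothetical environment sanity gate, run before trusting any kernel
# comparison. An environment that cannot do basic device math voids the
# whole measurement, whichever primitive "should" have won.
import torch

def cuda_env_sane() -> bool:
    """Refuse to benchmark on a CUDA environment that fails a trivial workload."""
    if not torch.cuda.is_available():
        return False
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x                     # trivial workload on the device under test
    torch.cuda.synchronize()      # surface any deferred launch failures
    return bool(torch.isfinite(y).all())

if not cuda_env_sane():
    raise SystemExit("CUDA environment unhealthy; benchmark results would be void.")
```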
This is the operating hierarchy the paper's claim implies. The dominant factor in most Phoenix regimes so far appears at or before step 5; model power is the last step, not the first.
1. Deterministic domain substrate. The ground truth layer; what the model is grounded against.
2. Routing. The layer that decides which domain or tool handles the request. Structure here settles routing-regime outcomes.
3. Grounding. The connection between model and substrate. Grounding quality set the failure floor in TourAgent before model power mattered.
4. Repair policy. Explicit policy for what happens when the model falls short. Repair policy closed the practical inter-model gap in the laptop lane.
5. Role selection. Matching model to role and hardware threshold. Role fit changed the result before model size did in laptop-lane operation.
6. Model power. Raw model capability. Still the binding constraint in exactness-sensitive handoff; not the dominant factor in the other four regimes.
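The hierarchy also implies a diagnostic order: when an operating result is bad, walk the layers top to bottom and stop at the first one that fails. A minimal sketch, with placeholder checks standing in for real probes.

```python
# Hypothetical diagnostic walk over the operating hierarchy. Model power is
# examined last, not first. Each check would be a real probe in practice.

HIERARCHY = [
    "substrate",      # is the ground truth present and correct?
    "routing",        # did the request reach the right domain?
    "grounding",      # is the model actually connected to the substrate?
    "repair_policy",  # is there an explicit policy for shortfalls?
    "role_fit",       # do role, model, and hardware threshold match?
    "model_power",    # only now: is the model itself too weak?
]

def diagnose(checks: dict[str, bool]) -> str:
    """Return the first layer, in hierarchy order, whose check fails."""
    for layer in HIERARCHY:
        if not checks.get(layer, True):
            return layer
    return "healthy"

# With a broken routing layer, the diagnosis never blames the model.
print(diagnose({"routing": False, "model_power": False}))  # -> routing
```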
Most local-model evaluation stops at a flat ranking: which model scored highest on a benchmark. That framing is incomplete once real operating constraints are included — downstream machine-facing contracts, hardware thresholds, repair budgets, and role requirements all change which model actually wins.
The regime map is the useful artifact: it tells a builder where to invest in system organization and where raw model quality is still the binding constraint. That is a more actionable output than a leaderboard row.
For Project Phoenix specifically, this paper is the synthesis step that promotes the three addendum papers from standalone evaluations into a coherent systems claim.
The main lesson of the addendum is not that one local model finally won. It is that the useful unit is not the naked model and not the harness in isolation. It is the organized operating stack.
That stack includes substrate, routing, grounding, repair policy, role selection — and only then model power.
That is where orchestration beats raw model power: not everywhere, not always, but in enough real regimes that flat model rankings are no longer a sufficient description of system quality.