Project Phoenix · Regime Map

The Regime Map

Five operating regimes. For each: what settled the result, and what still separates stronger models.

The regime map is the paper's core analytical artifact. It does not invent new evidence — it reads the existing evidence in a structured way, asking for each regime: was it raw model power, or system shape, that settled the result?

The Five Regimes

Each regime is defined by a single main question that the system must answer under real operating constraints.

| Regime | Main success question | Dominant factor so far | What still separates stronger models |
| --- | --- | --- | --- |
| Routing and compression | Can the system choose the right domain cleanly? | Routing structure and compression surface | Shared semantic boundary cases and family-specific routing quality |
| Grounded single-domain use | Can the system remove outright failure? | Grounding and deterministic substrate quality | Higher exactness and less answer softening |
| Exactness-sensitive handoff | Can the answer survive a machine-facing contract without repair? | Model fidelity under protocol pressure | Strict schema adherence and lower reject rate |
| Repair-assisted pipeline | How much of the gap disappears once a tiny trusted repair layer is allowed? | Explicit repair policy | How much semantic drift remains after safe repair is exhausted |
| Laptop-threshold operation | How much usefulness survives after hardware drops? | Role fit and threshold-appropriate model choice | Better low-repair performance and wider stable operating envelope |

This is a synthesis map, not a universal ranking. The "dominant factor" column describes what the Phoenix evidence shows settled the result in each regime. It does not claim these regimes are exhaustive or that the mapping transfers unchanged to other domains.

Regime by Regime

Routing and Compression

ShowcaseAgent showed that routing and domain compression across a multi-domain portfolio were useful before every direct-tool execution path became uniform. Three models of substantially different sizes reached 39/41 on the same routing set. The two misses were shared semantic boundary cases, not a function of raw model size. The system was navigationally useful before execution symmetry existed.

The dominant factor was routing structure and compression surface. Stronger models are still differentiated by how they handle shared semantic boundary cases, but the baseline routing task was already settled by structure.
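The regime's logic can be sketched as a structure-first router: a compressed keyword surface per domain settles the clean cases, and ties are surfaced as boundary cases rather than silently resolved. The domain names, keyword sets, and tie rule below are illustrative assumptions, not the ShowcaseAgent implementation.

```python
# Minimal sketch of structure-first routing. Domains and keyword
# surfaces are illustrative assumptions, not Phoenix code.
DOMAIN_SURFACE = {
    "travel": {"tour", "itinerary", "hotel", "flight"},
    "code": {"function", "compile", "bug", "refactor"},
    "docs": {"summarize", "report", "draft", "memo"},
}

def route(query: str) -> str:
    """Pick the domain whose compressed keyword surface best matches.

    Ties and zero-overlap queries (the shared semantic boundary cases)
    are returned as 'ambiguous' instead of being silently resolved:
    structure, not model size, settles the clean cases.
    """
    tokens = set(query.lower().split())
    scores = {d: len(tokens & kw) for d, kw in DOMAIN_SURFACE.items()}
    best = max(scores, key=scores.get)
    top = scores[best]
    if top == 0 or list(scores.values()).count(top) > 1:
        return "ambiguous"  # boundary case: escalate, don't guess
    return best
```

The point of the sketch is the shape of the outcome, not the scoring method: clean cases never depend on which model sits behind the router.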

Grounded Single-Domain Use

TourAgent showed that grounding a local model against a deterministic domain substrate moved both tested models from non-zero wrong-or-missing rates to zero — regardless of their size gap. gemma3:27b went from 4/10 to 0/10; qwen2.5-coder:32b went from 3/10 to 0/10.

The dominant factor was grounding and substrate quality. What still separates stronger models is higher exactness and less answer softening above the failure floor — but the floor itself was set by the harness.

Exactness-Sensitive Handoff

The exactness-sensitive handoff regime is the one where raw model quality still binds most directly. When the downstream consumer is a machine-facing contract with no repair budget, schema adherence and reject rate matter — and these vary by model in ways that the other regimes do not expose.

This is the regime where "upgrade the model" is the most defensible prescription. Orchestration helps — escalation rules and routing can reduce how often the system hits this path — but once inside it, model fidelity under protocol pressure is the dominant factor.
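What "no repair budget" means in practice can be sketched as a strict contract check: any deviation is a hard reject, and nothing upstream can compensate. The required fields and types below are an illustrative contract, not a Phoenix schema.

```python
import json

# Illustrative machine-facing contract, not a Phoenix schema.
REQUIRED_FIELDS = {"id": int, "status": str}

def accept(payload: str) -> bool:
    """Contract check with zero repair budget.

    Bad JSON, extra keys, missing keys, or wrong types are all hard
    rejects. No layer above can fix fidelity here; only the model's
    adherence under protocol pressure determines the reject rate.
    """
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED_FIELDS):
        return False
    return all(isinstance(obj[k], t) for k, t in REQUIRED_FIELDS.items())
```

Note how unforgiving the check is by design: a string `"3"` where an integer is required fails exactly like malformed JSON. That is what makes this the regime where model quality binds directly.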

Repair-Assisted Pipeline

The Role Suitability laptop lane showed that a tiny trusted repair layer moved llama3.1:8b from 1/6 to 5/6 under safe repair, the same score as qwen2.5:14b under the same policy. For tasks inside the repair boundary, the repair layer narrowed the practical inter-model gap more than a model upgrade would have.

What still separates stronger models is semantic drift that falls outside the safe-repair scope. gpt-oss:20b failed to benefit at all because its errors were not safe-repairable. That is a structural limit of repair policy, not a reason to abandon it.

Laptop-Threshold Operation

The laptop lane also showed that usefulness survives hardware reduction when role and model are matched to the threshold. The hardware is a threshold condition. Once crossed, the question becomes which role and which model fit — not whether to add more GPU.

The RTX 3090 was available throughout Phoenix experiments, but the results did not depend on it being the decisive edge. Lower-threshold hardware with the right role assignment already showed real usefulness.
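The threshold framing can be made concrete as a role-availability check: each role has a hardware floor, and once the floor is crossed the question is role fit, not more GPU. The role names and VRAM figures below are illustrative assumptions, not Phoenix measurements.

```python
# Sketch of threshold-based role assignment. Role names and VRAM
# floors are illustrative assumptions, not Phoenix measurements.
ROLE_VRAM_GB = {          # minimum VRAM for each role to be useful
    "router": 6,
    "grounded_qa": 10,
    "exact_handoff": 24,
}

def roles_available(vram_gb: int) -> list[str]:
    """Hardware is a threshold condition, not a continuous dial.

    Below a role's floor, nothing helps; above it, role fit and model
    choice decide usefulness, and extra headroom adds little.
    """
    return sorted(r for r, need in ROLE_VRAM_GB.items() if vram_gb >= need)
```

On this framing, a laptop-class GPU that clears the floors for routing and grounded use is already genuinely useful, which matches the lane's result.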

Interpretation

The three results converge on the same lesson: the dominant factor changes by regime.

In routing and compression, the system became useful when the routing structure was organized — not when the strongest model was selected. In grounded single-domain use, the harness changed the answer surface before the model became exact. In repair-assisted pipelines, explicit policy narrowed the inter-model gap more than adding headroom did. In laptop-threshold operation, role fit and model selection against threshold mattered more than the GPU tier.

The hardware conclusion follows from the same logic. A high-end local GPU can be enough without being the decisive edge forever. Once a competence threshold is crossed, organization often matters more than further raw headroom.

An Outside Analogy

A recent RTX 3090 comparison between a PyTorch path and a fused Numba CUDA kernel produced a result that looked backward on paper: the fused custom kernel lost badly, and an unhealthy CUDA environment had to be discarded before the comparison was even valid. That is not Phoenix evidence — it is a systems reminder that a theoretically stronger primitive can still lose to a better-organized stack.

The interpretive discipline is the same: trust the whole operating result, not just the elegant primitive.

The Stack Order

This is the operating hierarchy the paper's claim implies. The dominant factor in most Phoenix regimes so far appears at or before step 5 — model power is the last column, not the first.

1. Substrate

Deterministic domain substrate. The ground truth layer. What the model is grounded against.

2. Routing

The layer that decides which domain or tool handles the request. Structure here settles routing-regime outcomes.

3. Grounding

The connection between model and substrate. Grounding quality set the failure floor in TourAgent before model power mattered.

4. Escalation & Repair

Explicit policy for what happens when the model falls short. Repair policy closed the practical inter-model gap in the laptop lane.

5. Role Assignment

Matching model to role and hardware threshold. Role fit changed the result before model size did in laptop-lane operation.

6. Model Power

Raw model capability. Still the binding constraint in exactness-sensitive handoff. Not the first column in the other four regimes.
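The stack order above can be sketched as a resolution pipeline in which each layer gets a chance to settle the result before the next is consulted. Every name, rule, and fallback string below is an illustrative assumption, not Phoenix code; the only claim the sketch carries is the ordering, with model power consulted last.

```python
def resolve(query, domain_substrate, router, repair, model):
    """Toy resolution in stack order: substrate and routing first,
    repair policy next, raw model output last.

    Role assignment (step 5) is the choice of which `model` callable
    is passed in at all; it happens before this function runs.
    """
    domain = router(query)                          # 2. routing
    facts = domain_substrate.get(domain, {})
    if query in facts:                              # 1 + 3. substrate + grounding
        return facts[query]
    raw = model(query)                              # 6. model power, consulted last
    fixed = repair(raw)                             # 4. explicit repair policy
    return fixed if fixed is not None else "reject"
```

A covered fact never reaches the model, a repairable draft never reaches the user raw, and an unrepairable draft is rejected rather than shipped: the dominant factor shifts layer by layer, exactly as the regime map describes.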

Why This Matters

Most local-model evaluation stops at a flat ranking: which model scored highest on a benchmark. That framing is incomplete once real operating constraints are included — downstream machine-facing contracts, hardware thresholds, repair budgets, and role requirements all change which model actually wins.

The regime map is the useful artifact: it tells a builder where to invest in system organization and where raw model quality is still the binding constraint. That is a more actionable output than a leaderboard row.

For Project Phoenix specifically, this paper is the synthesis step that promotes the three addendum papers from standalone evaluations into a coherent systems claim.

Conclusion

The main lesson of the addendum is not that one local model finally won. It is that the useful unit is not the naked model and not the harness in isolation. It is the organized operating stack.

That stack includes substrate, routing, grounding, repair policy, role selection — and only then model power.

That is where orchestration beats raw model power: not everywhere, not always, but in enough real regimes that flat model rankings are no longer a sufficient description of system quality.