Project Phoenix · Synthesis Paper

Where Orchestration Beats Raw Model Power

Harnesses, Routing, Repair, And The Operating Regimes Where System Shape Settles The Result

3 Addendum papers synthesized
5 Operating regimes mapped
100% Deterministic routing accuracy
0/10 Grounded wrong-or-missing answers

Central conclusion: once hardware crosses a competence threshold, the organized operating stack may settle the outcome before raw model size alone does. The claim holds in identifiable regimes and rests on three converging Phoenix results as evidence.

The Question That Changed

After the Project Phoenix local-model addendum, the useful question shifted.

The old question was: which local model won?

The new question is: where does better orchestration beat more raw model power?

This is a boundary question — broader than any one benchmark row, narrower than a systems manifesto. The paper addresses it by mapping three existing Phoenix results onto five operating regimes and asking, for each regime, what actually settled the outcome.

Executive Summary

The Project Phoenix local-model addendum produced three linked results. TourAgent showed that grounding removes wrong-or-missing answers faster than it creates exactness. ShowcaseAgent showed that routing and compression become useful before broad direct-tool execution becomes uniform. Local Model Role Suitability showed that different local models win in different roles and policy regimes, with smaller models already sufficient for meaningful routing and some grounded use.

Taken together, the results support a boundary claim rather than a slogan: once hardware is good enough, the organized operating stack may settle the outcome before raw model size alone does. This is not a claim that small beats large. It is a claim that the dominant factor changes by operating regime — and that raw model power is only one column in the result table.

This paper maps those results onto five operating regimes and explains what orchestration means concretely in each.

Context

Project Phoenix accumulated three local-model addendum papers across 2025–2026. Each was a standalone evaluation: TourAgent tested grounded domain answering, ShowcaseAgent tested meta-tool routing and compression, and Local Model Role Suitability tested how policy and role selection shifted the model-fit boundary.

All three papers produced useful standalone results, but they had not yet been placed beside each other. Once they are, the synthesis question becomes unavoidable: what do these results add up to?

The hardware backdrop matters too. A high-end local GPU (RTX 3090) was available throughout the experiments, but the results did not depend on it as the decisive edge. The laptop lane in the role suitability paper showed that some roles were already useful at a lower hardware threshold. This changes the framing: hardware is a threshold condition, not destiny.

The Three Addendum Papers

TourAgent

Grounded domain answering. Tested whether a deterministic substrate removes outright failure before exactness improves. Two models evaluated raw and grounded.

See results →

ShowcaseAgent

Meta-tool routing and compression. Tested whether routing across a multi-domain portfolio becomes useful before every direct-tool path is uniform.

See results →

Role Suitability

Policy and role selection. Tested how model-fit shifts when roles are separated and repair policy is made explicit. Five models, three policy levels, laptop lane.

See results →

What Orchestration Means Here

The word "orchestration" turns vague when left undefined. In this paper it means the explicit organization of six things, in this order:

  1. Deterministic substrate selection
  2. Routing layers
  3. Grounding layers
  4. Escalation rules
  5. Repair rules
  6. Model assignment by role

And only then: model power.

That ordering is the paper's main claim. It is not about smaller models versus larger ones. It is about system shape.
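The six-layer ordering above can be sketched as code. The following is a minimal, hypothetical illustration in Python: every name, rule, and model label here is an assumption invented for this sketch, not the actual Phoenix implementation. The point it shows is structural, matching the paper's ordering: the deterministic substrate, router, grounding, escalation, and repair layers are all fixed before any model is named, and the model is only consulted when grounding misses.

```python
from dataclasses import dataclass, field


def fake_model_call(model: str, query: str) -> str:
    """Stand-in for a local model; a real system would hit an inference API."""
    return f"{model} answer to: {query}"


@dataclass
class OrchestrationStack:
    # 1. Deterministic substrate: exact answers tried before any model call.
    substrate: dict = field(default_factory=dict)
    # 6. Model assignment by role comes last; the stack shape is fixed first.
    models: dict = field(default_factory=lambda: {
        "router": "small-local", "answerer": "large-local"})

    def route(self, query: str) -> str:
        # 2. Routing layer (a toy keyword rule, purely illustrative).
        return "router" if query.startswith("route:") else "answerer"

    def ground(self, query: str):
        # 3. Grounding layer: attach a substrate fact when one exists.
        return self.substrate.get(query)

    def answer(self, query: str) -> str:
        fact = self.ground(query)
        if fact is not None:
            # Grounded path: the substrate settles it; no model is called.
            return f"[grounded] {fact}"
        # 4. Escalation rule: a model is consulted only on a grounding miss.
        role = self.route(query)
        reply = fake_model_call(self.models[role], query)
        if not reply:
            # 5. Repair rule: one retry with an explicit repair hint.
            reply = fake_model_call(self.models[role], f"(repair) {query}")
        return reply


stack = OrchestrationStack(substrate={"capital of France?": "Paris"})
print(stack.answer("capital of France?"))   # settled by the substrate
print(stack.answer("route: which tool?"))   # routed to the small model
```

The design choice the sketch encodes is the paper's main claim: model power enters only at step 6, after the system shape is already fixed.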

Scope and Non-Claims

This paper does not claim:

  - That smaller models beat larger ones
  - That orchestration always beats raw model power
  - That hardware no longer matters

It claims something narrower: the system may settle the result before raw model power alone does, in identifiable regimes, using the evidence from three Phoenix papers.

Navigation

Evidence

The three addendum experiment results in full: TourAgent grounding numbers, ShowcaseAgent routing tables, and the Role Suitability laptop-lane comparison.

Open Evidence →

Regime Map

The five operating regimes, the dominant factor in each, and the full interpretation — including what still separates stronger models when system shape is held equal.

Open Regime Map →