Project Phoenix · Evidence

The Three Addendum Results

TourAgent, ShowcaseAgent, and Role Suitability — read together, not in isolation.

This is a synthesis paper, not a new benchmark program. These results existed before this paper. The contribution is placing them beside each other and reading what they add up to.

TourAgent — Harness Before Exactness

Investigation: Tested whether grounding a local model against a deterministic domain substrate removes outright failure before it produces artifact-tight fidelity. Two models evaluated raw and grounded against the same domain task set.

The question it answers: Does a grounding layer change the answer surface in a meaningful way, even if it does not immediately produce exact outputs?

Wrong-or-Missing Answer Rate: Raw vs. Grounded

Model             | Raw (wrong or missing / 10) | Grounded (wrong or missing / 10) | Change
gemma3:27b        | 4 / 10                      | 0 / 10                           | −4 wrong-or-missing
qwen2.5-coder:32b | 3 / 10                      | 0 / 10                           | −3 wrong-or-missing

Grounding changed the answer surface before exactness improved. Reliability improved before artifact-tight fidelity appeared. The harness did the work, not a model upgrade.
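The grounded condition above can be sketched as a small harness that fails closed: the model may only phrase facts retrieved from a deterministic substrate, and a missing fact becomes an explicit refusal rather than a guess. This is a minimal illustrative sketch, not TourAgent's actual implementation; the fact store, keys, and function names are all hypothetical.

```python
# Hypothetical grounding harness: answer only from a deterministic
# substrate; fail closed when the substrate has no matching fact.
FACT_STORE = {
    "capital_of_region_a": "Northport",   # illustrative domain facts
    "opening_hours": "09:00-17:00",
}

def grounded_answer(query_key, model_answer_fn):
    """Return a grounded answer, or an explicit refusal if ungrounded."""
    fact = FACT_STORE.get(query_key)
    if fact is None:
        # A missing fact is a refusal, not a guess — this is what
        # removes the wrong-or-missing failures before exactness improves.
        return {"status": "no_grounding", "answer": None}
    # The model is constrained to phrasing the retrieved fact.
    return {"status": "grounded", "answer": model_answer_fn(fact)}

# Stand-in for a local model call: it just phrases the grounded fact.
result = grounded_answer("opening_hours", lambda fact: f"The hours are {fact}.")
```

The point of the sketch is the shape, not the lookup: the failure floor is set by the harness's refusal path, independent of which model sits behind `model_answer_fn`.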

What This Tells the Regime Map

In the grounded single-domain use regime, the dominant factor is grounding and deterministic substrate quality — not raw model power. Both models moved to zero failures. What still separates stronger models is higher exactness and less answer softening — but the failure floor was set by the harness, not the model tier.

ShowcaseAgent — Compression Before Broad Execution

Investigation: Tested whether routing and domain compression across a multi-domain portfolio could become useful before every direct-tool execution path became uniform. Four models evaluated against deterministic and model-driven routing paths.

The question it answers: Does the system become navigationally useful before execution symmetry exists across all direct-tool paths?

Routing Accuracy: Deterministic vs. Model-Driven

Routing path  | Model             | Correct / 41 | Accuracy
Deterministic | n/a               | 41 / 41      | 100%
Model-driven  | gemma3:27b        | 39 / 41      | 95%
Model-driven  | gpt-oss:20b       | 39 / 41      | 95%
Model-driven  | llama3.1:8b       | 39 / 41      | 95%
Model-driven  | qwen2.5-coder:32b | 41 / 41      | 100%

Three models of substantially different sizes reached 39/41. The two missed items are semantic boundary cases shared across those models — a function of the query semantics, not of raw model size. The system was navigationally useful long before execution symmetry existed across all direct-tool paths.
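The two routing paths compared above can be sketched as a layered router: a deterministic table is consulted first, and only unmatched queries fall through to a model-driven router. This is an illustrative sketch under assumed names — the keyword table, domain labels, and `route` function are hypothetical, not ShowcaseAgent's actual routing logic.

```python
# Hypothetical two-path router: deterministic table first,
# model-driven fallback only for queries the table cannot place.
ROUTING_TABLE = {
    "invoice": "billing",        # illustrative keyword → domain pairs
    "refund": "billing",
    "deploy": "infrastructure",
    "tour": "showcase",
}

def route(query, model_router=None):
    """Return (domain, path) for a query; path names which layer decided."""
    for keyword, domain in ROUTING_TABLE.items():
        if keyword in query.lower():
            return domain, "deterministic"
    if model_router is not None:
        # Only semantic boundary cases reach the model at all.
        return model_router(query), "model"
    return None, "unrouted"

domain, path = route("Where is my refund?")
```

Under this shape, the deterministic layer's 100% accuracy is structural, and model quality only matters on the residue that falls through — which is why an 8B and a 32B model land on the same score for core routing.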

What This Tells the Regime Map

In the routing and compression regime, the dominant factor is routing structure and compression surface — not which model is strongest. What still separates models is family-specific routing quality on the shared semantic boundary cases — not any gap between an 8B and a 32B model on core routing tasks.

Local Model Role Suitability — Policy Changes the Boundary

Investigation: Tested how model-fit shifts when roles are separated (routing, grounded domain use, exactness-sensitive handoff) and when repair policy is made explicit. Five models across three policy levels and a laptop-threshold lane.

The question it answers: Does role and policy selection change the result before raw model power does?

Laptop Lane — Strict vs. Safe Repair Policy

The laptop lane used lower-threshold hardware, testing whether usefulness survives after the GPU tier drops.

Model       | Strict (/ 6) | Safe repair (/ 6) | Gap closed by repair
qwen2.5:14b | 4 / 6        | 5 / 6             | +1
llama3.1:8b | 1 / 6        | 5 / 6             | +4
gpt-oss:20b | 0 / 6        | 0 / 6             | no change

A tiny trusted repair layer moved llama3.1:8b from 1/6 to 5/6 — the same score as qwen2.5:14b under safe repair. Repair policy narrowed the practical inter-model gap more than adding raw model headroom did. gpt-oss:20b failed to benefit because its errors fell outside the safe-repair boundary.
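A "tiny trusted repair layer" of this kind can be sketched as a small set of mechanical, bounded fixes applied before strict parsing: anything the fixes cannot reach still fails. The specific repairs below (stripping a wrapping code fence, removing trailing commas before a closing brace) are illustrative assumptions, not the actual policy evaluated in the Role Suitability runs.

```python
# Hypothetical safe-repair layer: bounded mechanical fixes applied to
# near-valid JSON output before strict parsing. Errors outside the
# repair boundary still fail — which is why gpt-oss:20b saw no change.
import json
import re

def safe_repair(raw):
    """Apply trusted, mechanical repairs; never invent content."""
    text = raw.strip()
    # Strip a wrapping markdown code fence, if present.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Remove trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return text

def parse(raw, policy="strict"):
    """Parse model output under an explicit policy level."""
    text = raw if policy == "strict" else safe_repair(raw)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # outside the repair boundary: still a failure

# Near-valid output: fails under strict, passes under safe repair.
messy = '```json\n{"score": 5,}\n```'
```

The design point is the boundary, not the regex: repairs are trusted precisely because they are mechanical and content-free, so the policy narrows the inter-model gap only for errors of form, never errors of substance.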

What This Tells the Regime Map

Two regimes in the map are directly supported here. In the repair-assisted pipeline regime, explicit repair policy is the dominant factor — not the model tier. In the laptop-threshold operation regime, role fit and threshold-appropriate model choice are the dominant factors. Hardware is a threshold condition. Once crossed, organization matters more than further headroom.

What the Three Results Add Up To

Each result, read alone, is a useful data point. Read together, they converge on the same lesson from three different angles:

Harness → failure floor

TourAgent: grounding set the failure floor. Both models moved to zero failures regardless of their size gap.

Structure → routing quality

ShowcaseAgent: routing structure drove accuracy. An 8B model matched a 32B model on core routing — both hit the same semantic boundary.

Policy → inter-model gap

Role Suitability: a repair layer closed the gap between models more than a model upgrade would have, for the tasks inside its scope.

The synthesis claim follows: in these regimes, system shape settled the result before raw model power did. The dominant factor was not the model — it was the organization around the model.

See the full regime map →