TourAgent, ShowcaseAgent, and Role Suitability — read together, not in isolation.
This is a synthesis paper, not a new benchmark program. The results reported here predate it; the contribution is placing them side by side and reading what they add up to.
Investigation: Tested whether grounding a local model against a deterministic domain substrate removes outright failure before it produces artifact-tight fidelity. Two models evaluated raw and grounded against the same domain task set.
The question it answers: Does a grounding layer change the answer surface in a meaningful way, even if it does not immediately produce exact outputs?
| Model | Raw (wrong or missing / 10) | Grounded (wrong or missing / 10) | Change |
|---|---|---|---|
| gemma3:27b | 4 / 10 | 0 / 10 | −4 wrong-or-missing |
| qwen2.5-coder:32b | 3 / 10 | 0 / 10 | −3 wrong-or-missing |
Grounding changed the answer surface before exactness improved. Reliability improved before artifact-tight fidelity appeared. The harness did the work, not a model upgrade.
In the grounded single-domain use regime, the dominant factor is grounding and deterministic substrate quality — not raw model power. Both models moved to zero failures. What still separates stronger models is higher exactness and less answer softening — but the failure floor was set by the harness, not the model tier.
Investigation: Tested whether routing and domain compression across a multi-domain portfolio could become useful before every direct-tool execution path became uniform. Four models evaluated against deterministic and model-driven routing paths.
The question it answers: Does the system become navigationally useful before execution symmetry exists across all direct-tool paths?
| Routing path | Model | Correct / 41 | Accuracy |
|---|---|---|---|
| Deterministic | n/a | 41 / 41 | 100% |
| Model-driven | gemma3:27b | 39 / 41 | 95% |
| Model-driven | gpt-oss:20b | 39 / 41 | 95% |
| Model-driven | llama3.1:8b | 39 / 41 | 95% |
| Model-driven | qwen2.5-coder:32b | 41 / 41 | 100% |
Three models of substantially different sizes reached 39/41. The two missing items are semantic boundary cases shared across the models — not a function of raw model size. The system was navigationally useful long before execution symmetry existed across all direct-tool paths.
In the routing and compression regime, the dominant factor is routing structure and compression surface — not which model is strongest. What still separates stronger models is shared semantic boundary cases and family-specific routing quality, not the gap between an 8B and a 32B model on core routing tasks.
Investigation: Tested how model-fit shifts when roles are separated (routing, grounded domain use, exactness-sensitive handoff) and when repair policy is made explicit. Five models across three policy levels and a laptop-threshold lane.
The question it answers: Does role and policy selection change the result before raw model power does?
The laptop lane used lower-threshold hardware, testing whether usefulness survives after the GPU tier drops.
| Model | Strict (/ 6) | Safe repair (/ 6) | Gap closed by repair |
|---|---|---|---|
| qwen2.5:14b | 4 / 6 | 5 / 6 | +1 |
| llama3.1:8b | 1 / 6 | 5 / 6 | +4 |
| gpt-oss:20b | 0 / 6 | 0 / 6 | no change |
A tiny trusted repair layer moved llama3.1:8b from 1/6 to 5/6 — the same score as qwen2.5:14b under safe repair. Repair policy narrowed the practical inter-model gap more than adding raw model headroom did. gpt-oss:20b failed to benefit because its errors fell outside the safe-repair boundary.
Two regimes in the map are directly supported here. In the repair-assisted pipeline regime, explicit repair policy is the dominant factor — not the model tier. In the laptop-threshold operation regime, role fit and threshold-appropriate model choice are the dominant factors. Hardware is a threshold condition. Once crossed, organization matters more than further headroom.
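One common shape for a trusted repair layer is a mechanical normalizer with a hard boundary. The sketch below assumes JSON handoffs; the specific repair rules (code-fence stripping, trailing-comma removal) are illustrative assumptions, not the actual Role Suitability policy.

```python
import json
import re


def safe_repair(raw: str):
    """Strict parse first; then a small whitelist of mechanical repairs.
    Anything outside that boundary still fails, rather than being guessed at."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    cleaned = raw.strip()
    # Trusted repair 1: strip markdown code fences around the payload.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", cleaned)
    # Trusted repair 2: drop trailing commas before a closing brace/bracket.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # outside the safe-repair boundary: report, don't guess
```

The hard boundary is the point: errors the whitelist cannot mechanically normalize still fail loudly, which is consistent with one model gaining nothing from repair because its errors fell outside that boundary.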
Each result, read alone, is a useful data point. Read together, they converge on the same lesson from three different angles:
- TourAgent: grounding set the failure floor. Both models moved to zero failures regardless of their size gap.
- ShowcaseAgent: routing structure drove accuracy. An 8B model matched a 32B model on core routing; both hit the same semantic boundary.
- Role Suitability: a repair layer closed the gap between models more than a model upgrade would have, for the tasks inside its scope.
The synthesis claim follows: in these regimes, system shape settled the result before raw model power did. The dominant factor was not the model — it was the organization around the model.
See the full regime map →