TourAgent
Grounded single-domain result
Grounding the agent in a deterministic tennis substrate changed the answer surface before raw local-model exactness improved: grounding removed wrong-or-missing answers faster than it produced artifact-tight fidelity.
Project Phoenix
Three narrower papers sit underneath the flagship Phoenix claims. Together they support a more practical local-model view: harnesses, routing surfaces, protocol discipline, repair policy, and role fit often settle usefulness before raw model strength alone does.
The details layer
Routing and compression result
Portfolio-scale local usefulness may emerge in routing and capability compression before broad direct-tool execution becomes uniform across models.
Role-boundary and policy result
Smaller local models are already enough for meaningful roles, but low-repair machine-facing work remains a sharper threshold. The useful question is not which model won in general, but which model is good enough for which operating role.
Top-level idea
Container
The full operating wrapper around the model: prompt structure, tools, grounding, validators, repair rules, routing, and output constraints. In Phoenix, the harness is the system the model lives inside, not a prompt accessory.
Claim
Once hardware is good enough, outcomes are often settled by how the system is organized before they are settled by raw model size alone.
Definition
What components exist in the system: substrate, grounding layer, tools, validators, repair path, routing surface, model assignments.
How those components are coordinated at runtime: what order they run in, which layer owns what, when the system routes, grounds, validates, repairs, or escalates, and where correctness is supposed to live.
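The coordination described above can be sketched as a single loop. Everything here is an illustrative placeholder under the stated assumptions, not a Phoenix API: the function names, the toy model, and the repair rule are invented to show the ordering (route, ground, generate, validate, repair, escalate) and the fact that correctness is owned by the validator rather than the model.

```python
from typing import Callable

# Hypothetical harness loop; every name below is an illustrative
# placeholder, not a Phoenix API. The ordering mirrors the coordination
# described above: route -> ground -> generate -> validate -> repair ->
# escalate, with correctness owned by the validator, not the model.

def run_task(task: str, route: Callable, ground: Callable,
             generate: Callable, validate: Callable, repair: Callable,
             escalate: Callable, max_repairs: int = 1):
    model = route(task)                  # assignment, not answer generation
    evidence = ground(task)              # inject verified external reality
    answer = generate(model, task, evidence)
    for _ in range(max_repairs + 1):
        ok, errors = validate(answer)    # output contract check
        if ok:
            return answer
        answer = repair(answer, errors)  # safe repair before escalation
    return escalate(task)                # fallback to a stronger path

# Toy wiring: the "model" wraps its JSON in chat filler, the validator
# demands bare JSON, and the repair step strips everything before "{".
result = run_task(
    "score?",
    route=lambda t: "local-model",
    ground=lambda t: {"score": "6-4"},
    generate=lambda m, t, ev: 'Sure! {"score": "6-4"}',
    validate=lambda a: (a.startswith("{"), ["leading chat filler"]),
    repair=lambda a, e: a[a.index("{"):],
    escalate=lambda t: None,
)
```

The point of the sketch is the layering: the model is one call inside the loop, and every other step belongs to the harness.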
Inside the harness
Injecting verified external reality into the answer path: evidence bundles, fixed datasets, retrieved facts, deterministic artifacts.
External actions the model can call: retrieval, code execution, validators, file access, simulators, search.
The rules output must satisfy: exact JSON, schema shape, citation requirements, sequencing rules, downstream contracts.
Choosing which model, tool, or path handles a task. Routing is assignment, not answer generation.
Reducing an open problem into a smaller, more reliable decision surface the system can handle consistently.
Matching models to jobs they are actually suited for: routing, grounded response, exactness-sensitive handoff, repair-assisted work.
What the system allows after failure: wrapper stripping, safe repair, retries, fallback escalation, trusted extraction.
What becomes operationally viable on a laptop, workstation, or stronger machine once memory, latency, and stability cross practical thresholds.
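The repair-policy item above (wrapper stripping, safe repair, trusted extraction, fallback) can be sketched as a small ladder for a strict-JSON output contract. The function name and the specific rungs are assumptions for illustration, not Phoenix code:

```python
import json
import re

# Hypothetical repair ladder for a strict-JSON output contract:
# 1) accept the output as-is, 2) strip a markdown code fence,
# 3) extract the first {...} span, 4) give up and signal escalation.
def repair_json(raw: str):
    candidates = [raw.strip()]
    fence = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fence:
        candidates.append(fence.group(1).strip())   # wrapper stripping
    brace = re.search(r"\{.*\}", raw, re.DOTALL)
    if brace:
        candidates.append(brace.group(0))           # trusted extraction
    for text in candidates:
        try:
            return json.loads(text)                 # safe repair succeeded
        except json.JSONDecodeError:
            continue
    return None                                     # escalate to fallback
```

Returning `None` rather than raising keeps the escalation decision with the caller, which matches the idea that the repair path only defines what is allowed after failure, not what happens next.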
Important distinction
A model without native tools can still be useful if the Phoenix wrapper supplies retrieval, validation, or execution surfaces.
A model that cannot stay inside strict contracts creates downstream breakage even if it is semantically smart. Weak protocol discipline is harder to wrap away.
Concrete examples
Environment weakness
Gemma3 may lack the native tool surface you want, but Phoenix can compensate by wrapping it in retrieval, validators, and grounded domain paths.
Missing tools can often be compensated for by the harness.
Measurement integrity first
The original gemma4 protocol failures were a capture artifact: ollama run subprocess capture embeds VT100 terminal sequences that corrupt multi-line JSON for thinking-mode models. Under clean REST API capture, gemma4:31b passes all six protocol probes, the strongest local result on this lane. gemma4:26b passes with /no_think suppression.
The harness boundary matters for measurement, not just for deployment. Invalid capture produces invalid rankings.
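The capture-artifact failure mode can be demonstrated with a standard ANSI/VT100 escape stripper. The regex below is a common generic pattern, not the Phoenix harness, and it is a mitigation only; as stated above, clean REST API capture is the correct fix for measurement.

```python
import re

# Matches CSI sequences (ESC [ params intermediates final-byte) and bare
# two-character escapes: the kind of VT100 noise a terminal-oriented
# subprocess capture can embed inside otherwise valid multi-line JSON.
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]|\x1b[@-Z\\-_]")

def clean_capture(raw: str) -> str:
    return ANSI_RE.sub("", raw)

# Cursor-hide, erase-line, and cursor-show sequences interleaved with JSON:
corrupted = '\x1b[?25l{"probe":\x1b[2K\n "pass"}\x1b[?25h'
print(clean_capture(corrupted))   # the JSON parses again once stripped
```

Note that stripping after the fact still trusts the capture path; scoring protocol compliance from a transport that never injects terminal control bytes is the safer measurement boundary.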
Phoenix regime map
Dominant factor: routing structure and compression surface.
Dominant factor: grounding quality and deterministic substrate quality.
Dominant factor: protocol compliance under pressure.
Dominant factor: explicit repair policy.
Dominant factor: role fit and threshold-appropriate model choice.
Operator shell pattern
Shell outside, authority inside
OpenClaw provides the operator access layer: five HTTP surfaces, a hardening gate, and an incident workflow. Phoenix provides what makes those surfaces trustworthy: deterministic substrates, solver-backed outputs, grounded domain paths. The authority boundary does not move regardless of which use case is active.
An operator shell is not the authority layer and should never become one.
Measurement, discipline, decision surface, demo
Seven use cases across three categories show the pattern holds under load. Rule-based routing delivers 41/41 accuracy versus 39/41 for forced-LLM routing. Substrate coverage accounts for 82.4% of harness feature importance. Every measurement routes to a Phoenix backend; every demo shows the deterministic layer outperforming the model in its own domain.
The shell stays outside. The authority stays inside. Correctness stays in the deterministic layer.
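A rule-based router of the kind credited above can be very small: deterministic assignment first, with a model seeing the task only when no rule fires. The rules, predicates, and route names below are invented for illustration and are not the OpenClaw rule set:

```python
# Hypothetical rule table: each entry is (predicate, destination).
# Routing is assignment, not answer generation; a rule hit means
# no model is consulted at all.
RULES = [
    (lambda task: "score" in task or "match" in task, "tennis-substrate"),
    (lambda task: task.strip().endswith("?"),         "grounded-qa"),
]

def route(task: str) -> str:
    for predicate, destination in RULES:
        if predicate(task):
            return destination      # deterministic path: auditable, exact
    return "llm-fallback"           # only now does a model see the task

print(route("What was the final match score?"))
```

Because every rule hit is deterministic, the only routes that can be wrong are the fallback ones, which is one way a rule-first router can outperform forcing every decision through a model.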
One-line thesis
Do not ask only which model is strongest. Ask which layer is carrying the result: grounding, routing, protocol discipline, repair policy, role fit, or raw model power.
Next stop
These details are the evidence layer. The broader synthesis claim lives in Where Orchestration Beats Raw Model Power.