Project Phoenix

Local Model Details

Three narrower papers sit underneath the flagship Phoenix claims. Together they support a more practical view of local models: harness design, routing surfaces, protocol discipline, repair policy, and role fit often settle usefulness before raw model strength does.

The Three Papers

TourAgent

Grounded single-domain result

A deterministic tennis substrate changed the answer surface before raw local-model exactness improved: grounding removed wrong-or-missing answers faster than it produced artifact-tight fidelity.

Open TourAgent →

ShowcaseAgent

Routing and compression result

Portfolio-scale local usefulness may emerge first in routing and capability compression, before broad direct-tool execution becomes uniform across models.

Open ShowcaseAgent →

Local Model Role Suitability

Role-boundary and policy result

Smaller local models are already enough for meaningful roles, but low-repair machine-facing work remains a sharper threshold. The useful question is not which model won in general, but which model is good enough for which operating role.

Open Role Suitability →

Harness Contains The Layers

Claim

System Shape

Once hardware is good enough, outcomes are often settled by how the system is organized before they are settled by raw model size alone.

What Orchestration Means Here

Architecture

What components exist in the system: substrate, grounding layer, tools, validators, repair path, routing surface, model assignments.

Orchestration

How those components are coordinated at runtime: what order they run in, which layer owns what, when the system routes, grounds, validates, repairs, or escalates, and where correctness is supposed to live.
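The runtime order described above can be sketched as a single loop. This is a minimal illustration, not the Phoenix implementation: every function name here (route, ground, validate, repair) is a hypothetical stand-in for the corresponding layer.

```python
# Hypothetical sketch of the orchestration order: route, ground, generate,
# validate, repair, escalate. All names are illustrative, not Phoenix APIs.
import json

def route(task):
    # Selection layer: assignment only, no answer generation.
    return "grounded" if task.get("domain") else "direct"

def ground(task):
    # Reality layer: inject verified facts (stubbed as a fixed dataset).
    facts = {"tennis": {"surface": "clay"}}
    return facts.get(task.get("domain"), {})

def validate(raw):
    # Constraint layer: output must be exact JSON with an "answer" key.
    try:
        obj = json.loads(raw)
        return ("answer" in obj), obj
    except json.JSONDecodeError:
        return False, None

def repair(raw):
    # Recovery layer: strip a markdown-fence wrapper, then re-validate.
    stripped = raw.strip().removeprefix("```json").removesuffix("```").strip()
    ok, obj = validate(stripped)
    return obj if ok else None

def run(task, model):
    path = route(task)
    raw = model(task, ground(task), path)
    ok, obj = validate(raw)
    if ok:
        return obj
    repaired = repair(raw)
    if repaired is not None:
        return repaired
    return {"answer": None, "escalated": True}  # fallback escalation
```

The point of the shape is that correctness lives in the validate/repair/escalate path, not in the model call.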

Phoenix Operating Layers

Reality Layer

Grounding

Injecting verified external reality into the answer path: evidence bundles, fixed datasets, retrieved facts, deterministic artifacts.

Capability Layer

Tools

External actions the model can call: retrieval, code execution, validators, file access, simulators, search.

Constraint Layer

Protocols

The rules output must satisfy: exact JSON, schema shape, citation requirements, sequencing rules, downstream contracts.
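A protocol check of this kind can be sketched as a small contract validator. The field names and contract shape below are assumptions for illustration, not a Phoenix schema.

```python
# Illustrative constraint-layer check: exact JSON, required schema shape,
# and a citation requirement. CONTRACT and its keys are hypothetical.
import json

CONTRACT = {"answer": str, "citations": list}

def check_contract(text):
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    for key, typ in CONTRACT.items():
        if key not in obj:
            return False, f"missing key: {key}"
        if not isinstance(obj[key], typ):
            return False, f"wrong type for {key}"
    if not obj["citations"]:
        return False, "citation requirement unmet"
    return True, "ok"
```

A downstream consumer only ever sees output that passed this gate, which is what makes the contract a constraint rather than a suggestion.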

Selection Layer

Routing

Choosing which model, tool, or path handles a task. Routing is assignment, not answer generation.
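The assignment-not-generation distinction is easiest to see in a rule-based router. The rules and target names below are invented for illustration; nothing here is the Phoenix routing table.

```python
# Hypothetical rule-based router: pure assignment by keyword match.
# It never generates an answer, only chooses which path handles the task.
RULES = [
    ("schedule", "tour_agent"),
    ("portfolio", "showcase_agent"),
    ("json", "strict_model"),
]

def assign(query, default="general_model"):
    q = query.lower()
    for keyword, target in RULES:
        if keyword in q:
            return target
    return default
```

Because the router's output is a label rather than text, it can be exhaustively tested, which is one reason rule-based routing can outperform an LLM forced into the same role.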

Surface Layer

Compression

Reducing an open problem into a smaller, more reliable decision surface the system can handle consistently.

Allocation Layer

Role Selection

Matching models to jobs they are actually suited for: routing, grounded response, exactness-sensitive handoff, repair-assisted work.

Recovery Layer

Repair Policy

What the system allows after failure: wrapper stripping, safe repair, retries, fallback escalation, trusted extraction.
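An explicit repair policy can be sketched as a ladder of allowed steps, each gated by the policy and falling through to escalation. The step names and regexes below are illustrative assumptions, not the Phoenix repair path.

```python
# Sketch of an explicit repair ladder: each repair step is permitted by
# policy, attempted in order, and failure falls through to escalation.
# Step names ("strip_fence", "extract_object") are hypothetical.
import json
import re

def try_parse(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

def repair(raw, policy=("strip_fence", "extract_object")):
    if "strip_fence" in policy:
        # Wrapper stripping: remove markdown code fences, keep the payload.
        stripped = re.sub(r"^```(?:json)?|```$", "", raw.strip(), flags=re.M).strip()
        obj = try_parse(stripped)
        if obj is not None:
            return obj, "strip_fence"
    if "extract_object" in policy:
        # Trusted extraction: pull the first {...} span out of chatter.
        m = re.search(r"\{.*\}", raw, flags=re.S)
        if m:
            obj = try_parse(m.group(0))
            if obj is not None:
                return obj, "extract_object"
    return None, "escalate"  # fallback escalation owns the last resort
```

Making the ladder explicit is the policy: a step the tuple does not name is a repair the system is not allowed to make.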

Deployment Layer

Hardware Threshold Effects

What becomes operationally viable on a laptop, workstation, or stronger machine once memory, latency, and stability cross practical thresholds.

Tools And Protocols Are Not The Same

Tools expand capability

A model without native tools can still be useful if the Phoenix wrapper supplies retrieval, validation, or execution surfaces.

Protocols constrain behavior

A model that cannot stay inside strict contracts creates downstream breakage even if it is semantically smart. Weak protocol discipline is harder to wrap away.

Gemma3 vs Gemma4

Gemma3

Environment weakness

Gemma3 may lack the native tool surface you want, but Phoenix can compensate by wrapping it in retrieval, validators, and grounded domain paths.

Missing tools can often be compensated for by the harness.

Gemma4

Measurement integrity first

The original gemma4 protocol failures were a capture artifact: subprocess capture of ollama run embeds VT100 terminal sequences that corrupt multi-line JSON for thinking-mode models. Under clean REST API capture, gemma4:31b passes all six protocol probes, the strongest local result on this lane. gemma4:26b passes with /no_think suppression.

The harness boundary matters for measurement, not just for deployment. Invalid capture produces invalid rankings.
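The capture artifact can be reproduced in miniature: terminal escape sequences embedded in a captured stream break JSON parsing until they are stripped. The sample bytes below are illustrative, not an actual ollama transcript, and regex stripping is only a mitigation; the clean path described above is REST API capture, which never emits the sequences at all.

```python
# Miniature reproduction of the VT100 capture artifact. The regex matches
# common CSI escape sequences; the captured string is a fabricated example.
import json
import re

ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

def clean_capture(text):
    return ANSI_CSI.sub("", text)

# Cursor-hide, erase-line, and cursor-show sequences interleaved with JSON:
captured = "\x1b[?25l{\n  \"ok\": true\x1b[K\n}\x1b[?25h"

# json.loads(captured) would raise; the stripped form parses cleanly.
obj = json.loads(clean_capture(captured))
```

If the benchmark harness parses the raw stream, the model is scored on terminal noise rather than on its output, which is exactly how invalid capture produces invalid rankings.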

What Dominates In Each Regime?

Routing and compression

Dominant factor: routing structure and compression surface.

Grounded single-domain use

Dominant factor: grounding quality and deterministic substrate quality.

Exactness-sensitive handoff

Dominant factor: protocol compliance under pressure.

Repair-assisted pipeline

Dominant factor: explicit repair policy.

Laptop-threshold operation

Dominant factor: role fit and threshold-appropriate model choice.

OpenClaw Wraps Phoenix — It Does Not Replace It

The boundary rule

Shell outside, authority inside

OpenClaw provides the operator access layer: five HTTP surfaces, a hardening gate, and an incident workflow. Phoenix provides what makes those surfaces trustworthy: deterministic substrates, solver-backed outputs, grounded domain paths. The authority boundary does not move regardless of which use case is active.

An operator shell is not the authority layer and should never become one.

Seven use cases

Measurement, discipline, decision surface, demo

Seven use cases across three categories prove the pattern holds under load. Rule-based routing delivers 41/41 accuracy versus 39/41 for forced-LLM routing. Substrate coverage accounts for 82.4% of harness feature importance. Every measurement routes to a Phoenix backend; every demo shows the deterministic layer outperforming the model in its own domain.

The shell stays outside. The authority stays inside. Correctness stays in the deterministic layer.

What Phoenix Is Actually Saying

Do not ask only which model is strongest. Ask which layer is carrying the result: grounding, routing, protocol discipline, repair policy, role fit, or raw model power.

Flagship Synthesis

These details are the evidence layer. The broader synthesis claim lives in Where Orchestration Beats Raw Model Power.