Project Phoenix

Phoenix Boundary Results

Three papers that map where the organized Phoenix stack hits its limits — and what those limits reveal about system design. Failure is structural and predictable. The ceiling is often speed, not capability. And the organized stack loses in identifiable regimes.

The Three Papers

Grounded Agent Failure Is Structurally Determined

Paper 1.10 — failure prediction

Agentic failure family and outcome path are predictable from harness configuration features, not query content. Substrate coverage accounts for 82.4% of harness feature importance in TourAgent. Near-universal miss queries are indistinguishable by all classifier-accessible features — confirming they are routing surface design failures, not query-content failures.

Know the harness. That is where the failure lives.
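The prediction claim above can be sketched in miniature. This is a minimal, hypothetical illustration — the feature names, run records, and majority-vote table below are invented for exposition, not taken from Paper 1.10 — showing the key point that the predictor keys on harness configuration features and deliberately ignores query content:

```python
from collections import Counter, defaultdict

# Hypothetical run records: (harness configuration features, failure family).
# Feature names and values here are illustrative, not from the paper.
RUNS = [
    ({"substrate_coverage": "partial", "repair_tier": "strict"}, "routing_miss"),
    ({"substrate_coverage": "partial", "repair_tier": "strict"}, "routing_miss"),
    ({"substrate_coverage": "full",    "repair_tier": "strict"}, "protocol_reject"),
    ({"substrate_coverage": "full",    "repair_tier": "safe"},   "success"),
    ({"substrate_coverage": "full",    "repair_tier": "safe"},   "success"),
]

def fit(runs):
    """Build a majority-vote table from harness features to failure family."""
    table = defaultdict(Counter)
    for features, family in runs:
        table[tuple(sorted(features.items()))][family] += 1
    return table

def predict(table, features, query_text=None):
    """query_text is accepted but unused on purpose: the structural claim
    is that harness configuration alone determines the failure family."""
    key = tuple(sorted(features.items()))
    if key not in table:
        return "unknown"
    return table[key].most_common(1)[0][0]

table = fit(RUNS)
print(predict(table,
              {"substrate_coverage": "partial", "repair_tier": "strict"},
              query_text="best trail near the chalet?"))
```

The unused `query_text` parameter is the whole point: any two queries hitting the same harness configuration get the same prediction, which is exactly what makes near-universal misses invisible to query-content classifiers.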

True Ski Chalet Boundary Result

Paper 1.14 — operational speed ceiling

The first full true V4 ski chalet run: one local RTX 3090 machine, one data bundle, no cloud access. The local-only gemma3:27b + MatchAgent stack answered all 10 original questions correctly — but only 6/10 arrived within the 120-second usability threshold. Capability is not the local-only ceiling. Operational speed on derived and repair-heavy queries is.

10/10 correct. 6/10 fast enough. The boundary is real.

When The Organized Stack Loses

Paper 1.15 — boundary conditions of the orchestration advantage

Paper 1.6 established that the organized operating stack beats raw model power in identifiable regimes. This paper maps the boundary conditions where that advantage collapses: latency ceiling (coordination overhead consumes the time budget), coverage gap (routing surface design failures invisible to classifiers and stronger models alike), optimization maturity gap (PyTorch beats a fused Numba CUDA kernel by 5.5x), runtime mismatch (the ROCm wheel lacks the gfx1151 target), and policy/role mismatch (a larger model loses to a better-fit smaller model in a specific regime).

The advantage is real. So are the five ways it collapses.

Failure Is Structural, Not Random

Implication

Domain Expertise Is The Binding Constraint

Correct failure prediction requires domain expertise in harness architecture. A practitioner who does not know the harness cannot predict where the system will fail — regardless of how well they know the model.

Capability vs. Operational Speed

Capability ceiling

The model cannot produce a correct answer. This is what most local-model skepticism assumes. Paper 1.14 shows this is not the binding constraint in the true ski chalet scenario — the answers were correct.

Operational speed ceiling

The model can produce the correct answer but not within the usability threshold. Derived and repair-heavy queries consume the time budget before the answer arrives. This is the real local-only ceiling.
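The two ceilings above can be separated mechanically. A minimal sketch, assuming the 120-second usability threshold from the ski chalet runs; the `classify_run` helper and the run log shaped like the 10-question set are hypothetical illustrations, not the paper's harness:

```python
USABILITY_THRESHOLD_S = 120.0  # usability threshold from the chalet runs

def classify_run(correct: bool, latency_s: float) -> str:
    """A wrong answer is a capability failure; a correct answer that
    arrives after the threshold hits the operational speed ceiling."""
    if not correct:
        return "capability_failure"
    if latency_s > USABILITY_THRESHOLD_S:
        return "speed_failure"
    return "pass"

# Hypothetical log shaped like the 10-question set: all correct,
# but four derived/repair-heavy queries overrun the budget.
runs = [(True, 45.0)] * 6 + [(True, 180.0)] * 4
verdicts = [classify_run(c, t) for c, t in runs]
print(verdicts.count("pass"), "fast enough /", sum(c for c, _ in runs), "correct")
```

The design point is that correctness and latency are checked independently: a scorer that only checks answers would report 10/10 and miss the binding constraint entirely.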

When The Organized Stack Loses

Mode 1

Latency Ceiling

Coordination and repair overhead consume the time budget before an answer arrives, where the stronger raw primitive would have finished within it. The organized path is not just measurably slower; it is slower by enough to breach the budget.

Mode 2

Coverage Gap

The routing surface design is itself the failure point. Not fixable by a stronger model or more repair. Visible only to domain expertise in the harness architecture.

Mode 3

Optimization Maturity Gap

The raw primitive has accumulated decades of optimization the organized path cannot replicate. PyTorch beats a fused custom Numba CUDA kernel by 5.5x — not because the architecture is wrong, but because the ecosystem is mature.

Mode 4

Runtime Mismatch

Hardware support gaps break the organized path before the capability question even arises. The ROCm wheel lacks the gfx1151 target; raw HIP works. The organized layer cannot compensate for a missing runtime.

Mode 5

Policy / Role Mismatch

A larger model loses to a better-fit smaller model in a specific operating regime. More raw size does not recover the loss when the role assignment is wrong.

Policy Tier Predicts Protocol Acceptance

Strict

19% protocol acceptance. No repair shim. Baseline.

Wrapper

42% protocol acceptance. Wrapper stripping applied.

Safe Repair

50% protocol acceptance. Full safe repair policy applied.

Monotonic

Repair policy tier predicts acceptance monotonically across all tested configurations.

Implication

Repair policy is a first-class design decision. It is not a fallback — it is a performance variable.
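The three tiers above can be sketched as an acceptance ladder. This is a hypothetical illustration assuming a JSON reply protocol — the tier functions, the fence-stripping rule, and the trailing-comma repair are invented for exposition and are not the Phoenix repair shim — but it shows why acceptance is monotone: each tier accepts everything the tier below it accepts, plus more:

```python
import json
import re

def accept_strict(raw: str):
    """Strict tier: the payload must already be valid JSON. No repair shim."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

def accept_wrapper(raw: str):
    """Wrapper tier: first strip a markdown code fence, then parse strictly."""
    stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return accept_strict(stripped)

def accept_safe_repair(raw: str):
    """Safe-repair tier: wrapper stripping plus conservative fixes
    (here, only trailing commas) that cannot alter field values."""
    stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    repaired = re.sub(r",\s*([}\]])", r"\1", stripped)
    return accept_strict(repaired)

# A fenced reply with a trailing comma: rejected by the lower tiers,
# accepted once safe repair is allowed to fix it.
reply = '```json\n{"answer": "chalet A",}\n```'
for tier in (accept_strict, accept_wrapper, accept_safe_repair):
    print(tier.__name__, "->", "accept" if tier(reply) else "reject")
```

Because each tier is strictly more permissive than the one below it, acceptance rates can only rise as the tier increases, which is the structural reason the tier ordering predicts acceptance monotonically.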

What These Papers Are Actually Saying

The organized stack wins in identifiable regimes. It loses in identifiable regimes too. Knowing both boundaries is what separates domain expertise from benchmark optimism.

The Evidence Layers

These papers extend the flagship Phoenix claims. The local-model evidence layer lives in Local Model Details. The broader synthesis claim lives in Where Orchestration Beats Raw Model Power.