Project Phoenix
Phoenix Boundary Results
Three papers that map where the organized Phoenix stack hits its limits — and what
those limits reveal about system design. Failure is structural and predictable.
The ceiling is often speed, not capability. And the organized stack loses in
identifiable regimes.
The boundary layer
The Three Papers
Grounded Agent Failure Is Structurally Determined
Paper 1.10 — failure prediction
Agentic failure family and outcome path are predictable from harness configuration
features, not query content. Substrate coverage accounts for 82.4% of harness
feature importance in TourAgent. Near-universal-miss queries are indistinguishable
by any classifier-accessible feature, confirming they are routing surface design
failures, not query-content failures.
Know the harness. That is where the failure lives.
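A minimal sketch of what "prediction from harness features" means in practice,
assuming per-run records of harness configuration and failure family. The feature
names, the synthetic data, and the scikit-learn model choice are illustrative
assumptions, not the paper's actual schema or classifier.

```python
# Hypothetical sketch: predict failure family from harness features alone.
# Feature names and data are placeholders, not the paper's actual schema.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Harness configuration features only -- no query content.
X = np.column_stack([
    rng.uniform(0, 1, n),   # substrate_coverage
    rng.integers(0, 3, n),  # routing_surface_variant
    rng.integers(0, 3, n),  # repair_policy_tier (strict/wrapper/safe)
])
# Failure family label per run (placeholder generator).
y = (X[:, 0] < 0.4).astype(int)  # e.g. 1 = coverage-driven miss

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(
    ["substrate_coverage", "routing_surface_variant", "repair_policy_tier"],
    clf.feature_importances_,
):
    print(f"{name:24s} {imp:.3f}")
```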
True Ski Chalet Boundary Result
Paper 1.14 — operational speed ceiling
The first full true V4 ski chalet run: one local RTX 3090 machine, one data bundle,
no cloud access. The local-only gemma3:27b + MatchAgent stack answered all 10
original questions correctly, but only 6/10 arrived within the 120-second usability
threshold. Capability is not the local-only ceiling. Operational speed on derived
and repair-heavy queries is.
10/10 correct. 6/10 fast enough. The boundary is real.
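A sketch of the usability-threshold measurement, assuming a run_query callable
that wraps the local gemma3:27b + MatchAgent stack. The run_query and evaluate
names and the dummy call are hypothetical scaffolding; only the 120-second
threshold comes from the result above.

```python
# Sketch of the 120-second usability check. run_query is a placeholder
# for whatever invokes the local gemma3:27b + MatchAgent stack.
import time

USABILITY_THRESHOLD_S = 120.0

def evaluate(run_query, questions, expected):
    correct = fast = 0
    for q, want in zip(questions, expected):
        t0 = time.perf_counter()
        answer = run_query(q)
        elapsed = time.perf_counter() - t0
        ok = answer == want
        correct += ok
        fast += ok and elapsed <= USABILITY_THRESHOLD_S
        print(f"{elapsed:7.1f}s  {'ok' if ok else 'MISS'}  {q}")
    n = len(questions)
    print(f"{correct}/{n} correct, {fast}/{n} correct and fast enough")

# Dummy stand-in so the sketch runs; replace with the real stack call.
evaluate(lambda q: "42", ["dummy question"], ["42"])
```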
When The Organized Stack Loses
Paper 1.15 — boundary conditions of the orchestration advantage
Paper 1.6 established that the organized operating stack beats raw model power in
identifiable regimes. This paper maps the boundary conditions where that advantage
collapses: latency ceiling (coordination overhead consumes the time budget),
coverage gap (routing surface design failures invisible to classifiers and
stronger models alike), optimization maturity gap (PyTorch beats a fused Numba
CUDA kernel by 5.5x), runtime mismatch (the ROCm wheel lacks a gfx1151 target),
and policy/role mismatch (a larger model loses to a better-fit smaller model in
a specific regime).
The advantage is real. So are the five ways it collapses.
The central claim
Failure Is Structural, Not Random
Finding
Harness Configuration Dominates
Outcome prediction requires knowing the harness, not the query. Substrate coverage,
routing surface design, and repair policy tier are the predictive features.
Query content is not.
Implication
Domain Expertise Is The Binding Constraint
Correct failure prediction requires domain expertise in harness architecture.
A practitioner who does not know the harness cannot predict where the system
will fail — regardless of how well they know the model.
Two kinds of ceiling
Capability vs. Operational Speed
Capability ceiling
The model cannot produce a correct answer. This is what most local-model
skepticism assumes. Paper 1.14 shows this is not the binding constraint
in the true ski chalet scenario — the answers were correct.
Operational speed ceiling
The model can produce the correct answer but not within the usability
threshold. Derived and repair-heavy queries consume the time budget before
the answer arrives. This is the real local-only ceiling.
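The distinction can be made mechanical. A sketch of the two-ceiling taxonomy,
where the Outcome record shape and the example latencies are hypothetical; only
the threshold and the classification rule come from the result above.

```python
# Classify which ceiling binds for each query outcome.
# Record shape and sample data are illustrative placeholders.
from dataclasses import dataclass

THRESHOLD_S = 120.0

@dataclass
class Outcome:
    query: str
    correct: bool
    latency_s: float

def binding_ceiling(o: Outcome) -> str:
    if not o.correct:
        return "capability ceiling"         # model cannot produce the answer
    if o.latency_s > THRESHOLD_S:
        return "operational speed ceiling"  # right answer, too late
    return "no ceiling"

runs = [Outcome("derived aggregate", True, 187.0),
        Outcome("direct lookup", True, 14.2)]
for o in runs:
    print(f"{o.query:20s} -> {binding_ceiling(o)}")
```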
Five failure modes
When The Organized Stack Loses
Mode 1
Latency Ceiling
Coordination and repair overhead consume the time budget before the stronger primitive would have. The organized path is slower outright, not merely slower by a noticeable margin.
Mode 2
Coverage Gap
The routing surface design is itself the failure point. Not fixable by a stronger model or more repair. Visible only to domain expertise in the harness architecture.
Mode 3
Optimization Maturity Gap
The raw primitive has accumulated decades of optimization the organized path cannot replicate. PyTorch beats a fused custom Numba CUDA kernel by 5.5x, not because the architecture is wrong, but because the ecosystem is mature; see the benchmark sketch after this list.
Mode 4
Runtime Mismatch
Hardware support gaps break the organized path before the capability question even arises. ROCm wheel lacks the gfx1151 target; raw HIP works. The organized layer cannot compensate for a missing runtime.
Mode 5
Policy / Role Mismatch
A larger model loses to a better-fit smaller model in a specific operating regime. More raw size does not recover the loss when the role assignment is wrong.
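For Mode 3, a minimal micro-benchmark sketch of the comparison shape: a
hand-fused Numba CUDA elementwise kernel against the equivalent PyTorch
expression. The kernel, sizes, and timing scaffold are illustrative assumptions,
not the paper's workload; the point is only that the mature path tends to win
on accumulated optimization, not architecture.

```python
# Illustrative micro-benchmark: hand-fused Numba CUDA kernel vs. the
# equivalent PyTorch expression. Kernel and sizes are placeholders.
import time
import torch
from numba import cuda

@cuda.jit
def fused_axpb_relu(x, a, b, out):
    i = cuda.grid(1)
    if i < x.size:
        v = a * x[i] + b
        out[i] = v if v > 0.0 else 0.0

n = 1 << 24
x = torch.randn(n, device="cuda")
out = torch.empty_like(x)

def timed(fn):
    fn()                      # warm-up (JIT compile / kernel cache)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()  # waits on all streams, covers both paths
    return time.perf_counter() - t0

dx, dout = cuda.as_cuda_array(x), cuda.as_cuda_array(out)
threads = 256
blocks = (n + threads - 1) // threads

torch_s = timed(lambda: torch.relu(2.0 * x + 1.0))
numba_s = timed(lambda: fused_axpb_relu[blocks, threads](dx, 2.0, 1.0, dout))
print(f"torch {torch_s*1e3:.2f} ms   numba {numba_s*1e3:.2f} ms")
```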
Repair policy findings
Policy Tier Predicts Protocol Acceptance
Strict
19% protocol acceptance. No repair shim. Baseline.
Wrapper
42% protocol acceptance. Wrapper stripping applied.
Safe Repair
50% protocol acceptance. Full safe repair policy applied.
Monotonic
Repair policy tier predicts acceptance monotonically across all tested configurations.
Implication
Repair policy is a first-class design decision. It is not a fallback — it is a performance variable.
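A sketch of the three tiers as an explicit policy variable. The strip_wrapper,
safe_repair, and accepts_protocol helpers are hypothetical stand-ins for the
actual repair shim, and the JSON-based acceptance check is an assumed protocol,
not the measured one.

```python
# Hypothetical sketch of the three repair tiers as an explicit policy
# variable. Helpers are placeholders for the real shim, not its API.
import json

def strip_wrapper(raw: str) -> str:
    # e.g. remove markdown fences around a JSON payload
    return raw.strip().removeprefix("```json").removesuffix("```").strip()

def safe_repair(raw: str) -> str:
    # e.g. conservative fix: single quotes to double quotes
    return raw.replace("'", '"')

def accepts_protocol(raw: str) -> bool:
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False

def accept(raw: str, tier: str) -> bool:
    if tier == "strict":
        return accepts_protocol(raw)
    if tier == "wrapper":
        return accepts_protocol(strip_wrapper(raw))
    if tier == "safe_repair":
        return accepts_protocol(safe_repair(strip_wrapper(raw)))
    raise ValueError(tier)

sample = "```json\n{'answer': 42}\n```"
for tier in ("strict", "wrapper", "safe_repair"):
    print(f"{tier:12s} -> {accept(sample, tier)}")
```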
One-line thesis
What These Papers Are Actually Saying
The organized stack wins in identifiable regimes. It loses in identifiable regimes too.
Knowing both boundaries is what separates domain expertise from benchmark optimism.