Nineteen primary papers on grounded domain systems, orchestration architecture, agentic operating discipline, ML evaluation benchmarks, operator infrastructure, measurement integrity, and applied production evidence
The finding: for grounded domain tasks — well-defined task classes with deterministic substrates — harness configuration is the binding constraint. Model identity is not. These papers prove the claim under stress-test conditions: local models, which cannot compensate for a weak harness, converge with frontier models at the level of semantic usefulness once the harness is sufficient. This scope is deliberate. Outside it, model capability matters in ways the framework does not cover.
Local Model Addendum: the Project Phoenix Local Model Details page covers the three supporting papers that feed the orchestration synthesis: TourAgent (1.13), ShowcaseAgent (1.12), and Local Model Role Suitability (1.11).
Boundary Results: the Project Phoenix Boundary Results page covers three papers that map where the organized stack hits its limits: Grounded Agent Failure Is Structurally Determined (1.10), True Ski Chalet Boundary Result (1.14), and When The Organized Stack Loses (1.15).
RVH / ML Evaluation: Rough Volatility as ML Benchmark covers Papers 1.8 and 1.9 — why domain expertise, not ML capability, is the binding constraint in rough volatility forecasting and the cross-domain benchmark principle it reveals.
Measurement Integrity, Operator Layer & Applied Evidence: Papers 1.16–1.19 extend the framework outward. Paper 1.16 shows that evaluation infrastructure can fail at the capture boundary — a VT100 terminal artifact was corrupting protocol scores for thinking-mode models. Paper 1.17 documents the operator shell pattern: how OpenClaw wraps Project Phoenix as an access layer without becoming the authority. Paper 1.18 is the framework's first numbered production case — PPR Agent, 92M regulated cardiac device implants across 18 years, behind a deterministic SQLite substrate. Paper 1.19 is a short companion to 1.16 on the other side of the apparatus: when stronger models override literal substrate inspection, capability itself becomes a source of non-neutrality.
Each paper stands alone. Use the cluster that matches your interest:
Start with Paper 1.1 for the framework framing, then try the TourAgent live demo — ten tennis questions with repeatable answers — to see the deterministic approach in action.
Papers 1.2, 1.3, 1.5 form a cluster: offline grounded agent → ski chalet hardware boundary → TSP solver-backed orchestration. The common argument: harness level, not model size, drives usefulness.
Papers 1.5, 1.6, 1.11, 1.12, 1.13 address where correctness should live and how grounding, routing, and repair beat raw power in identifiable regimes.
Papers 1.7, 1.10, 1.14, 1.15 cover the failure taxonomy, empirical failure prediction, the true local ceiling, and the five conditions under which the organized stack's advantage collapses.
Papers 1.8 and 1.9 establish why realized volatility forecasting is high-signal benchmark territory — and what the same structural argument implies across semiconductor defectivity and other rough-process domains.
Papers 1.16, 1.17, and 1.19 address the infrastructure surrounding the Phoenix system. 1.16: capture pipeline failures produce false evaluation verdicts. 1.17: an operator shell can expose the deterministic stack without replacing it as the authority. 1.19: when stronger models override literal substrate inspection, the model itself becomes part of the non-neutrality.
Paper 1.18 is the first numbered production case — PPR Agent running against 18 years of government-mandated cardiac device data. This is field validation, not lane validation — the framework operating against regulated disclosures from three manufacturers.
What makes a local or offline system actually useful — and what the evidence honestly supports.
Paper 1.2
The real unit of local usefulness is the harnessed domain system, not the raw model. A local model becomes operationally useful when paired with a deterministic substrate, a grounding layer, explicit provenance, and a controlled escalation path. Raw local model, grounded local harness, and full local implementation agent are three distinct things — not interchangeable.
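To make the layering concrete, here is a minimal sketch of the grounded answer path, assuming a toy SQLite substrate. The schema, names, and escalation rule are illustrative rather than the framework's API; the point is only that the deterministic lookup runs first, provenance is explicit, and escalation is a flagged event, not a silent fallback.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    text: str          # the answer surfaced to the user
    provenance: str    # which substrate rows backed it
    escalated: bool    # True when the deterministic path could not answer

def build_substrate() -> sqlite3.Connection:
    """Toy deterministic substrate: one table, fixed rows (illustrative schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT)")
    db.executemany("INSERT INTO facts VALUES (?, ?)",
                   [("court_surface", "clay"), ("match_year", "2024")])
    return db

def answer(db: sqlite3.Connection, key: str) -> GroundedAnswer:
    """Deterministic lookup first; escalate only when the substrate is silent."""
    row = db.execute("SELECT value FROM facts WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return GroundedAnswer(row[0], provenance=f"facts.key={key!r}", escalated=False)
    # Controlled escalation path: hand off to a model, flagged as ungrounded.
    return GroundedAnswer("unknown (escalated to model)", provenance="none", escalated=True)

db = build_substrate()
print(answer(db, "court_surface"))   # grounded, with provenance
print(answer(db, "player_height"))   # escalates; the flag is explicit
```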
Paper 1.3
A prepared local 3090 system — Ollama, portable domain harness, and data bundle — can support grounded offline domain answering. The claim is narrow and honest: it is the harness that enables usefulness, not the raw model alone. The variable that matters most is harness level, not model size.
Paper 1.4
Semiconductor fab defectivity should be modeled as a dynamic rough process (RVH — the Rough Volatility Hypothesis), not a static mean. Moving from a stable to an unstable fab produces a 7.1% loss in shippable output — a result that emerges from the path, not the average. Product complexity and process instability are separable causes of yield loss.
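Why the loss emerges from the path rather than the average, in one toy calculation: under the standard Poisson yield model Y = exp(-D*A), two fabs can share the same mean defect density and still ship different output once instability pushes some days over an excursion limit. The scrap rule, constants, and distributions below are illustrative assumptions, not the paper's simulation behind the 7.1% figure.

```python
import numpy as np

rng = np.random.default_rng(0)
A, mean_D, limit = 1.0, 0.5, 0.9   # die area, shared mean defect density, excursion limit

def shippable(D: np.ndarray) -> float:
    """Poisson die yield exp(-D*A) per day; days in excursion (D above the
    spec limit) ship nothing. An illustrative scrap rule, not the paper's."""
    good = np.exp(-D * A)
    good[D > limit] = 0.0
    return float(good.mean())

# Same mean defect density, different path roughness.
# (Clipping at zero nudges the unstable mean slightly upward; the shortfall survives it.)
D_stable   = np.clip(mean_D + 0.05 * rng.standard_normal(100_000), 0, None)
D_unstable = np.clip(mean_D + 0.45 * rng.standard_normal(100_000), 0, None)

print(f"stable fab ships   {shippable(D_stable):.3f}")    # ~0.61
print(f"unstable fab ships {shippable(D_unstable):.3f}")  # ~0.57
```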
Where correctness should live in an AI system — and what happens when it lives in the wrong place.
Paper 1.5
In a route-optimization workflow, correctness should live in the solver, not the model. Stronger models delay failure but do not eliminate the need for solver-backed architecture. Local models range from exact to structurally invalid at small scales and collapse at the world rung; the orchestrated path remains stable across the full ladder.
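What "correctness lives in the solver" means mechanically: a deterministic nearest-neighbor construction plus 2-opt improvement owns the route, and the model's only jobs are extracting the stops and narrating the result. The solver choice here is illustrative; any exact or well-characterized heuristic fills the same architectural slot.

```python
import math, itertools

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbor(points):
    """Deterministic construction: start at stop 0, greedily extend."""
    tour, rest = [0], set(range(1, len(points)))
    while rest:
        nxt = min(rest, key=lambda j: dist(points[tour[-1]], points[j]))
        tour.append(nxt)
        rest.remove(nxt)
    return tour

def two_opt(points, tour):
    """Deterministic improvement: uncross edges until no swap helps."""
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(1, len(tour)), 2):
            a, b = tour[i - 1], tour[i]
            c, d = tour[j], tour[(j + 1) % len(tour)]
            if (dist(points[a], points[c]) + dist(points[b], points[d])
                    < dist(points[a], points[b]) + dist(points[c], points[d])):
                tour[i:j + 1] = reversed(tour[i:j + 1])
                improved = True
    return tour

stops = [(0, 0), (3, 1), (1, 4), (5, 2), (2, 2)]
tour = two_opt(stops, nearest_neighbor(stops))
# The LLM (not shown) only parses the request into `stops` and phrases `tour`;
# route correctness lives entirely in the two functions above.
print(tour)
```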
Paper 1.6
Once hardware is good enough, the organized operating stack, not raw model size, settles the outcome. TourAgent, ShowcaseAgent, and Local Model Role Suitability together support a boundary claim: grounding, routing, and repair beat raw power in identifiable regimes.
The standards, supervision structures, and failure taxonomy that make agentic work trustworthy.
Paper 1.1
Project Phoenix is best understood as an open-core framework for grounded domain systems — not a single agent or benchmark story. Useful agentic systems require domain grounding, explicit validation, clear trust boundaries, and operating discipline. Standards, not prompt optimism.
Paper 1.7
Agentic coding successes vary widely; failures recur in recognizable families. Drift, summit fever, bad context selection, false success, doom loops, and premature closure are documented across Project Phoenix operations. The practical response is standards, supervision, and lessons learned — not blind faith in scaling alone.
Three empirical papers feeding the orchestration synthesis — grounded reliability, routing, and role suitability at portfolio scale.
Paper 1.13
Grounding removes wrong-or-missing answers before it creates artifact-level precision. The local model screen result holds across model families once a deterministic substrate is in the path.
Paper 1.12
Routing and compression are the first reliable local-LLM win at portfolio scale. Miss families are design signals, not capability failures — they identify where the harness, not the model, needs attention.
Paper 1.11
Grounded response quality is largely model-family-independent once a deterministic substrate is in the path. The binding variable is harness configuration, not model identity.
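A sketch of the routing-and-miss-family shape these three papers share, using hypothetical route names and miss families (the real taxonomy is the papers'). The design point: a miss is logged against a harness component, so the fix lands in the harness rather than in model selection.

```python
from collections import Counter

# Hypothetical substrate and routes, for illustration only.
SUBSTRATE = {"total implants": "92,000,000", "years covered": "18"}
miss_log: Counter = Counter()

def route(query: str) -> str:
    q = query.lower().strip()
    if q in SUBSTRATE:                 # deterministic lookup route
        return SUBSTRATE[q]
    if len(q.split()) > 30:            # compression route: summarize, then answer
        return "route=compress-then-answer"
    # A miss. Log its family so the fix lands in the harness, not the model.
    family = "coverage-gap" if "implant" in q else "routing-gap"
    miss_log[family] += 1
    return "route=escalate"

print(route("total implants"))
print(route("implant failures by region"))  # coverage-gap: substrate lacks that table
print(dict(miss_log))
```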
Where the organized stack's advantage collapses — and why failure family is predictable from configuration, not query content.
Paper 1.10
Failure family is predictable from harness configuration features — not query content — confirming that domain expertise is the binding constraint. Empirically confirmed on 780 labeled rows from two Project Phoenix domains.
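What "predictable from configuration, not query content" means operationally, as a hedged sketch on synthetic rows: a shallow decision tree trained on configuration features recovers the failure family, while the same tree on query-content features cannot. The features, labels, and generating rule are stand-ins, not the papers' 780 labeled rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 780  # same row count as the papers, but the rows themselves are synthetic

# Hypothetical configuration features: [has_substrate, has_router, context_budget]
config = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.uniform(0.0, 1.0, n),
])
query_noise = rng.uniform(0.0, 1.0, (n, 3))  # stand-in for query-content features

# Synthetic rule in the spirit of the claim: failure family is a function
# of the configuration alone; query content carries no signal.
family = np.where(config[:, 0] == 0, 0,            # no substrate -> grounding miss
          np.where(config[:, 1] == 0, 1,           # no router    -> routing miss
           np.where(config[:, 2] < 0.3, 2, 3)))    # tight budget -> truncation

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
print("config features:", cross_val_score(clf, config, family, cv=5).mean())
print("query features: ", cross_val_score(clf, query_noise, family, cv=5).mean())
# ~1.0 vs ~majority-class baseline: by construction the signal lives in the configuration.
```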
Paper 1.14
Capability is not the local-only ceiling; operational speed on derived queries is. The true boundary separates what the harness can answer from what it cannot — not strong model from weak model.
Paper 1.15
Maps the five failure modes under which the organized stack's advantage collapses or inverts: latency ceiling (coordination overhead consumes the time budget), coverage gap (harness design failures invisible to stronger models), optimization maturity gap (PyTorch beats fused Numba CUDA by 5.5×), runtime mismatch (the ROCm wheel lacks a gfx1151 target), and policy/role mismatch (a larger model loses to a better-fit smaller model in the specific regime).
Realized volatility forecasting as high-signal ML benchmark territory — and the cross-domain principle it reveals.
Paper 1.8
Both financial volatility and semiconductor defectivity satisfy the same four conditions for high-signal ML benchmark territory. The cross-domain parallel is structural, not analogical — the same rough-path argument applies to both.
Paper 1.9
Realized volatility forecasting is a high-signal benchmark because naive pipeline failures are structural, not tunable. Empirically confirmed: a standard LSTM fails on realized volatility in a way that reveals domain ignorance, not hyperparameter sensitivity.
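For orientation, the standard roughness diagnostic from the rough-volatility literature: m(q, Δ) = E[|log σ(t+Δ) − log σ(t)|^q] scales as Δ^(qH), so regressing log m on log Δ recovers the Hurst exponent H. The sketch sanity-checks the estimator on Brownian motion, where the true H is 0.5; empirical realized-volatility series come out near H ≈ 0.1 in the literature, which is the "rough" in RVH.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in log-volatility path: Brownian motion, whose true Hurst H is 0.5.
# Real realized-vol series are much rougher (H ~ 0.1 in the literature).
log_vol = np.cumsum(rng.standard_normal(20_000)) * 0.01

def hurst(series: np.ndarray, q: float = 2.0, lags=range(1, 30)) -> float:
    """Estimate H from the scaling m(q, lag) ~ lag**(q*H)."""
    m = [np.mean(np.abs(series[lag:] - series[:-lag]) ** q) for lag in lags]
    slope = np.polyfit(np.log(list(lags)), np.log(m), 1)[0]
    return slope / q

print(f"estimated H: {hurst(log_vol):.2f}")   # ~0.50 for Brownian motion
```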
When the evaluation infrastructure itself fails — or when the model's own disposition toward the substrate becomes part of the apparatus.
Paper 1.16
Subprocess capture of ollama run output includes VT100 cursor-rewrite sequences that corrupt multi-line JSON for thinking-mode models, producing systematic false negatives. Under clean REST API capture, gemma4:31b passes all six protocol probes — the strongest result on this lane. The selective recovery pattern (only thinking-mode models were affected) proves the failure sat at the capture boundary, not the model boundary.
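The two capture paths side by side, as a sketch. The escape-stripping regex is the standard CSI matcher, and the REST call uses Ollama's documented /api/generate endpoint; the corrupted transcript is a constructed example, and whether stripping rescues a real one depends on the exact rewrite pattern, which is why the paper's fix is moving the capture boundary rather than patching terminal output.

```python
import re, json, urllib.request

# Path 1 (fragile): a subprocess transcript of `ollama run` can interleave
# VT100 cursor-rewrite sequences with the text, corrupting multi-line JSON.
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[@-~]")     # standard CSI escape matcher
raw = '{"answer":\x1b[2K\x1b[1G "42"}'            # constructed corruption example
print(json.loads(ANSI_CSI.sub("", raw)))          # stripping may rescue this one

# Path 2 (clean): capture at the API boundary, where no terminal is involved.
# Requires a local Ollama server; the model tag is the paper's.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": "gemma4:31b",
                     "prompt": "Return {\"answer\": \"42\"} as JSON.",
                     "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The completion arrives under the "response" key, free of terminal artifacts.
    print(json.loads(resp.read())["response"])
```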
Paper 1.19
Stronger models do not remove the need for harnesses; sometimes they increase it. When semantic correction overrides literal substrate inspection, a more capable model can produce a worse answer than a smaller or less opinionated one. A ten-prompt local matrix and a single-prompt strawperry probe show at least three distinct wrong-count mechanisms. The fix is not a smarter model — it is a harness that preserves the exact substrate and routes literal operations to deterministic tools.
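The strawperry failure in miniature: a model that helpfully corrects the substrate ends up counting letters in a string nobody asked about. The routing rule below is a toy, but the shape is the paper's fix: preserve the exact substrate and send literal operations to a deterministic tool.

```python
def count_letter(substrate: str, letter: str) -> int:
    """Deterministic literal operation over the exact substrate."""
    return substrate.count(letter)

substrate = "strawperry"    # the probe string: deliberately not "strawberry"

# A semantically helpful model may silently correct the substrate first
# (illustrative behavior, not a real model call):
normalized = "strawberry"
print(normalized.count("p"))           # 0 — answers a question nobody asked
print(count_letter(substrate, "p"))    # 1 — the literal answer
```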
Building an operator-facing outer layer over the deterministic stack — and keeping it outside the authority boundary.
Field validation, not lane validation — the framework operating against regulated data in a real domain.
| # | Title | Track | Site |
|---|---|---|---|
| Primary Papers — 1.1 through 1.7 | | | |
| 1.1 | Project Phoenix — Open-Core Standards | Framework | project-phoenix/ |
| 1.2 | Offline Grounded Domain Agent | Grounding | offline-agent/ |
| 1.3 | Ski Chalet Harness Boundary | Grounding | ski-chalet/ |
| 1.4 | Fab Simulation & RVH | Grounding | fab-rvh/ |
| 1.5 | LocalLLMTSP — Solver-Backed Orchestration | Orchestration | local-llm-tsp/ |
| 1.6 | Where Orchestration Beats Raw Model Power | Orchestration | orchestration/ |
| 1.7 | Agentic Coding Failure Patterns | Operations | agentic-coding/ |
| RVH — 1.8 and 1.9 | | | |
| 1.8 | Rough Volatility — Cross-Domain Benchmark Principle | RVH / ML Eval | rough-volatility/ |
| 1.9 | Rough Volatility — ML Evaluation Domain | RVH / ML Eval | rough-volatility/ |
| Boundary & Details — 1.10 through 1.15 | | | |
| 1.10 | Grounded Agent Failure Is Structurally Determined | Boundary | failure-details/ |
| 1.11 | Local Model Role Suitability | Local Model | local-model-role-suitability/ |
| 1.12 | ShowcaseAgent Routing And Compression | Local Model | details/ |
| 1.13 | TourAgent Local Model Screen | Local Model | details/ |
| 1.14 | True Ski Chalet Boundary Result | Boundary | failure-details/ |
| 1.15 | When The Organized Stack Loses | Boundary | failure-details/ |
| Measurement Integrity — 1.16 and 1.19 | | | |
| 1.16 | The Model Did Not Fail the Protocol. The Terminal Did. | Measurement | capture-integrity/ |
| 1.19 | Literal Substrate Inspection — When Stronger Models Override the Evidence | Measurement | capture-integrity/ |
| Operator Layer — 1.17 | | | |
| 1.17 | The Operator Shell Pattern | Operator Layer | operator-shell/ |
| Applied / Production Evidence — 1.18 | | | |
| 1.18 | PPR Agent — A Deterministic Substrate for Auditable Medical-Device Intelligence | Applied | ppr-agent/ |
All sites live at proto.efehnconsulting.com. Papers 1.8–1.9 share the rough-volatility site; 1.12–1.13 share the details site; 1.10/1.14/1.15 share the failure-details site; 1.16 and 1.19 share the capture-integrity site. Paper 1.17 has a dedicated site at operator-shell/. Paper 1.18 has a dedicated site at ppr-agent/.