Nineteen primary papers on grounded domain systems, orchestration architecture, agentic operating discipline, ML evaluation benchmarks, operator infrastructure, measurement integrity, and applied production evidence
The finding: for grounded domain tasks — well-defined task classes with deterministic substrates — harness configuration is the binding constraint. Model identity is not. These papers prove the claim under stress-test conditions: local models, which cannot compensate for a weak harness, converge with frontier models at the level of semantic usefulness once the harness is sufficient. This scope is deliberate. Outside it, model capability matters in ways the framework does not cover.
Local Model Addendum: the Project Phoenix Local Model Details page covers the three supporting papers that feed the orchestration synthesis: TourAgent (1.13), ShowcaseAgent (1.12), and Local Model Role Suitability (1.11).
Boundary Results: the Project Phoenix Boundary Results page covers three papers that map where the organized stack hits its limits: Grounded Agent Failure Is Structurally Determined (1.10), True Ski Chalet Boundary Result (1.14), and When The Organized Stack Loses (1.15).
RVH / ML Evaluation: Rough Volatility as ML Benchmark covers Papers 1.8 and 1.9 — why domain expertise, not ML capability, is the binding constraint in rough volatility forecasting and the cross-domain benchmark principle it reveals.
Measurement Integrity, Operator Layer & Applied Evidence: Papers 1.16–1.19 extend the framework outward. Paper 1.16 shows that evaluation infrastructure can fail at the capture boundary — a VT100 terminal artifact was corrupting protocol scores for thinking-mode models. Paper 1.17 documents the operator shell pattern: how OpenClaw wraps Project Phoenix as an access layer without becoming the authority. Paper 1.18 is the framework's first numbered production case — PPR Agent, 92M regulated cardiac device implants across 18 years, behind a deterministic SQLite substrate. Paper 1.19 is a short companion to 1.16 on the other side of the apparatus: when stronger models override literal substrate inspection, capability itself becomes a source of non-neutrality.
Each paper stands alone. Use the cluster that matches your interest:
Start with Paper 1.1 for the framework framing, then try the TourAgent live demo — ten tennis questions with repeatable answers — to see the deterministic approach in action.
Papers 1.2, 1.3, 1.5 form a cluster: offline grounded agent → ski chalet hardware boundary → TSP solver-backed orchestration. The common argument: harness level, not model size, drives usefulness.
Papers 1.5, 1.6, 1.11, 1.12, 1.13 address where correctness should live and how grounding, routing, and repair beat raw power in identifiable regimes.
Papers 1.7, 1.10, 1.14, 1.15 cover the failure taxonomy, empirical failure prediction, the true local ceiling, and the five conditions under which the organized stack's advantage collapses.
Papers 1.8 and 1.9 establish why realized volatility forecasting is high-signal benchmark territory — and what the same structural argument implies across semiconductor defectivity and other rough-process domains.
Papers 1.16, 1.17, and 1.19 address the infrastructure surrounding the Phoenix system. 1.16: capture pipeline failures produce false evaluation verdicts. 1.17: an operator shell can expose the deterministic stack without replacing it as the authority. 1.19: when stronger models override literal substrate inspection, the model itself becomes part of the non-neutrality.
Paper 1.18 is the first numbered production case — PPR Agent running against 18 years of government-mandated cardiac device data. This is field validation, not lane validation — the framework operating against regulated disclosures from three manufacturers.
What makes a local or offline system actually useful — and what the evidence honestly supports.
Paper 1.2
The real unit of local usefulness is the harnessed domain system, not the raw model. A local model becomes operationally useful when paired with a deterministic substrate, a grounding layer, explicit provenance, and a controlled escalation path. Raw local model, grounded local harness, and full local implementation agent are three distinct things — not interchangeable.
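To make the layering concrete, here is a minimal sketch of the grounded answer path, assuming a toy SQLite substrate. The schema, names, and escalation rule are illustrative rather than the framework's API; the point is only that the deterministic lookup runs first, provenance is explicit, and escalation is a flagged event, not a silent fallback.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    text: str          # the answer surfaced to the user
    provenance: str    # which substrate rows backed it
    escalated: bool    # True when the deterministic path could not answer

def build_substrate() -> sqlite3.Connection:
    """Toy deterministic substrate: one table, fixed rows (illustrative schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT)")
    db.executemany("INSERT INTO facts VALUES (?, ?)",
                   [("court_surface", "clay"), ("match_year", "2024")])
    return db

def answer(db: sqlite3.Connection, key: str) -> GroundedAnswer:
    """Deterministic lookup first; escalate only when the substrate is silent."""
    row = db.execute("SELECT value FROM facts WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return GroundedAnswer(row[0], provenance=f"facts.key={key!r}", escalated=False)
    # Controlled escalation path: hand off to a model, flagged as ungrounded.
    return GroundedAnswer("unknown (escalated to model)", provenance="none", escalated=True)

db = build_substrate()
print(answer(db, "court_surface"))   # grounded, with provenance
print(answer(db, "player_height"))   # escalates; the flag is explicit
```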
Paper 1.3
A prepared local 3090 system — Ollama, portable domain harness, and data bundle — can support grounded offline domain answering. The claim is narrow and honest: it is the harness that enables usefulness, not the raw model alone. The variable that matters most is harness level, not model size.
Paper 1.4
Semiconductor fab defectivity should be modeled as a dynamic rough process (RVH — the Rough Volatility Hypothesis), not a static mean. Moving from a stable to an unstable fab produces a 7.1% loss in shippable output — a result that emerges from the path, not the average. Product complexity and process instability are separable causes of yield loss.
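Why the loss emerges from the path rather than the average, in one toy calculation: under the standard Poisson yield model Y = exp(-D*A), two fabs can share the same mean defect density and still ship different output once instability pushes some days over an excursion limit. The scrap rule, constants, and distributions below are illustrative assumptions, not the paper's simulation behind the 7.1% figure.

```python
import numpy as np

rng = np.random.default_rng(0)
A, mean_D, limit = 1.0, 0.5, 0.9   # die area, shared mean defect density, excursion limit

def shippable(D: np.ndarray) -> float:
    """Poisson die yield exp(-D*A) per day; days in excursion (D above the
    spec limit) ship nothing. An illustrative scrap rule, not the paper's."""
    good = np.exp(-D * A)
    good[D > limit] = 0.0
    return float(good.mean())

# Same mean defect density, different path roughness.
# (Clipping at zero nudges the unstable mean slightly upward; the shortfall survives it.)
D_stable   = np.clip(mean_D + 0.05 * rng.standard_normal(100_000), 0, None)
D_unstable = np.clip(mean_D + 0.45 * rng.standard_normal(100_000), 0, None)

print(f"stable fab ships   {shippable(D_stable):.3f}")    # ~0.61
print(f"unstable fab ships {shippable(D_unstable):.3f}")  # ~0.57
```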
Where correctness should live in an AI system — and what happens when it lives in the wrong place.
Paper 1.5
In a route-optimization workflow, correctness should live in the solver, not the model. Stronger models delay failure but do not eliminate the need for solver-backed architecture. Local models range from exact to structurally invalid at small scales and collapse at the world rung; the orchestrated path remains stable across the full ladder.
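What "correctness lives in the solver" means mechanically: a deterministic nearest-neighbor construction plus 2-opt improvement owns the route, and the model's only jobs are extracting the stops and narrating the result. The solver choice here is illustrative; any exact or well-characterized heuristic fills the same architectural slot.

```python
import math, itertools

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbor(points):
    """Deterministic construction: start at stop 0, greedily extend."""
    tour, rest = [0], set(range(1, len(points)))
    while rest:
        nxt = min(rest, key=lambda j: dist(points[tour[-1]], points[j]))
        tour.append(nxt)
        rest.remove(nxt)
    return tour

def two_opt(points, tour):
    """Deterministic improvement: uncross edges until no swap helps."""
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(1, len(tour)), 2):
            a, b = tour[i - 1], tour[i]
            c, d = tour[j], tour[(j + 1) % len(tour)]
            if (dist(points[a], points[c]) + dist(points[b], points[d])
                    < dist(points[a], points[b]) + dist(points[c], points[d])):
                tour[i:j + 1] = reversed(tour[i:j + 1])
                improved = True
    return tour

stops = [(0, 0), (3, 1), (1, 4), (5, 2), (2, 2)]
tour = two_opt(stops, nearest_neighbor(stops))
# The LLM (not shown) only parses the request into `stops` and phrases `tour`;
# route correctness lives entirely in the two functions above.
print(tour)
```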
Paper 1.6
Once hardware is good enough, the organized operating stack, not raw model size, settles the outcome. TourAgent, ShowcaseAgent, and Local Model Role Suitability together support a boundary claim: grounding, routing, and repair beat raw power in identifiable regimes.
The standards, supervision structures, and failure taxonomy that make agentic work trustworthy.
Paper 1.1
Project Phoenix is best understood as an open-core framework for grounded domain systems — not a single agent or benchmark story. Useful agentic systems require domain grounding, explicit validation, clear trust boundaries, and operating discipline. Standards, not prompt optimism.
Paper 1.7
Agentic coding successes vary widely; failures recur in recognizable families. Drift, summit fever, bad context selection, false success, doom loops, and premature closure are documented across Project Phoenix operations. The practical response is standards, supervision, and lessons learned — not blind faith in scaling alone.
Three empirical papers feeding the orchestration synthesis — grounded reliability, routing, and role suitability at portfolio scale.
Paper 1.13
Grounding removes wrong-or-missing answers before it creates artifact-level precision. The local model screen result holds across model families once a deterministic substrate is in the path.
Paper 1.12
Routing and compression are the first reliable local-LLM win at portfolio scale. Miss families are design signals, not capability failures — they identify where the harness, not the model, needs attention.
Paper 1.11
Grounded response quality is largely model-family-independent once a deterministic substrate is in the path. The binding variable is harness configuration, not model identity.
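A sketch of the routing-and-miss-family shape these three papers share, using hypothetical route names and miss families (the real taxonomy is the papers'). The design point: a miss is logged against a harness component, so the fix lands in the harness rather than in model selection.

```python
from collections import Counter

# Hypothetical substrate and routes, for illustration only.
SUBSTRATE = {"total implants": "92,000,000", "years covered": "18"}
miss_log: Counter = Counter()

def route(query: str) -> str:
    q = query.lower().strip()
    if q in SUBSTRATE:                 # deterministic lookup route
        return SUBSTRATE[q]
    if len(q.split()) > 30:            # compression route: summarize, then answer
        return "route=compress-then-answer"
    # A miss. Log its family so the fix lands in the harness, not the model.
    family = "coverage-gap" if "implant" in q else "routing-gap"
    miss_log[family] += 1
    return "route=escalate"

print(route("total implants"))
print(route("implant failures by region"))  # coverage-gap: substrate lacks that table
print(dict(miss_log))
```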
Where the organized stack's advantage collapses — and why failure family is predictable from configuration, not query content.
Paper 1.10
Failure family is predictable from harness configuration features — not query content — confirming that domain expertise is the binding constraint. Empirically confirmed on 780 labeled rows from two Project Phoenix domains.
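What "predictable from configuration, not query content" means operationally, as a hedged sketch on synthetic rows: a shallow decision tree trained on configuration features recovers the failure family, while the same tree on query-content features cannot. The features, labels, and generating rule are stand-ins, not the papers' 780 labeled rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 780  # same row count as the papers, but the rows themselves are synthetic

# Hypothetical configuration features: [has_substrate, has_router, context_budget]
config = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.uniform(0.0, 1.0, n),
])
query_noise = rng.uniform(0.0, 1.0, (n, 3))  # stand-in for query-content features

# Synthetic rule in the spirit of the claim: failure family is a function
# of the configuration alone; query content carries no signal.
family = np.where(config[:, 0] == 0, 0,            # no substrate -> grounding miss
          np.where(config[:, 1] == 0, 1,           # no router    -> routing miss
           np.where(config[:, 2] < 0.3, 2, 3)))    # tight budget -> truncation

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
print("config features:", cross_val_score(clf, config, family, cv=5).mean())
print("query features: ", cross_val_score(clf, query_noise, family, cv=5).mean())
# ~1.0 vs ~majority-class baseline: by construction the signal lives in the configuration.
```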
Paper 1.14
Capability is not the local-only ceiling; operational speed on derived queries is. The true boundary separates what the harness can answer from what it cannot — not strong model from weak model.
Paper 1.15
Maps the five failure modes under which the organized stack's advantage collapses or inverts: latency ceiling (coordination overhead consumes the time budget), coverage gap (harness design failures invisible to stronger models), optimization maturity gap (PyTorch beats fused Numba CUDA by 5.5×), runtime mismatch (the ROCm wheel lacks a gfx1151 target), and policy/role mismatch (a larger model loses to a better-fit smaller model in the specific regime).
Realized volatility forecasting as high-signal ML benchmark territory — and the cross-domain principle it reveals.
Paper 1.8
Both financial volatility and semiconductor defectivity satisfy the same four conditions for high-signal ML benchmark territory. The cross-domain parallel is structural, not analogical — the same rough-path argument applies to both.
Paper 1.9
Realized volatility forecasting is a high-signal benchmark because naive pipeline failures are structural, not tunable. Empirically confirmed: a standard LSTM fails on realized volatility in a way that reveals domain ignorance, not hyperparameter sensitivity.
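For orientation, the standard roughness diagnostic from the rough-volatility literature: m(q, Δ) = E[|log σ(t+Δ) − log σ(t)|^q] scales as Δ^(qH), so regressing log m on log Δ recovers the Hurst exponent H. The sketch sanity-checks the estimator on Brownian motion, where the true H is 0.5; empirical realized-volatility series come out near H ≈ 0.1 in the literature, which is the "rough" in RVH.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in log-volatility path: Brownian motion, whose true Hurst H is 0.5.
# Real realized-vol series are much rougher (H ~ 0.1 in the literature).
log_vol = np.cumsum(rng.standard_normal(20_000)) * 0.01

def hurst(series: np.ndarray, q: float = 2.0, lags=range(1, 30)) -> float:
    """Estimate H from the scaling m(q, lag) ~ lag**(q*H)."""
    m = [np.mean(np.abs(series[lag:] - series[:-lag]) ** q) for lag in lags]
    slope = np.polyfit(np.log(list(lags)), np.log(m), 1)[0]
    return slope / q

print(f"estimated H: {hurst(log_vol):.2f}")   # ~0.50 for Brownian motion
```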
When the evaluation infrastructure itself fails — or when the model's own disposition toward the substrate becomes part of the apparatus.
Paper 1.16
Subprocess capture of ollama run output includes VT100 cursor-rewrite sequences that corrupt multi-line JSON for thinking-mode models, producing systematic false negatives. Under clean REST API capture, gemma4:31b passes all six protocol probes — the strongest result on this lane. The selective recovery pattern (only thinking-mode models were affected) proves the failure sat at the capture boundary, not the model boundary.
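The two capture paths side by side, as a sketch. The escape-stripping regex is the standard CSI matcher, and the REST call uses Ollama's documented /api/generate endpoint; the corrupted transcript is a constructed example, and whether stripping rescues a real one depends on the exact rewrite pattern, which is why the paper's fix is moving the capture boundary rather than patching terminal output.

```python
import re, json, urllib.request

# Path 1 (fragile): a subprocess transcript of `ollama run` can interleave
# VT100 cursor-rewrite sequences with the text, corrupting multi-line JSON.
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[@-~]")     # standard CSI escape matcher
raw = '{"answer":\x1b[2K\x1b[1G "42"}'            # constructed corruption example
print(json.loads(ANSI_CSI.sub("", raw)))          # stripping may rescue this one

# Path 2 (clean): capture at the API boundary, where no terminal is involved.
# Requires a local Ollama server; the model tag is the paper's.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": "gemma4:31b",
                     "prompt": "Return {\"answer\": \"42\"} as JSON.",
                     "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The completion arrives under the "response" key, free of terminal artifacts.
    print(json.loads(resp.read())["response"])
```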
Paper 1.19
Stronger models do not remove the need for harnesses; sometimes they increase it. When semantic correction overrides literal substrate inspection, a more capable model can produce a worse answer than a smaller or less opinionated one. A ten-prompt local matrix and a single-prompt strawperry probe show at least three distinct wrong-count mechanisms. The fix is not a smarter model — it is a harness that preserves the exact substrate and routes literal operations to deterministic tools.
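The strawperry failure in miniature: a model that helpfully corrects the substrate ends up counting letters in a string nobody asked about. The routing rule below is a toy, but the shape is the paper's fix: preserve the exact substrate and send literal operations to a deterministic tool.

```python
def count_letter(substrate: str, letter: str) -> int:
    """Deterministic literal operation over the exact substrate."""
    return substrate.count(letter)

substrate = "strawperry"    # the probe string: deliberately not "strawberry"

# A semantically helpful model may silently correct the substrate first
# (illustrative behavior, not a real model call):
normalized = "strawberry"
print(normalized.count("p"))           # 0 — answers a question nobody asked
print(count_letter(substrate, "p"))    # 1 — the literal answer
```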
Building an operator-facing outer layer over the deterministic stack — and keeping it outside the authority boundary.
Field validation, not lane validation — the framework operating against regulated data in a real domain.
| # | Title | Track | Site |
|---|---|---|---|
| Primary Papers — 1.1 through 1.7 | | | |
| 1.1 | Project Phoenix — Open-Core Standards | Framework | project-phoenix/ |
| 1.2 | Offline Grounded Domain Agent | Grounding | offline-agent/ |
| 1.3 | Ski Chalet Harness Boundary | Grounding | ski-chalet/ |
| 1.4 | Fab Simulation & RVH | Grounding | fab-rvh/ |
| 1.5 | LocalLLMTSP — Solver-Backed Orchestration | Orchestration | local-llm-tsp/ |
| 1.6 | Where Orchestration Beats Raw Model Power | Orchestration | orchestration/ |
| 1.7 | Agentic Coding Failure Patterns | Operations | agentic-coding/ |
| RVH — 1.8 and 1.9 | | | |
| 1.8 | Rough Volatility — Cross-Domain Benchmark Principle | RVH / ML Eval | rough-volatility/ |
| 1.9 | Rough Volatility — ML Evaluation Domain | RVH / ML Eval | rough-volatility/ |
| Boundary & Details — 1.10 through 1.15 | | | |
| 1.10 | Grounded Agent Failure Is Structurally Determined | Boundary | failure-details/ |
| 1.11 | Local Model Role Suitability | Local Model | local-model-role-suitability/ |
| 1.12 | ShowcaseAgent Routing And Compression | Local Model | details/ |
| 1.13 | TourAgent Local Model Screen | Local Model | details/ |
| 1.14 | True Ski Chalet Boundary Result | Boundary | failure-details/ |
| 1.15 | When The Organized Stack Loses | Boundary | failure-details/ |
| Measurement Integrity — 1.16 and 1.19 | | | |
| 1.16 | The Model Did Not Fail the Protocol. The Terminal Did. | Measurement | capture-integrity/ |
| 1.19 | Literal Substrate Inspection — When Stronger Models Override the Evidence | Measurement | capture-integrity/ |
| Operator Layer — 1.17 | | | |
| 1.17 | The Operator Shell Pattern | Operator Layer | operator-shell/ |
| Applied / Production Evidence — 1.18 | | | |
| 1.18 | PPR Agent — A Deterministic Substrate for Auditable Medical-Device Intelligence | Applied | ppr-agent/ |
All sites live at proto.efehnconsulting.com. Papers 1.8–1.9 share the rough-volatility site; 1.12–1.13 share the details site; 1.10/1.14/1.15 share the failure-details site; 1.16 and 1.19 share the capture-integrity site. Paper 1.17 has a dedicated site at operator-shell/. Paper 1.18 has a dedicated site at ppr-agent/.