Project Phoenix · Measurement Integrity · Papers 1.16 and 1.19

The Model Did Not Fail The Protocol. The Terminal Did.

A protocol evaluation score is a joint claim about the model and the measurement apparatus. When the apparatus is not neutral, the score cannot be interpreted as a model-only signal. This paper reports a specific non-neutrality — and what changed when it was fixed. A companion paper (1.19) follows the same principle into the model itself: when a capable model applies semantic correction to a literal substrate, the model becomes part of the non-neutrality.

Capture Infrastructure Is Part of the Result

What was failing

The capture transport, not the model.

ollama run subprocess capture includes VT100 terminal cursor-rewrite sequences alongside model output. These sequences corrupt multi-line JSON at the content level. The corruption is not recoverable by post-hoc stripping: removing the escape codes discards the rewrite semantics they encode. The effect is disproportionate for thinking-mode models, whose long reasoning preambles trigger more terminal rewrite events.

What VT100 Capture Actually Produces

Mechanism

Terminal Sequences in Captured Stdout

When ollama run streams output to a terminal, it updates the display in place using VT100 cursor-movement sequences: ESC[nD (cursor left n columns) followed by ESC[K (erase to end of line). In an interactive terminal, these are processed by the terminal emulator. When output is captured via subprocess.run(..., capture_output=True), the raw escape sequences appear in the stream alongside the text they were intended to overwrite.
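
A minimal sketch makes the failure mode concrete. The byte string below is a constructed illustration of the rewrite pattern described above, not a real capture; the regex is a generic CSI-sequence matcher, not part of the Phoenix runner.

    import re

    # Constructed illustration: a value is written, partially rewound with
    # cursor-left (ESC[6D) and erase-to-end-of-line (ESC[K), then re-emitted.
    captured = "Sloane Stephe\x1b[6D\x1b[KStephens"

    # A terminal emulator replays the rewrite and displays "Sloane Stephens".
    # A post-hoc stripper can only delete the control bytes, not replay them:
    csi = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
    print(csi.sub("", captured))  # -> Sloane StepheStephens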

Flat schemas

Recoverable Corruption

For flat single-line JSON, the corruption pattern is a full restart from the opening brace — a truncated partial followed by the restarted complete JSON. Post-hoc extraction can find the last valid JSON block and recover the correct value. This is why flat-schema probes (PROTO_001, PROTO_004) showed mixed results under the legacy runner rather than total failure.
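
The recovery step can be sketched in a few lines. This is an illustrative implementation of the last-valid-JSON-block idea for the flat restart pattern; extract_last_json is a hypothetical name, not the project's actual extractor.

    import json

    def extract_last_json(text: str):
        """Return the last parseable {...} object in text, or None.

        Targets the flat-schema restart pattern: a truncated partial
        object followed by a complete restarted object.
        """
        decoder = json.JSONDecoder()
        # Try opening braces from the rightmost backwards; the restarted
        # object is the first candidate that parses cleanly.
        for start in range(len(text) - 1, -1, -1):
            if text[start] != "{":
                continue
            try:
                obj, _ = decoder.raw_decode(text[start:])
                return obj
            except json.JSONDecodeError:
                continue
        return None

    # A greedy regex anchored at the first brace would instead return the
    # corrupted span from the truncated partial through the restart.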

Nested and multi-line schemas

Unrecoverable Corruption

For nested JSON, terminal rewrites apply to individual field names mid-string. The result is merged text, for example "Sloane StepheStephens" after stripping escape codes. This is content-level corruption. A stripping-based extractor cannot recover it: once the escape codes are removed, the rewrite semantics are lost. These probes produced systematic false negatives for any thinking-mode model evaluated through a terminal transport.

Why thinking-mode models are disproportionately affected

Longer Output, More Rewrites

Thinking-mode models emit a long reasoning preamble before the final response. The preamble triggers many more terminal rewrite events across a longer buffer. The legacy extractor's greedy regex, anchored from the first brace in the full string, captured a corrupted mega-block spanning the entire preamble rather than just the final JSON response. Non-thinking-mode models emit shorter outputs with fewer rewrites and are less affected — which is why they serve as the control condition.

What the Reparse Showed — Before Any Model Was Rerun

gpt-oss:20b — The Strongest Recovery

gpt-oss:20b recovered +8/12, from 1/12 to 9/12 under the updated extractor alone — before any model was rerun. This is the strongest evidence that the capture confound was systematic, not a gemma4-specific quirk. gpt-oss:20b's prior protocol story was almost entirely a measurement artifact. Its hardware stability issues remain a separate disqualifier, but the protocol-following characterization must be treated as unreliable until a clean-transport run can be completed.

Prior score: 1/12. Corrected reparse: 9/12. No model was rerun.

The Control Condition Holds

The recovery is selective: only models with thinking-mode output show any change under the updated extractor. gemma3:27b, llama3.1:8b, and qwen2.5:14b — all non-thinking-mode — show no score change. If the fix were recovering noise rather than real signal, it would also affect these models. It does not. The selectivity of the recovery is what makes the causal claim defensible.

Non-thinking-mode models: no change. The control holds.

gemma4:26b — Partial Recovery, Real Residual Failures

Even after the extractor fix, gemma4:26b does not fully recover (8/18 vs 18/18). Some failures are genuine. The reparse distinguishes the two failure classes — capture-induced false negatives and real model failures — where the raw scores could not. This is the methodological contribution of the reparse step: not inflating scores, but separating measurement artifact from real behavior.

The reparse does not make models look better than they are. It removes the noise that made them look worse.

The Canonical Clean-Capture Result

gemma4:31b — 6/6

Under clean REST API capture, gemma4:31b passes all six probes without think suppression. This is the strongest result on this lane across any model or configuration tested. Under the legacy terminal capture, this result was invisible — the preamble length triggered enough rewrites to corrupt the extraction consistently.

gemma4:26b — Probe-Specific Residual

gemma4:26b unsuppressed fails only PROTO_001 and PROTO_004 — the two flat-schema probes — under REST API capture. Think suppression via /no_think fixes both completely. The residual issue with unsuppressed 26b is specific to those two probes, not a broad protocol regression. The extension suite (PROTO_011–016) confirms this.

PROTO_010 — A Genuine Content Failure

PROTO_010 remains a genuine failure for gemma3:27b and qwen2.5 regardless of capture method. gemma4 passes it. This probe is retained as a control case demonstrating that clean-capture does not simply inflate scores — it preserves real failures while removing measurement artifacts. The failure is tied to canonicalization under format ambiguity, not to schema class or eligibility set construction.

The Repair — One Function Swap

HTTP POST to /api/generate with stream: false. Clean JSON response body with no terminal codes. The runner change is a one-function swap. The cost is negligible. Any evaluation pipeline still using ollama run subprocess capture for JSON protocol work should be treated as producing capture-confounded results until corrected.
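
A sketch of the swap, assuming Ollama's default local endpoint; the function name and defaults below are illustrative, not the Phoenix runner's actual code.

    import json
    import urllib.request

    def generate(model: str, prompt: str,
                 host: str = "http://localhost:11434") -> str:
        """Non-streaming REST call: clean JSON body, no terminal codes."""
        body = json.dumps({"model": model, "prompt": prompt,
                           "stream": False}).encode()
        req = urllib.request.Request(f"{host}/api/generate", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    # Legacy path being replaced (capture-confounded):
    #   subprocess.run(["ollama", "run", model, prompt],
    #                  capture_output=True, text=True)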

Failures Are Probe-Specific, Not Structural-Class

Finding

All Five Model/Mode Combinations Pass All Six New Probes

Six new probes were designed to test whether failure patterns from the original suite generalize. Three flat-schema probes targeted gemma4:26b unsuppressed (which failed PROTO_001 and PROTO_004). Three multi-entity content probes targeted gemma3:27b (which failed PROTO_010). All pass. The original failures do not replicate on structurally similar probes.

Implication

Schema Class Is Too Coarse a Taxonomy

The relevant unit of difficulty is narrower than flat, nested, or array/mixed. A more useful taxonomy distinguishes: scalar precision, exact-string reproduction, filtered eligibility set construction, and grouped aggregation. PROTO_001 and PROTO_004 failures fit scalar precision. PROTO_010 failures fit exact-string reproduction combined with implicit eligibility. The new probes exercise other categories without that combination — which is why they do not replicate the original failures.

What This Changes

The corrected story is not that gemma4 is a strong protocol model that was wrongly accused. It is that evaluation infrastructure is part of the claim being made when a score is reported. A score from a terminal-capture pipeline and a score from a clean-transport pipeline are not the same measurement — even when run on the same model with the same prompts. The specific mechanism here is not a gemma4-specific issue. It affects any model whose output triggers terminal rewrites in Ollama's streaming display. Any evaluation pipeline still using ollama run subprocess capture for JSON protocol work should be treated as producing capture-confounded results until corrected.

What This Paper Does Not Claim

Scope

One Domain Lane, Six Probes, Single Hardware

The controlled rerun used a single hardware configuration (RTX 3090 desktop). The think-suppression finding for gemma4:26b is based on one run per condition and should be confirmed with a second replicate before being treated as stable.

gpt-oss:20b

Reparse Only — No Clean-Transport Rerun

gpt-oss:20b was not included in the controlled rerun due to documented hardware stability risk. Its corrected scores are based on historical artifacts only. Hardware stability issues remain a separate disqualifier from protocol performance.

The broader claim

Not That gemma4:31b Is Generally Strong

This paper claims only that in this bounded setup, the terminal capture confound was the dominant source of gemma4's apparent protocol failure, and that removing it changes the ranking materially. It does not claim that gemma4:31b is generally strong across all protocol tasks, that think suppression is universally beneficial, or that the REST API transport eliminates all evaluation confounds.

Protocol Trust Requires Capture Integrity

The model did not fail the protocol. The terminal did. The measurement apparatus captured terminal display sequences alongside model output, corrupted the JSON extraction, and produced false negatives that were mistaken for model-capability failures. The fix is to not use a terminal as a capture transport. A harness that misclassifies correct model output as a failure is not a neutral observer. It is part of the result.

Literal Substrate Inspection — When Stronger Models Override the Evidence

Paper 1.16 fixed a capture layer. Paper 1.19 asks the next question: once the apparatus is clean, does the model itself inspect the substrate, or does it reinterpret the substrate as a familiar semantic object? The answer, for literal-substrate tasks, is that more capable models can be less reliable — not because they lack intelligence, but because they apply the wrong kind of intelligence.

Literal Tasks Are Authority-Bound Tasks

The miniature probe

How many p's are there in strawperry?

Three answer paths are available: literal substrate inspection (count the exact string), reasoning trace (decompose and count), or semantic correction (silently normalize strawperry toward strawberry or a memorized benchmark meme, then answer from the wrong object). The correct answer is not a matter of model confidence. It is a matter of which object the system treats as authoritative.

Controlled Matrix — Ten Misspelled Strings, Four Local Models

gemma3:27b — 4/10

Normalized strawperry toward strawberry under the loose prompt and answered incorrectly. Under the tight controlled prompt (REST API, think=false, temperature 0, integer-only output) it still reached only 4/10. Failures spread across semantic normalization and apparent counting hallucination.

gemma4:26b — 9/10

With thinking enabled on the loose prompt, decomposed the literal string and answered 1. With --think=false, answered 1 without visible scratchpad. On the controlled matrix, 9/10. The one failure — raspberrry / r expected 4, answered 3 — is the cleanest semantic-normalization signature in the run.

gemma4:31b — 9/10

Ties with 26b on the tight matrix, but fails on different strings — a tie, not a dominance relationship. Under the loose prompt, answered 2 for strawperry while correctly preserving the adjacent nonsense string strawserry / s. The string-selective failure is the sharpest single-prompt result in the probe.

qwen2.5:14b — 3/10

Answered 0 for blackbery / r, 1 for bananna / n, and 2 for raspberrry / r. These answers match neither the literal string nor the canonical spelling — a different mechanism from gemma3's semantic repair, closer to misreading or tokenization confusion.

At Least Three Wrong-Count Patterns, Not One Failure Class

Mechanism 1

Semantic Normalization Toward a Canonical Form

The cleanest example is raspberrry / r. The literal string has four r characters; canonical raspberry has three. Both gemma4:26b and gemma4:31b answered 3 under the tight controlled prompt. The prompt explicitly named the exact string — the model overrode it anyway. This is the sharpest single row in the matrix.

Mechanism 2

Familiar-String Prior Interference (Meme-Prior Hypothesis)

The loose-prompt gemma4:31b contrast sharpens the same point from the other direction. It answered 2 for strawperry / p but correctly answered 2 for the adjacent nonsense strawserry / s. The model is not globally unable to count. It fails selectively when the string is close to a familiar benchmark or semantic object. The observable fact is that the same model preserves one literal nonsense string while overriding another nearby one.

Mechanism 3

Misreading, Tokenization Confusion, or Counting Hallucination

qwen2.5's 0 for blackbery matches neither the literal nor the canonical count. gemma3's 3 for grapefruut / u (literal 2, canonical 1) fits neither hypothesis cleanly. These are still literal-substrate failures — but the mechanism is character decomposition error or hallucination, not semantic repair. The failure class is broader than "the model corrected your typo."
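
The literal and canonical counts quoted across these three mechanisms can be checked deterministically. The snippet below verifies only the example strings named above, using the same kind of deterministic counting the harness section argues for.

    # Example strings from the three mechanisms: literal vs. canonical counts.
    checks = [
        ("raspberrry", "raspberry",  "r"),  # literal 4, canonical 3
        ("strawperry", "strawberry", "p"),  # literal 1, canonical 0
        ("strawserry", "strawberry", "s"),  # literal 2, canonical 1
        ("blackbery",  "blackberry", "r"),  # literal 1, canonical 2
        ("bananna",    "banana",     "n"),  # literal 3, canonical 2
        ("grapefruut", "grapefruit", "u"),  # literal 2, canonical 1
    ]
    for literal, canonical, ch in checks:
        print(f"{literal:>10} / {ch}: literal={literal.count(ch)}, "
              f"canonical ({canonical})={canonical.count(ch)}")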

The Fix Is Not a Smarter Model. It Is a Harness.

The Right Implementation

For the strawperry prompt, the correct implementation is not to ask the model to be smarter. It is:

"strawperry".count("p")

The model interprets intent. The substrate provides authority.

The Model's Role Is Routing

Recognize that the user is asking for a literal string count. Preserve the exact input string. Call a deterministic counter. Report the counter result without substitution. The same design principle runs across the Phoenix/OpenClaw stack: the model interprets intent; the substrate provides authority.

Routing, not reasoning, is the reliable model role here.

Four Harness Constraints for Literal-Substrate Tasks

Capture: preserve the exact user-provided string or source artifact. Classification: route literal operations to deterministic tools. Execution: perform counting, lookup, parsing, or comparison outside the model. Reporting: require the final answer to cite the preserved substrate and tool result.
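
A minimal sketch of the four constraints for the character-count case; the names below are illustrative, not Phoenix/OpenClaw components. Classification, deciding that the request is a literal operation, is the model's routing decision and appears here only as the choice to call the tool.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LiteralCount:
        substrate: str  # Capture: the exact user-provided string, never a normalized copy
        needle: str
        count: int      # Execution: produced by a deterministic tool, not the model

    def literal_char_count(substrate: str, needle: str) -> LiteralCount:
        # Execution happens outside the model.
        return LiteralCount(substrate, needle, substrate.count(needle))

    # Classification: the model routes "how many p's are there in strawperry?" here.
    result = literal_char_count("strawperry", "p")

    # Reporting: the answer cites the preserved substrate and the tool result.
    print(f'"{result.substrate}".count("{result.needle}") == {result.count}')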

This pattern applies beyond character counting — CSV row lookup, PPR disclosure counts, code symbol search, file path manipulation, schema validation, exact record reproduction, regulated-data summarization.

Literal Tasks Need Literal Substrates

In tasks where the answer is defined by a literal substrate, stronger models can be less reliable unless the harness prevents semantic override. This is deliberately narrower than "larger models are worse." Larger models are often better. The claim is that larger or more capable models may carry stronger priors, stronger repair behavior, and stronger benchmark-meme recall. Those traits help many reasoning tasks but can damage literal fidelity. Without a harness, model intelligence can become semantic interference.

What This Paper Does Not Claim

Not a ranking

gemma4:26b Is Not Generally Better Than gemma4:31b

Under the tight controlled prompt they tie at 9/10 — they simply fail on different strings. The evidence does not support a broad model ranking. It supports a narrower claim about literal fidelity under semantic-prior pressure.

Not a tokenization story alone

Tokenization Is Useful but Insufficient

Tokenization explains why models do not see characters directly. It does not explain the selective, string-specific pattern where a model preserves one nonsense string and overrides another nearby one. At least three distinct wrong-count mechanisms are present in the matrix.

Not about character counting

The Systems Claim Generalizes Beyond Toy Strings

The same failure class appears whenever a model substitutes a plausible object for the actual object — repairing an unusual product name, normalizing an Unknown field into a plausible family name, or answering from a memorized benchmark pattern instead of the current prompt. The surface task differs; the failure is the same.

Related Papers

Papers 1.16 and 1.19 together form the Measurement Integrity cluster: the capture layer and the model's own disposition toward the substrate are both part of the apparatus. The operator shell that surfaced the 1.16 capture finding is documented in Paper 1.17 — The Operator Shell Pattern. The PPR agent — the production application of the substrate-as-authority principle that 1.19 generalizes — is documented as Paper 1.18 — PPR Agent, A Deterministic Substrate for Auditable Medical-Device Intelligence. The broader local model evidence sits in Local Model Details. All nineteen Phoenix papers are in the Research Papers index.