Project Phoenix · Measurement Integrity · Papers 1.16 and 1.19
The Model Did Not Fail The Protocol. The Terminal Did.
A protocol evaluation score is a joint claim about the model and the measurement apparatus.
When the apparatus is not neutral, the score cannot be interpreted as a model-only signal.
This paper reports a specific non-neutrality — and what changed when it was fixed.
A companion paper (1.19) follows the same principle into the model itself: when a capable
model applies semantic correction to a literal substrate, the model becomes part of the
non-neutrality.
The claim
Capture Infrastructure Is Part of the Result
What was being measured
Six grounded protocol probes across three schema shapes — flat, nested, and array/mixed — on a TourAgent benchmark lane. The model is asked to return valid JSON matching a schema exactly.
Three policy tiers apply: strict (exact JSON, no wrapping), wrapper (fenced or lightly
wrapped JSON accepted), safe_repair (broader extraction attempted). Results were reported
per probe and per tier.
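As a rough illustration of the tier semantics, a sketch follows; the tier names come from this paper, but the parsing logic is an assumption for illustration, not the project's actual graders.

    import json
    import re

    def parse_strict(text: str):
        """strict: the reply must be exact JSON with no wrapping."""
        return json.loads(text)

    def parse_wrapper(text: str):
        """wrapper: accept fenced or lightly wrapped JSON."""
        fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
        return json.loads(fenced.group(1) if fenced else text.strip())

    def parse_safe_repair(text: str):
        """safe_repair: broader extraction; last parseable block wins."""
        for cand in reversed(re.findall(r"\{[^{}]*\}", text)):
            try:
                return json.loads(cand)
            except json.JSONDecodeError:
                continue
        raise ValueError("no parseable JSON block found")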
What was failing
The capture transport, not the model.
ollama run subprocess capture includes VT100 terminal cursor-rewrite
sequences alongside model output. These sequences corrupt multi-line JSON at the
content level. The corruption is not recoverable by post-hoc stripping — the characters
are gone. The effect is disproportionate for thinking-mode models, whose long reasoning
preambles trigger more terminal rewrite events.
The failure mode
What VT100 Capture Actually Produces
Mechanism
Terminal Sequences in Captured Stdout
When ollama run streams output to a terminal, it uses VT100 cursor-movement
sequences to update the display in place: ESC[nD (cursor back n columns) followed
by ESC[K (erase to end of line). In an interactive terminal, these are processed
by the terminal emulator. Captured via
subprocess.run(..., capture_output=True), the raw escape sequences
appear alongside the text they were intended to overwrite.
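A minimal reconstruction of the effect, assuming the stream printed a partial name, stepped the cursor back, and reprinted the tail. The buffer and field name are illustrative, but the merged-text outcome matches the nested-schema example below.

    import re

    # Hypothetical captured stdout: the stream printed "Sloane Stephe",
    # moved the cursor back six columns (ESC[6D), cleared to end of line
    # (ESC[K), and reprinted "Stephens" over the old tail.
    captured = '{"winner": "Sloane Stephe\x1b[6D\x1b[KStephens"}'

    # An interactive terminal would render "Sloane Stephens". Post-hoc
    # stripping removes the codes but cannot undo the overwrite they
    # encoded: the duplicated fragment is merged into the content.
    stripped = re.sub(r"\x1b\[[0-9;]*[A-Za-z]", "", captured)
    print(stripped)  # {"winner": "Sloane StepheStephens"}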
Flat schemas
Recoverable Corruption
For flat single-line JSON, the corruption pattern is a full restart from the opening
brace — a truncated partial followed by the restarted complete JSON. Post-hoc
extraction can find the last valid JSON block and recover the correct value.
This is why flat-schema probes (PROTO_001, PROTO_004) showed mixed results
under the legacy runner rather than total failure.
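A sketch of that recovery, under the assumption that the restart leaves a complete object as the final brace-delimited candidate. The function is illustrative, not the project's extractor, and handles only flat single-level JSON.

    import json
    import re

    def last_valid_json(text: str):
        """Return the last parseable {...} block in a stripped capture, or None."""
        # Strip the escape sequences themselves; the restart they encoded
        # has already duplicated text in the buffer.
        stripped = re.sub(r"\x1b\[[0-9;]*[A-Za-z]", "", text)
        # Scan flat brace-delimited candidates from the end: the restarted
        # complete object parses, the truncated partial before it does not.
        for match in reversed(list(re.finditer(r"\{[^{}]*\}", stripped))):
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                continue
        return None

    # Truncated partial, a terminal rewrite, then the full restarted object.
    capture = '{"city": "Par\x1b[2K\r{"city": "Paris", "nights": 3}'
    print(last_valid_json(capture))  # {'city': 'Paris', 'nights': 3}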
Nested and multi-line schemas
Unrecoverable Corruption
For nested JSON, terminal rewrites apply to individual field names mid-string.
The result is merged text — for example, "Sloane StepheStephens"
after stripping escape codes. This is content-level corruption. No extractor
can recover it: the characters are gone. These probes produced systematic
false negatives for any thinking-mode model evaluated through a terminal transport.
Why thinking-mode models are disproportionately affected
Longer Output, More Rewrites
Thinking-mode models emit a long reasoning preamble before the final response.
The preamble triggers many more terminal rewrite events across a longer buffer.
The legacy extractor's greedy regex, anchored from the first brace in the full
string, captured a corrupted mega-block spanning the entire preamble rather than
just the final JSON response. Non-thinking-mode models emit shorter outputs with
fewer rewrites and are less affected — which is why they serve as the control condition.
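The contrast is visible in the anchoring alone. A schematic comparison with a stand-in preamble follows; the greedy pattern is the generic form assumed here for illustration, not a quote from the legacy runner.

    import re

    output = (
        'Thinking: the schema has fields {a, b}... more reasoning...\n'
        '{"a": 1, "b": 2}'
    )

    # Legacy-style greedy anchor: first brace to last brace spans the
    # whole preamble, so any rewrite damage inside it poisons the block.
    greedy = re.search(r"\{.*\}", output, re.DOTALL).group()

    # Anchoring on the last candidate isolates the final response.
    final = re.findall(r"\{[^{}]*\}", output)[-1]

    print(len(greedy), len(final))  # mega-block vs. the 16-char response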
Primary result
What the Reparse Showed — Before Any Model Was Rerun
gpt-oss:20b — The Strongest Recovery
gpt-oss:20b recovered +8/12, from 1/12 to 9/12 under the updated extractor alone —
before any model was rerun. This is the strongest evidence that the capture confound
was systematic, not a gemma4-specific quirk. gpt-oss:20b's prior protocol story was
almost entirely a measurement artifact. Its hardware stability issues remain a
separate disqualifier, but the protocol-following characterisation must be treated
as unreliable until a clean-transport run can be completed.
Prior score: 1/12. Corrected reparse: 9/12. No model was rerun.
The Control Condition Holds
The recovery is selective: only models with thinking-mode output show any change
under the updated extractor. gemma3:27b, llama3.1:8b, and qwen2.5:14b — all
non-thinking-mode — show no score change. If the fix were recovering noise rather
than real signal, it would also affect these models. It does not. The selectivity
of the recovery is what makes the causal claim defensible.
Non-thinking-mode models: no change. The control holds.
gemma4:26b — Partial Recovery, Real Residual Failures
Even after the extractor fix, gemma4:26b does not fully recover, scoring 8/18 rather than 18/18.
Some failures are genuine. The reparse distinguishes the two failure classes —
capture-induced false negatives and real model failures — where the raw scores
could not. This is the methodological contribution of the reparse step: not
inflating scores, but separating measurement artifact from real behavior.
The reparse does not make models look better than they are. It removes the noise that made them look worse.
Controlled rerun — REST API capture
The Canonical Clean-Capture Result
gemma4:31b — 6/6
Under clean REST API capture, gemma4:31b passes all six probes without think suppression.
This is the strongest result on this lane across any model or configuration tested.
Under the legacy terminal capture, this result was invisible — the preamble length
triggered enough rewrites to corrupt the extraction consistently.
gemma4:26b — Probe-Specific Residual
gemma4:26b unsuppressed fails only PROTO_001 and PROTO_004 — the two flat-schema probes —
under REST API capture. Think suppression via /no_think fixes both completely.
The residual issue with unsuppressed 26b is specific to those two probes, not a broad
protocol regression. The extension suite (PROTO_011–016) confirms this.
PROTO_010 — A Genuine Content Failure
PROTO_010 remains a genuine failure for gemma3:27b and qwen2.5:14b regardless of capture
method. gemma4 passes it. This probe is retained as a control case demonstrating that
clean-capture does not simply inflate scores — it preserves real failures while removing
measurement artifacts. The failure is tied to canonicalization under format ambiguity,
not to schema class or eligibility set construction.
The Repair — One Function Swap
The repair is an HTTP POST to /api/generate with stream: false, which returns a
clean JSON response body with no terminal codes. The runner change is a
one-function swap; the cost is negligible. Any evaluation pipeline still using
ollama run subprocess capture for JSON protocol work should be treated
as producing capture-confounded results until corrected.
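A minimal sketch of the swapped-in function, assuming Ollama's default local port and the documented non-streaming /api/generate response shape:

    import json
    import urllib.request

    def generate(model: str, prompt: str) -> str:
        """Clean-transport capture: JSON over HTTP, no terminal in the path."""
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps(
                {"model": model, "prompt": prompt, "stream": False}
            ).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            # stream: false returns a single JSON body; the "response"
            # field holds the model text with no VT100 sequences to strip.
            return json.load(resp)["response"]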
Extension — PROTO_011–016
Failures Are Probe-Specific, Not Structural-Class
Finding
All Five Model/Mode Combinations Pass All Six New Probes
Six new probes were designed to test whether failure patterns from the original suite
generalize. Three flat-schema probes targeted gemma4:26b unsuppressed (which failed
PROTO_001 and PROTO_004). Three multi-entity content probes targeted gemma3:27b
(which failed PROTO_010). All pass. The original failures do not replicate on
structurally similar probes.
Implication
Schema Class Is Too Coarse a Taxonomy
The relevant unit of difficulty is narrower than flat, nested, or array/mixed.
A more useful taxonomy distinguishes: scalar precision, exact-string reproduction,
filtered eligibility set construction, and grouped aggregation. PROTO_001 and PROTO_004
failures fit scalar precision. PROTO_010 failures fit exact-string reproduction combined
with implicit eligibility. The new probes exercise other categories without that
combination — which is why they do not replicate the original failures.
Main interpretation
What This Changes
The corrected story is not that gemma4 is a strong protocol model that was wrongly accused.
It is that evaluation infrastructure is part of the claim being made when a score is reported.
A score from a terminal-capture pipeline and a score from a clean-transport pipeline are not
the same measurement — even when run on the same model with the same prompts.
The specific mechanism here is not a gemma4-specific issue. It affects any model whose output
triggers terminal rewrites in Ollama's streaming display, and it leaves any pipeline still
capturing ollama run through a subprocess producing capture-confounded results
until corrected.
Honest limits
What This Paper Does Not Claim
Scope
One Domain Lane, Six Probes, Single Hardware
All results come from one domain lane and six core probes, and the controlled rerun
used a single hardware configuration (RTX 3090 desktop). The think-suppression
finding for gemma4:26b is based on one run per condition and should be confirmed
with a second replicate before being treated as stable.
gpt-oss:20b
Reparse Only — No Clean-Transport Rerun
gpt-oss:20b was not included in the controlled rerun due to documented hardware
stability risk. Its corrected scores are based on historical artifacts only.
Hardware stability issues remain a separate disqualifier from protocol performance.
The broader claim
Not That gemma4:31b Is Generally Strong
This paper claims only that in this bounded setup, the terminal capture confound
was the dominant source of gemma4's apparent protocol failure, and that removing it
changes the ranking materially. It does not claim that gemma4:31b is generally strong
across all protocol tasks, that think suppression is universally beneficial, or that
the REST API transport eliminates all evaluation confounds.
Conclusion
Protocol Trust Requires Capture Integrity
The model did not fail the protocol. The terminal did. The measurement apparatus captured
terminal display sequences alongside model output, corrupted the JSON extraction, and
produced false negatives that were mistaken for model-capability failures. The fix is
simple: do not use a terminal as a capture transport. A harness that misclassifies correct model
output as a failure is not a neutral observer. It is part of the result.
Companion paper · Project Phoenix · Paper 1.19
Literal Substrate Inspection — When Stronger Models Override the Evidence
Paper 1.16 fixed a capture layer. Paper 1.19 asks the next question: once the
apparatus is clean, does the model itself inspect the substrate, or does it reinterpret
the substrate as a familiar semantic object? The answer, for literal-substrate tasks,
is that more capable models can be less reliable — not because they lack intelligence,
but because they apply the wrong kind of intelligence.
1.19 · The claim
Literal Tasks Are Authority-Bound Tasks
Core takeaway
Stronger models do not remove the need for harnesses. Sometimes they increase it.
When semantic correction overrides literal substrate inspection, a more capable model
can produce a worse answer than a smaller or less opinionated model. The ground truth
for a literal task is the substrate itself — the byte string, the database row, the
file path, the schema, the counter, or the deterministic tool result. If the model
reinterprets that substrate as a familiar semantic object, capability becomes a liability.
The miniature probe
How many p's are there in strawperry?
Three answer paths are available: literal substrate inspection (count the exact string),
reasoning trace (decompose and count), or semantic correction (silently normalize
strawperry toward strawberry or a memorized benchmark meme, then
answer from the wrong object). The correct answer is not a matter of model confidence.
It is a matter of which object the system treats as authoritative.
1.19 · Observed local results
Controlled Matrix — Ten Misspelled Strings, Four Local Models
gemma3:27b — 4/10
Normalized strawperry toward strawberry under the loose prompt and
answered incorrectly. Under the tight controlled prompt (REST API, think=false,
temperature 0, integer-only output) it still reached only 4/10. Failures spread across
semantic normalization and apparent counting hallucination.
gemma4:26b — 9/10
With thinking enabled on the loose prompt, decomposed the literal string and answered
1. With --think=false, answered 1 without visible
scratchpad. On the controlled matrix, 9/10. The one failure — raspberrry /
r expected 4, answered 3 — is the cleanest semantic-normalization signature
in the run.
gemma4:31b — 9/10
Ties with 26b on the tight matrix, but fails on different strings — a tie, not a dominance
relationship. Under the loose prompt, answered 2 for strawperry
while correctly preserving the adjacent nonsense string strawserry / s.
The string-selective failure is the sharpest single-prompt result in the probe.
qwen2.5:14b — 3/10
Answered 0 for blackbery / r, 1 for
bananna / n, and 2 for raspberrry /
r. These answers match neither the literal string nor the canonical spelling —
a different mechanism from gemma3's semantic repair, closer to misreading or tokenization
confusion.
1.19 · Interpretation
At Least Three Wrong-Count Patterns, Not One Failure Class
Mechanism 1
Semantic Normalization Toward a Canonical Form
The cleanest example is raspberrry / r. The literal string has
four r characters; canonical raspberry has three. Both gemma4:26b
and gemma4:31b answered 3 under the tight controlled prompt. The prompt explicitly
named the exact string — the model overrode it anyway. This is the sharpest single row in the
matrix.
Mechanism 2
Familiar-String Prior Interference (Meme-Prior Hypothesis)
The loose-prompt gemma4:31b contrast sharpens the same point from the other direction.
It answered 2 for strawperry / p, whose literal count is 1, yet
correctly answered 2 for the adjacent nonsense string strawserry / s. The
model is not globally unable to count. It fails selectively when the string is close to a
familiar benchmark or semantic object. The observable fact is that the same model preserves
one literal nonsense string while overriding another nearby one.
Mechanism 3
Misreading, Tokenization Confusion, or Counting Hallucination
qwen2.5's 0 for blackbery matches neither the literal nor the canonical
count. gemma3's 3 for grapefruut / u (literal 2,
canonical 1) fits neither hypothesis cleanly. These are still literal-substrate
failures — but the mechanism is character decomposition error or hallucination, not semantic
repair. The failure class is broader than "the model corrected your typo."
1.19 · Phoenix framing
The Fix Is Not a Smarter Model. It Is a Harness.
The Right Implementation
For the strawperry prompt, the correct implementation is not to ask the model
to be smarter. It is:
"strawperry".count("p")
The model interprets intent. The substrate provides authority.
The Model's Role Is Routing
Recognize that the user is asking for a literal string count. Preserve the exact input
string. Call a deterministic counter. Report the counter result without substitution.
The same design principle runs across the Phoenix/OpenClaw stack: the model interprets
intent; the substrate provides authority.
Routing, not reasoning, is the reliable model role here.
Four Harness Constraints for Literal-Substrate Tasks
Capture: preserve the exact user-provided string or source artifact.
Classification: route literal operations to deterministic tools.
Execution: perform counting, lookup, parsing, or comparison outside the model.
Reporting: require the final answer to cite the preserved substrate and tool result.
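A compact sketch of the four constraints applied to the counting case; the routing predicate and report format are illustrative assumptions, not Phoenix/OpenClaw code.

    def count_chars(substrate: str, needle: str) -> int:
        """Execution: deterministic counting outside the model."""
        return substrate.count(needle)

    def answer_literal_count(user_string: str, char: str) -> str:
        # Capture: preserve the exact user-provided string.
        substrate = user_string
        # Classification: a literal count routes to a deterministic tool.
        result = count_chars(substrate, char)
        # Reporting: the answer cites the preserved substrate and tool result.
        return f'count("{char}") over "{substrate}" = {result}'

    print(answer_literal_count("strawperry", "p"))
    # count("p") over "strawperry" = 1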
This pattern applies beyond character counting — CSV row lookup, PPR disclosure counts,
code symbol search, file path manipulation, schema validation, exact record reproduction,
regulated-data summarization.
1.19 · Main claim
Literal Tasks Need Literal Substrates
In tasks where the answer is defined by a literal substrate, stronger models can be less
reliable unless the harness prevents semantic override. This is deliberately narrower than
"larger models are worse." Larger models are often better. The claim is that larger or more
capable models may carry stronger priors, stronger repair behavior, and stronger
benchmark-meme recall. Those traits help many reasoning tasks but can damage literal fidelity.
Without a harness, model intelligence can become semantic interference.
1.19 · Honest limits
What This Paper Does Not Claim
Not a ranking
gemma4:26b Is Not Generally Better Than gemma4:31b
Under the tight controlled prompt they tie at 9/10 — they simply fail on different strings.
The evidence does not support a broad model ranking. It supports a narrower claim about
literal fidelity under semantic-prior pressure.
Not a tokenization story alone
Tokenization Is Useful but Insufficient
Tokenization explains why models do not see characters directly. It does not explain the
selective, string-specific pattern where a model preserves one nonsense string and
overrides another nearby one. At least three distinct wrong-count mechanisms are present
in the matrix.
Not about character counting
The Systems Claim Generalizes Beyond Toy Strings
The same failure class appears whenever a model substitutes a plausible object for the
actual object — repairing an unusual product name, normalizing an Unknown
field into a plausible family name, or answering from a memorized benchmark pattern
instead of the current prompt. The surface task differs; the failure is the same.