Paper · frozen

Smarter, Faster, and Bounded by Handoff Discipline

Handoff-discipline doctrine for strict machine-facing local-model lanes. Frozen 2026-04-29 with the corrected matrix, the three-datapoint drift addendum, DocDrop as the positive deployment case, and a three-run PPR Lane 2 evidence packet as the adversarial complement.

Canonical source: docs/GEMMA_4_IS_SMARTER_GEMMA_3_IS_SAFER.md

Status

Frozen 2026-04-29. Promoted from active draft on the basis of:

Thesis

Frontier-class local models like Gemma 4 are smarter and faster than their predecessors, but their default verbose-reasoning posture breaks rigid machine-to-machine JSON handoffs. The operational lesson is not to avoid the smarter model. It is to enforce **handoff discipline** at the deployment boundary. With /no_think suppression and a schema-bound output contract, the same gemma4:26b that previously appeared dangerous clears 6/6 strict-protocol probes and can be made safe for bounded deployment surfaces through strict receive-side discipline.

This is engineering doctrine, not a bug discovery.

How We Arrived Here

1. The original false read

The first bounded protocol probe of gemma4:26b returned 0/6 across three desktop runs. Every output landed as non_json. Read at face value, this looked like a within-family regression: gemma3 was passing 3/6–5/6 on the same harness; gemma4 had collapsed to 0/6 while running fast and producing clean bundles.

That read prompted the original framing of this paper line: same family, same harness, same lane — and the newer model couldn't stay inside the protocol. The provocative "more dangerous" question mark followed naturally.

2. The capture-pipeline finding

On 2026-04-15, the harness was instrumented to compare ollama run subprocess capture against the Ollama REST API directly. The legacy path was capturing raw terminal output including VT100 cursor-rewrite sequences (e.g. \x1b[6D\x1b[K) that Ollama uses for its streaming display. Those codes corrupted multi-line JSON at the content level — irrecoverable by post-hoc ANSI stripping.
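Why post-hoc stripping cannot recover the payload can be seen in a small sketch. The byte stream below is a made-up illustration, not a captured Ollama trace: it assumes the streaming display prints a partial token, moves the cursor back, erases to end of line, and reprints the full token. The field name and value are hypothetical.

```python
import json
import re

# Matches CSI escape sequences such as \x1b[6D (cursor back) and \x1b[K (erase line).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")


def strip_ansi(text: str) -> str:
    """Remove CSI sequences without replaying their effect on the screen."""
    return ANSI_CSI.sub("", text)


# Hypothetical capture: "Rav" is printed, the cursor moves back 3 columns,
# the line is erased, and the full token "Ravens" is reprinted.
raw_capture = '{"team": "Rav\x1b[3D\x1b[KRavens"}'

stripped = strip_ansi(raw_capture)
parsed = json.loads(stripped)
print(parsed["team"])
```

The stripped text still parses, but the value is no longer what the model produced: the erased fragment survives because stripping deletes the control codes without replaying them. With multi-line output the same effect can also break the JSON structure itself.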

Models without thinking-mode preambles (gemma3, qwen2.5) were comparatively immune because their output streamed in a more linear shape. Thinking-mode models (gemma4, gpt-oss) took the worst hits because the preamble-then-answer interleave amplified the corruption.

The fix was a one-line transport switch: /api/generate with stream: false. The legacy ollama_run path remains, gated behind a flag, for reproduction.
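A minimal sketch of the clean transport, using only the Python standard library. The endpoint is Ollama's default local address; the helper names `build_payload` and `generate` are hypothetical and are not taken from the harness.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if the server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Request body for a single non-streamed completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str, url: str = OLLAMA_URL,
             timeout: float = 300.0) -> str:
    """POST the payload and return the model text from the `response` field."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With `stream` set to false the server returns one JSON object, so the harness never sees terminal redraw codes. Calling `generate("gemma4:26b", "...")` requires a running Ollama server with the model pulled.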

The original 0/6 measurement was filed upstream as google-deepmind/gemma#604. The corrected matrix in §3 below was posted as a follow-up comment to that issue on 2026-04-16; the issue remains open pending an upstream determination on whether unsuppressed gemma4:26b flat-schema discipline is expected behavior, or whether the recommended default for machine-facing JSON use cases is suppression.

3. The corrected matrix (2026-04-15 clean capture)

Same harness, same probes, clean capture:

| Model | Mode | Pass |
| --- | --- | --- |
| gemma3:27b | | 5/6 |
| qwen2.5:14b | | 4/6 |
| gemma4:26b | unsuppressed | 4/6 |
| gemma4:26b | /no_think suppressed | **6/6** |
| gemma4:31b | unsuppressed | **6/6** |
| gemma4:31b | /no_think suppressed | **6/6** |

The 0/6 disappeared entirely. Unsuppressed gemma4:26b actually scores 4/6; the two failing probes (PROTO_001, PROTO_004) are flat-schema cases where the thinking preamble interleaves with JSON output even via clean transport. Suppression closes that gap completely.

PROTO_010 is the inverse signal: both gemma3:27b and qwen2.5:14b fail it as a genuine multi-entity-reasoning content failure. gemma4 passes it. That single probe is the cleanest available evidence for "smarter" — it isolates reasoning quality from protocol compliance.

2026-04-29 dated companion table (Ollama 0.20.0)

A second clean-capture matrix run on 2026-04-29 produced this companion table. The setup is identical to the April-15 run; only the calendar date and Ollama state have moved:

| Model | Mode | 2026-04-15 | 2026-04-29 |
| --- | --- | --- | --- |
| gemma3:27b | | 5/6 | 5/6 |
| qwen2.5:14b | | 4/6 | 4/6 |
| gemma4:26b | unsuppressed | 4/6 | 5/6 |
| gemma4:26b | /no_think suppressed | **6/6** | **6/6** |
| gemma4:31b | unsuppressed | **6/6** | _rerun pending (see note)_ |
| gemma4:31b | /no_think suppressed | **6/6** | _rerun pending (see note)_ |

> Note on gemma4:31b rows: the 2026-04-29 rerun against gemma4:31b was
> initiated as part of this matrix run but had not completed at freeze time.
> gemma4:31b is a ceiling reference for this paper, not the deployment lane;
> the bounded thesis rests on the gemma4:26b rows above. The fortnightly
> regression cadence will pick up the missing 31b rows on its next scheduled
> run; if a result lands earlier, this table will be updated post-freeze.

The April-15 framing of gemma4:26b unsuppressed as a 4/6 lane was already softened by the 2026-04-28 regression run (6/6) and is softened further by this 2026-04-29 row (5/6). The unsuppressed lane is consistent only in the sense that it consistently moves. The suppressed lane stays at 6/6 across both dates and is the contract production should anchor on.

gemma3:27b and qwen2.5:14b are stable across the two runs and continue to serve as the content-failure controls for PROTO_010.

The Handoff Discipline Solution

Two operational levers, both deployable today:

Lever 1: thinking-mode suppression

The /no_think prompt prefix tells gemma4 to skip its reasoning preamble and emit the answer directly. It is one token at the top of the prompt. On strict-protocol probes it moves gemma4:26b from 4/6 to 6/6.

That was the measured 2026-04-15 result. A later scheduled regression probe on 2026-04-28 under Ollama 0.20.0 found that the unsuppressed gap had closed on this machine: gemma4:26b scored 6/6 both unsuppressed and suppressed. The operational lesson is therefore slightly stronger than the April-15 matrix alone suggested: suppression is a **defensive default**, not an untouchable law of nature. The contract that matters is not "always suppress" but "probe the lane often enough to detect drift before production trust is affected."

Lever 2: schema-bound output contract

The Ollama REST API exposes a format: "json" constraint that forces decoded output to be parseable JSON. Combined with an explicit schema in the prompt and a typed validator on the receiving side, the deployment surface becomes:


prompt: /no_think + schema  →  format: "json" enforced at decode  →
strict typed validator at receive  →  exit 2 (parse) or 3 (schema) on failure

A failure at any stage is observable, not a silent corruption of downstream state. The model is allowed to be verbose if it wants to — the deployment boundary is what stays disciplined.
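The stages can be sketched as follows. This is a minimal illustration, not the harness code: the helper names, the prompt wording, and the representation of the schema as a field-to-type mapping are all assumptions. Only the /no_think prefix, the `format: "json"` constraint, and the exit codes 2 and 3 come from the text.

```python
import json

EXIT_OK, EXIT_PARSE, EXIT_SCHEMA = 0, 2, 3


def build_request(model: str, task: str, schema_hint: str) -> dict:
    """Prompt carries /no_think plus the schema; format="json" constrains decoding."""
    prompt = (
        f"/no_think\n{task}\n"
        f"Return only JSON matching this schema:\n{schema_hint}"
    )
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}


def receive(raw: str, spec: dict) -> tuple:
    """Return (exit_code, payload); payload is None on any failure."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return EXIT_PARSE, None          # stage failure: not parseable JSON
    if not isinstance(payload, dict):
        return EXIT_SCHEMA, None         # parseable, but not an object
    for field, expected_type in spec.items():
        if not isinstance(payload.get(field), expected_type):
            return EXIT_SCHEMA, None     # missing field or wrong type
    return EXIT_OK, payload
```

Keeping parse failures and schema failures on separate exit codes lets the caller tell "the model did not emit JSON" apart from "the model emitted JSON with the wrong shape" without inspecting the payload.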

Drift Addendum (2026-04-15 → 2026-04-28 → 2026-04-29, Ollama 0.20.0)

Three dated runs are now on record. The unsuppressed lane has moved on every run. The suppressed lane has stayed at 6/6 across all three. That is the load-bearing observation behind the doctrine.

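| Model | Mode | 2026-04-15 | 2026-04-28 | 2026-04-29 |
| --- | --- | --- | --- | --- |
| gemma4:26b | unsuppressed | 4/6 | 6/6 | 5/6 |
| gemma4:26b | /no_think suppressed | 6/6 | 6/6 | 6/6 |
| gemma3:27b | unsuppressed | 5/6 | 5/6 | 5/6 |
| qwen2.5:14b | unsuppressed | 4/6 | _not installed_ | 4/6 |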

Interpretation:

The 2026-04-29 failing probe in the unsuppressed run was PROTO_001, where gemma4:26b emitted decimal win percentages (0.556, 0.513) instead of percentage values (55.6, 51.3). The output was clean JSON — the model simply chose a different value contract than the schema asked for. That is the exact failure mode strict receive-side validation is designed to catch.
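A value-contract check of this kind might look like the sketch below. The field name `win_pct` and the range rule are assumptions made for illustration; the actual PROTO_001 schema is not reproduced here. The rule treats anything at or below 1.0 as a fraction, which would also reject a genuine percentage below 1, so a real validator may need a different test.

```python
def is_percentage(value) -> bool:
    """Accept numbers in percentage units (e.g. 55.6); reject fractions (e.g. 0.556)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return 1.0 < value <= 100.0


def validate_win_pct(payload: dict) -> int:
    """Return 0 if every value is a percentage, 3 (schema failure) otherwise."""
    values = payload.get("win_pct")  # hypothetical field name
    if not isinstance(values, list) or not values:
        return 3
    return 0 if all(is_percentage(v) for v in values) else 3
```

The decimal values from the failing run parse cleanly as JSON and would pass a type-only check; only a range or unit check turns them into an observable exit 3.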

The upstream variable across these runs is Ollama 0.20.0. This is not enough evidence to assign causality narrowly to Ollama itself; it is enough to state that the live deployment surface keeps changing under fixed model and version labels.

The doctrine therefore freezes as follows: keep /no_think as the production default for gemma4:26b, but treat it as a **defensive default backed by a fortnightly regression probe**, not as a timeless requirement. The next probe is due 2026-05-13. The exact run command lives in docs/GEMMA_PROTOCOL_VERIFICATION_RUNBOOK.md.
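The drift observation and the probe cadence reduce to simple bookkeeping. The sketch below copies the gemma4:26b pass counts from the drift table; the function names are hypothetical and the runbook's actual command is not reproduced.

```python
from datetime import date, timedelta

# Pass counts out of 6 for gemma4:26b, copied from the drift table.
RUNS = {
    "unsuppressed": {"2026-04-15": 4, "2026-04-28": 6, "2026-04-29": 5},
    "/no_think suppressed": {"2026-04-15": 6, "2026-04-28": 6, "2026-04-29": 6},
}


def has_drifted(scores: dict) -> bool:
    """A lane has drifted if its pass count differs between any two runs."""
    return len(set(scores.values())) > 1


def next_probe_due(last_run: date, cadence_days: int = 14) -> date:
    """Fortnightly cadence: the next probe is due 14 days after the last run."""
    return last_run + timedelta(days=cadence_days)


for lane, scores in RUNS.items():
    print(lane, "drifted" if has_drifted(scores) else "stable")
print(next_probe_due(date(2026, 4, 29)))
```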

Two Bounded Cases

The doctrine is supported by two bounded cases that exercise it from opposite directions.

The two cases are deliberately asymmetric. Case 1 demonstrates the doctrine at work in a successful deployment; Case 2 demonstrates why the doctrine is necessary in the first place. Together they bracket the same operational claim from above and below.

Case 1. DocDrop

The DocDrop privacy-doc analysis pipeline is the production-grade document extraction case for the handoff-discipline thesis. The use case is bounded but real:

The map step uses gemma4:26b with /no_think and format: "json". A receive-side validator type-checks five required fields (document_title, executive_summary, key_action_items, meeting_participants, sentiment_tone). The orchestrator advances its progress file only on RC=0; parse failures (exit 2) and schema failures (exit 3) are first-class signals — the orchestrator skips the document without corrupting the run.
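A sketch of that receive side and the progress rule follows. The five field names are from the text; the Python types assigned to them, the function names, and the shape of the progress record are assumptions, not the DocDrop implementation.

```python
import json

# Field names are from the text; the expected types are assumptions.
REQUIRED_FIELDS = {
    "document_title": str,
    "executive_summary": str,
    "key_action_items": list,
    "meeting_participants": list,
    "sentiment_tone": str,
}


def validate_map_output(raw: str) -> int:
    """Return 0 on success, 2 on parse failure, 3 on schema failure."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return 2
    if not isinstance(payload, dict):
        return 3
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected):
            return 3
    return 0


def run_batch(outputs: dict) -> dict:
    """outputs maps document id -> raw model text; progress advances only on RC=0."""
    progress = {"done": [], "skipped": {}}
    for doc_id, raw in outputs.items():
        rc = validate_map_output(raw)
        if rc == 0:
            progress["done"].append(doc_id)
        else:
            progress["skipped"][doc_id] = rc  # recorded, run continues
    return progress
```

A failed document is recorded with its exit code and the loop moves on, so one bad output never blocks or contaminates the rest of the run.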

This is exactly the deployment shape the thesis predicts. The smarter model now runs in a privacy-safe lane it would have been disqualified from under the original false read. The pipeline has been run and verified end-to-end against real meeting documents.

Case 2. PPR Lane 2 Tool Dispatch

The second bounded case is a local PPR_Agent query-parser surface where the model is not allowed to answer from memory. It emits one JSON tool-call object for a deterministic SQLite-backed medical-device substrate.

The strict deployment shape is the same:

This case matters because it exercises a different kind of machine-facing handoff than DocDrop. The output is not a five-field document summary. It is a bounded action object whose only job is to select a deterministic tool and typed argument set without corrupting the authority boundary.

This is an adversarial bounded handoff case, not a clean-success case. The strict PPR Lane 2 surface is implemented, evidenced, and validator-correct. The model-facing side is not "production-stable" in the DocDrop sense: across three runs of the same eight probes on the same Ollama version, gemma4:26b emits a different malformed parameter object on a different probe each time. That instability is the point — it is exactly what the receive-side validator exists to catch, and the captured packet shows it doing so on every observed failure at RC=3.

The packet exercises the eight canonical Lane 2 probes through ppr_ollama.py's strict surface and records, for each probe: the raw model response, the validated tool-call payload (or null on rejection), the normalized parameters, the exit code, and the latency.
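One way to picture a per-probe record is the dataclass below. The attribute names and the JSON serialization are assumptions for illustration; the actual artifact layout under the results directory is not shown here. The `passed` helper assumes a probe passes when its exit code matches the expected exit.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProbeRecord:
    probe_id: str
    raw_response: str                  # model text exactly as received
    tool_call: Optional[dict]          # validated payload, None on rejection
    normalized_params: Optional[dict]  # parameters after normalization
    exit_code: int                     # 0 ok, 2 parse failure, 3 schema failure
    latency_s: float

    def passed(self, expected_exit: int) -> bool:
        """Assumed pass rule: observed exit code equals the expected one."""
        return self.exit_code == expected_exit

    def to_json(self) -> str:
        """Serialize the record; None fields become JSON null."""
        return json.dumps(asdict(self), sort_keys=True)


record = ProbeRecord(
    probe_id="PPR_LANE2_008",
    raw_response='{"tool": "get_top_devices"}',  # illustrative, not a captured response
    tool_call=None,
    normalized_params=None,
    exit_code=3,
    latency_s=1.0,
)
print(record.to_json())
```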

The eight probes cover the parameter coverage matrix for the six legacy PPR tools, including one deliberate negative case:

| Probe | Coverage | Expected exit |
| --- | --- | --- |
| PPR_LANE2_001 | scalars + optional category | 0 |
| PPR_LANE2_002 | range + optional category | 0 |
| PPR_LANE2_003 | list + range + alias normalization | 0 |
| PPR_LANE2_004 | optional integers only | 0 |
| PPR_LANE2_005 | range + multi-word company alias | 0 |
| PPR_LANE2_006 | string search term | 0 |
| PPR_LANE2_007 | full optional surface for get_top_devices | 0 |
| PPR_LANE2_008 | invalid company in list (negative case) | 3 |
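The kind of check these probes exercise can be sketched as a strict tool-call validator. Only the tool name get_top_devices comes from the text. The parameter names, the company list, the alias table, and the `tool`/`parameters` envelope are hypothetical stand-ins, not the ppr_ollama.py surface.

```python
import json

# Hypothetical allow-lists for illustration only.
TOOL_PARAMS = {
    "get_top_devices": {"companies": list, "device_category": str, "limit": int},
}
KNOWN_COMPANIES = {"acme medical", "example devices"}
COMPANY_ALIASES = {"acme": "acme medical"}


def validate_tool_call(raw: str) -> tuple:
    """Return (exit_code, normalized_params); params are None on rejection."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return 2, None
    if not isinstance(call, dict):
        return 3, None
    tool, params = call.get("tool"), call.get("parameters")
    if not isinstance(tool, str) or tool not in TOOL_PARAMS:
        return 3, None
    if not isinstance(params, dict):
        return 3, None
    allowed = TOOL_PARAMS[tool]
    normalized = {}
    for name, value in params.items():
        expected = allowed.get(name)
        # Unknown parameter names and wrong types are both schema failures.
        if expected is None or isinstance(value, bool) or not isinstance(value, expected):
            return 3, None
        normalized[name] = value
    if "companies" in normalized:
        resolved = []
        for company in normalized["companies"]:
            if not isinstance(company, str):
                return 3, None
            key = company.strip().lower()
            key = COMPANY_ALIASES.get(key, key)  # alias normalization
            if key not in KNOWN_COMPANIES:
                return 3, None                   # negative case: unknown company
            resolved.append(key)
        normalized["companies"] = resolved
    return 0, normalized
```

Rejecting unknown parameter names outright, rather than ignoring them, is what turns a renamed key into an observable exit 3 instead of a silently dropped argument.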

Three dated runs were captured during freeze (2026-04-29, Ollama 0.20.0; the UTC run stamps cross midnight into 2026-04-30):

| Run | Stamp | Pass rate | Notes |
| --- | --- | --- | --- |
| 1 | ppr_lane2_evidence_20260429T235929Z | 4/8 | exposed two prompt/validator inconsistencies (category vs device_category, query vs search_term); fixed in ppr_ollama.py SYSTEM_PROMPT |
| 2 | ppr_lane2_evidence_20260430T000100Z | 5/8 | post-fix; remaining failures all genuine model-side schema drift |
| 3 | ppr_lane2_evidence_20260430T000304Z | 3/8 | post-fix; same probe set, different failure distribution |

Across the three runs, the validator never returned a false positive — every failure was either a real prompt-design bug (run 1) or a genuine model-side schema mutation (runs 2–3). The 2026-04-30 model-side failure modes the validator caught include:

These are exactly the failure modes a deterministic dispatch surface must defend against. Without strict receive-side validation, every one of them would either crash the SQLite layer or, worse, dispatch with the wrong typed arguments and return data that looked plausible but was wrong.

The headline observation is therefore not the pass rate — it is that the boundary is correctly engineered (every malformed payload was rejected with an observable exit code, and the deterministic substrate was never reached with one) while the model-facing side keeps mutating machine-facing parameters often enough that the validator is load-bearing rather than decorative. That is the bounded operational claim the paper makes for Case 2: the boundary works, and the boundary is necessary, both demonstrated on the same packet.

Per-probe artifacts and manifests live under domains/DemoAgents/PPR_Agent/benchmark/results/. The runner is scripts/ppr_lane2_evidence_run.py and the canonical probe set lives at domains/DemoAgents/PPR_Agent/benchmark/queries/ppr_lane2_canonical_probes.json.

What This Paper Is Not

What This Paper Is

What Would Strengthen The Paper Further

The freeze-time strengtheners (regression probe, dated matrix rerun, PPR evidence packet) all landed and are referenced inline above. The remaining items are post-freeze strengtheners — useful, but not load-bearing for the bounded thesis as frozen:

References

Published as part of the Bulkhead τ release line. Paper inventory: /papers/.