Harness First
Thinking-mode models require clean-output capture. Terminal subprocess capture is not canonical for protocol evaluation.
The current protocol-comparison line is a harness finding before it is a model ranking.
Capturing output from a legacy `ollama run` subprocess overstated thinking-mode protocol failures. Clean capture through the REST API changes the ranking materially. This is the core finding of Paper 1.16.
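As a rough illustration of what clean REST API capture can look like, here is a minimal Python sketch. It is not the harness used in Paper 1.16. It assumes a local Ollama server at its default address and the `/api/generate` endpoint with streaming disabled. The helper names (`build_payload`, `extract_answer`, `capture`) are hypothetical, and mapping the "suppressed" mode to the request field `think: false` is an assumption, since the text does not say how suppression was implemented.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server; adjust for your setup.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt, suppress_thinking=False):
    """Build a non-streaming generate request (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if suppress_thinking:
        # Assumption: "suppressed" mode corresponds to disabling thinking
        # via the `think` request field on models that support it.
        payload["think"] = False
    return payload


def extract_answer(body):
    """Return only the final answer text from a JSON response body.

    Reading the `response` field keeps any separately reported reasoning
    text and all terminal rendering out of the graded output.
    """
    data = json.loads(body)
    return data.get("response", "")


def capture(model, prompt, suppress_thinking=False, url=OLLAMA_URL, timeout=300):
    """POST the request and return the clean answer string."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(model, prompt, suppress_thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_answer(resp.read().decode("utf-8"))
```

The point of the sketch is that the grader sees one structured field from a JSON body, not text scraped from a terminal session, so reasoning output and display artifacts cannot be mistaken for the protocol answer.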
| Model | Mode | Capture | Pass | Current read |
|---|---|---|---|---|
| gemma3:27b | unsuppressed | ollama_api | 5/6 | Strong baseline; PROTO_010 remains a content miss. |
| qwen2.5:14b | unsuppressed | ollama_api | 4/6 | Useful contrast; weaker than gemma3:27b and gemma4:31b in this slice. |
| gemma4:26b | unsuppressed | ollama_api | 4/6 | Flat-schema gap remains without suppression. |
| gemma4:26b | suppressed | ollama_api | 6/6 | Full pass; suppression resolves the flat-schema issue. |
| gemma4:31b | unsuppressed | ollama_api | 6/6 | Strongest current local protocol lane. |
| gemma4:31b | suppressed | ollama_api | 6/6 | Same result; suppression not required. |
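To make the effect of suppression explicit, the pass counts from the table can be compared per model. The following is a small sketch that uses only the figures reported above; the names `RESULTS` and `suppression_delta` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# (model, mode, passed, total), copied from the table above.
RESULTS = [
    ("gemma3:27b", "unsuppressed", 5, 6),
    ("qwen2.5:14b", "unsuppressed", 4, 6),
    ("gemma4:26b", "unsuppressed", 4, 6),
    ("gemma4:26b", "suppressed", 6, 6),
    ("gemma4:31b", "unsuppressed", 6, 6),
    ("gemma4:31b", "suppressed", 6, 6),
]


def suppression_delta(results):
    """Return passes gained under suppression for models run in both modes."""
    by_model = defaultdict(dict)
    for model, mode, passed, _total in results:
        by_model[model][mode] = passed
    return {
        model: modes["suppressed"] - modes["unsuppressed"]
        for model, modes in by_model.items()
        if "suppressed" in modes and "unsuppressed" in modes
    }


print(suppression_delta(RESULTS))
# {'gemma4:26b': 2, 'gemma4:31b': 0}
```

Only gemma4:26b gains from suppression (two additional passes); gemma4:31b passes all six cases in either mode.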
gemma4:31b is the strongest current local result in this protocol lane under corrected capture. The earlier severe regression story was a measurement artifact.
PROTO_010 remains a genuine content failure, independent of the capture problem: not every miss was a harness artifact.
| Paper | Role | Current relevance |
|---|---|---|
| Paper 1.16 | Primary paper | Defines the capture-integrity correction. The full inventory remains in portfolio order; this paper is featured because it changes the current protocol line. |
| Operator Shell Pattern | Architecture paper | Explains how model-comparison packets fit into the OpenClaw outer layer without crossing the authority boundary. |