Paper · revised

Orchestration Is Cheaper Than Reasoning

For models with extensive reasoning capacity, the computational cost of finding the answer often exceeds the cost of explaining it. The TSP and Scheduling (ORCHESTRATION-001) benchmarks reported below show orchestration running roughly 2× to 24× faster than direct reasoning on all but the smallest (4-city) TSP instance, with reliability that is never worse and, in the REST API runs, markedly better.

Canonical source: docs/ORCHESTRATION_IS_CHEAPER_THAN_REASONING.md

**Status:** revised 2026-05-04 — TSP evidence rebuilt from two verified runs (gemma4:26b pre-cap and gemma4:31b post-cap). The original frozen draft contained a fabricated TSP table; see Revision Note below.

**Paper group:** Local LLM Operator Judgment

Working Claim

For reasoning-pattern local models on hardware with finite VRAM headroom, the computational cost of "finding the answer" exceeds the cost of "explaining the answer" by an order of magnitude. In agentic systems with a deterministic oracle available, giving the model the oracle's result to orchestrate is materially faster and *at least* as reliable as asking the model to solve the problem from first principles. Scaling the local model up does not help, and in this regime makes things measurably worse.
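As a minimal sketch of the two lanes being compared, the Python below pairs a brute-force exact solver (standing in for the deterministic oracle) with two prompt builders. The function names, the prompt wording, and the choice of brute force are illustrative assumptions, not the benchmark's actual harness.

```python
from itertools import permutations
from math import dist


def solve_tsp_exact(coords):
    """Brute-force oracle: return (tour, length) of the shortest closed tour.

    City 0 is fixed as the start, so (n-1)! orderings are tried. That is
    workable for the 4 to 10 city instances discussed here, not beyond.
    """
    n = len(coords)
    best_tour, best_len = None, float("inf")
    for rest in permutations(range(1, n)):
        tour = (0, *rest)
        length = sum(
            dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n)
        )
        if length < best_len:
            best_tour, best_len = tour, length
    return list(best_tour), best_len


def format_cities(coords):
    """Render the instance as one 'index: (x, y)' line per city."""
    return "\n".join(f"{i}: ({x}, {y})" for i, (x, y) in enumerate(coords))


def direct_prompt(coords):
    """Direct lane: the model must find the tour itself."""
    return (
        "Find the shortest closed tour visiting every city exactly once.\n"
        f"Cities:\n{format_cities(coords)}"
    )


def orchestrated_prompt(coords):
    """Orchestrated lane: the oracle solves, the model only explains."""
    tour, length = solve_tsp_exact(coords)
    route = " -> ".join(str(c) for c in tour)
    return (
        f"A solver found the optimal tour: {route} (length {length:.2f}).\n"
        "Report this tour to the user and briefly explain it.\n"
        f"Cities:\n{format_cities(coords)}"
    )
```

In the orchestrated lane the model never searches the tour space; its output only has to restate a result that is already known to be optimal.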

Findings from gemma4:26b (pre-cap) and gemma4:31b (post-cap, default 32k context) on RTX 3090, Arizona TSP ladder:

Evidence Anchor 1a: gemma4:26b Arizona Ladder (pre-cap)

gemma4:26b (MoE, ~17 GB resident, default Ollama quantization) on a single RTX 3090, 332W default power limit. Source: domains/ExperimentalAgents/LocalLLMTSP/GEMMA4_BENCHMARK_NOTE.md.

| Instance | Cities | Direct time | Orch time | Direct gap | Speedup |
| :--- | :--- | :--- | :--- | :--- | :--- |
| tsp-001 | 4 | 7.06s | 10.17s | exact tie | 0.7× (orch slower) |
| tsp-002 | 5 | 100.85s | 10.38s | exact tie | 9.7× |
| tsp-003 | 6 | 123.82s | 11.32s | exact tie | 10.9× |
| tsp-004 | 8 | 181.31s | 10.58s | exact tie | 17.1× |
| tsp-005 | 10 | 292.82s | 12.43s | exact tie | 23.6× |

Evidence Anchor 1b: gemma4:31b Arizona Ladder (post-cap, spilled)

gemma4:31b (Q4_K_M, 19.9 GB weights + 32k context KV-cache → 31 GB total, 22.4 GB on GPU and 6.7 GB on CPU at default Ollama settings) on the same RTX 3090, 220W power cap. Wall-clock for the full 5-fixture run: **117 min**. Source: docs/domain_runs/TSP_3090/ (results.jsonl, telemetry CSV, run log).

| Instance | Cities | Direct time | Orch time | Direct gap | Within-31B speedup | vs 26B Direct | vs 26B Orch |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| tsp-001 | 4 | 151.02s | 122.38s | exact tie | 1.2× | 21.4× slower | 12.0× slower |
| tsp-002 | 5 | 1103.45s | 132.59s | exact tie | 8.3× | 10.9× slower | 12.8× slower |
| tsp-003 | 6 | 1208.28s | 108.37s | exact tie | 11.2× | 9.8× slower | 9.6× slower |
| tsp-004 | 8 | 2024.47s | 130.00s | exact tie | 15.6× | 11.2× slower | 12.3× slower |
| tsp-005 | 10 | 1924.97s | 139.24s | exact tie | 13.8× | 6.6× slower | 11.2× slower |

Telemetry across the full 7045s run: median GPU utilization **22%** (mean 17%, peaks 100% during prefill), median power **165 W** (well under the 220W cap), VRAM steady at 23.7 GB. The GPU spent the majority of the run blocked on PCIe round-trips to the 6.7 GB of weights resident in system RAM. This is the operating point a default `ollama run gemma4:31b` produces on a 24 GB GPU.
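Summary statistics of this kind could be derived from a telemetry CSV along the lines of the sketch below. The column names `gpu_util_pct` and `power_w` and the sample rows are hypothetical; the layout of the run's actual telemetry file is not described here, and the sample values are invented purely to exercise the function.

```python
import csv
import io
from statistics import mean, median


def summarize(csv_text):
    """Return median and mean GPU utilization plus median power draw."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    util = [float(r["gpu_util_pct"]) for r in rows]
    power = [float(r["power_w"]) for r in rows]
    return {
        "util_median": median(util),
        "util_mean": mean(util),
        "power_median": median(power),
    }


# Invented sample rows for illustration only; NOT the run's telemetry.
SAMPLE = """gpu_util_pct,power_w
100,200
30,150
10,140
20,145
40,160
"""

stats = summarize(SAMPLE)
print(stats)
```

Reporting the median alongside the mean keeps short prefill bursts at 100% utilization from masking how the GPU behaves for most of the run.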

Both lanes hit optimal on every rung for both models. The orchestration dividend grows with N for both, but the *between-model* comparison is unfavorable to 31B at every point: a bigger model on this hardware buys nothing in quality and costs ~10× wall-clock.
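The ratio columns can be recomputed from the raw timings. The sketch below copies the direct and orchestrated times from the two ladders above; the variable and function names are illustrative.

```python
# (cities, direct_s, orch_s) copied from the 26B and 31B ladders above.
LADDER_26B = [
    (4, 7.06, 10.17), (5, 100.85, 10.38), (6, 123.82, 11.32),
    (8, 181.31, 10.58), (10, 292.82, 12.43),
]
LADDER_31B = [
    (4, 151.02, 122.38), (5, 1103.45, 132.59), (6, 1208.28, 108.37),
    (8, 2024.47, 130.00), (10, 1924.97, 139.24),
]


def speedup(direct_s, orch_s):
    """Within-model orchestration speedup: direct time over orchestrated time."""
    return direct_s / orch_s


for (n, d26, o26), (_, d31, o31) in zip(LADDER_26B, LADDER_31B):
    print(
        f"{n:>2} cities: 26B {speedup(d26, o26):.1f}x, "
        f"31B {speedup(d31, o31):.1f}x, "
        f"31B direct {d31 / d26:.1f}x slower, "
        f"31B orch {o31 / o26:.1f}x slower"
    )

# Sum of all ten 31B lane timings, for comparison with the run duration.
total_31b_s = sum(d + o for _, d, o in LADDER_31B)
print(f"31B total: {total_31b_s:.0f} s ({total_31b_s / 60:.0f} min)")
```

The ten 31B timings sum to about 7045 s, which agrees with the 117 min wall-clock reported for the run.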

Evidence Anchor 1c: gemma4:26b vs gemma4:12b Arizona Ladder (REST API, June 2026)

To resolve the CLI-wrapped generate-looping bottleneck, the benchmark was migrated to the Ollama HTTP REST API (`/api/generate` with `"stream": false`, `"temperature": 0.0`, and `"num_predict": 2048`).
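A request of this shape might look like the sketch below, using only the Python standard library. The endpoint URL assumes Ollama's default local port, and placing `temperature` and `num_predict` under `options` follows Ollama's documented request format; how the benchmark harness actually builds its request is not shown in this paper, so the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt):
    """Build a non-streaming generate request with the settings quoted above."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": 0.0,   # sampling temperature quoted above
            "num_predict": 2048,  # cap on generated tokens
        },
    }


def generate(model, prompt, timeout_s=600):
    """POST the request and return the generated text (needs a running server)."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("gemma4:26b", prompt)` would then time a single lane; wrapping the call with `time.perf_counter()` is one way to collect per-fixture latency.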

The optimized runs compare gemma4:26b and the newly integrated gemma4:12b model.

gemma4:12b (7.6 GB resident, REST API)

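| Instance | Cities | Direct time | Orch time | Direct valid? | Speedup |
| :--- | :--- | :--- | :--- | :--- | :--- |
| tsp-001 | 4 | 13.99s | 14.78s | yes | 0.9× |
| tsp-002 | 5 | 33.87s | 15.91s | no | 2.1× |
| tsp-003 | 6 | 33.91s | 17.22s | no | 2.0× |
| tsp-004 | 8 | 34.04s | 18.24s | no | 1.9× |
| tsp-005 | 10 | 35.03s | 16.26s | no | 2.2× |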

gemma4:26b (17 GB resident, REST API)

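| Instance | Cities | Direct time | Orch time | Direct valid? | Speedup |
| :--- | :--- | :--- | :--- | :--- | :--- |
| tsp-001 | 4 | 15.81s | 12.61s | yes | 1.3× |
| tsp-002 | 5 | 22.34s | 10.75s | no | 2.1× |
| tsp-003 | 6 | 20.92s | 11.55s | no | 1.8× |
| tsp-004 | 8 | 20.77s | 11.72s | no | 1.8× |
| tsp-005 | 10 | 21.58s | 11.38s | no | 1.9× |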

Key Optimization Insights:

  1. **The CLI Wrapper Bottleneck**: Shifting from the `ollama run` subprocess CLI wrapper to the direct HTTP REST API reduced gemma4:26b direct 10-city latency from **292.82s** to **21.58s** (a **13.6× speedup**). This indicates that the super-linear latency scaling observed in the original CLI runs came from transport-layer and token-buffering overhead rather than from model inference scaling.
  2. **The 26b Latency Advantage**: In the optimized API-driven environment, gemma4:26b exhibits **lower total latency** than gemma4:12b (orchestrated average of ~11.6s vs ~16.5s). This is a result of the 26b model's superior stop-token discipline and concise thinking preambles, whereas the 12b model generates more verbose preambles and suffers from minor generation looping even under REST options.
  3. **Orchestration Boundaries**: Direct mode fails to produce valid tours for both models at $\ge 5$ cities, while orchestration mode remains 100% correct by delegating execution to the solver.
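The "valid tour" criterion in the last point can be made concrete with a check like the one below. This is a sketch under stated assumptions: a tour is taken to be a list of 0-based city indices without the closing return to the start, and the gap helper compares against an optimal length supplied by the solver. The benchmark's own validator may differ.

```python
from math import dist


def is_valid_tour(tour, n_cities):
    """True if `tour` lists every city index 0..n_cities-1 exactly once."""
    if len(tour) != n_cities:
        return False
    if not all(isinstance(c, int) for c in tour):
        return False
    return sorted(tour) == list(range(n_cities))


def tour_length(tour, coords):
    """Length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))


def relative_gap(tour, coords, optimal_length):
    """Fractional excess over the optimum; 0.0 corresponds to an exact tie."""
    return tour_length(tour, coords) / optimal_length - 1.0
```

A direct-mode answer that skips or repeats a city fails `is_valid_tour` no matter how short it claims to be, whereas an orchestrated answer passes by construction as long as the model restates the solver's tour unchanged.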

Evidence Anchor 2: ORCHESTRATION-001 (Scheduling)

A 20-fixture benchmark using gemma4:26b on complex scheduling tasks with 8-10 constraints.

| Metric | Lane A (Direct) | Lane B (Orchestrated) | Delta |
| :--- | :--- | :--- | :--- |
| **Accuracy** | 100.0% | 100.0% | 0.0% |
| **Latency (avg)** | **27.79s** | **11.48s** | **2.4x Speedup** |
| **Tokens (avg)** | **2,825** | **1,342** | **2.1x Saving** |

Conclusion

The "Intelligence" of a local model is an expensive resource. In systems with deterministic backends, the optimal use of this intelligence is as an **interface and orchestration layer**, not a calculation engine.

The scaling comparison sharpens the claim: in the CLI runs on the Arizona TSP ladder (Anchors 1a and 1b), both gemma4:26b and gemma4:31b reach optimal in direct mode at every rung. There is no quality dividend from scaling the local model up. There is, however, a ~10× wall-clock penalty, because the larger model spills off the GPU at default context and runs CPU-bound. By contrast, in those same runs, switching either model from direct to orchestrated mode buys a speedup of roughly 8× to 24× at five or more cities, with no loss in correctness.

The right architectural choice is therefore the *smallest* local model that holds correctness on the task class, paired with a deterministic solver. Scaling the model up without scaling the hardware is not a path to better answers — it is a path to slower answers that happen to be the same.

---

Evidence Ledger

Revision Note (2026-05-04)

The original frozen draft (2026-05-02) included a TSP-3090-001 table with a gemma4:31b direct/orchestrated row (1028.9s / 545.5s) and a gemma4:26b row (173.9s / 12.3s, "14× speedup") sourced to a docs/domain_runs/TSP_3090/ evidence packet. Audit found:

  1. The docs/domain_runs/TSP_3090/ directory did not exist; no such packet had been produced.
  2. The repository contained no gemma4:31b TSP run. gemma4:31b had only been run on the PTS-001 synthesis-counterexample task surface.
  3. The gemma4:26b numbers in the original table did not match the actual benchmark log at domains/ExperimentalAgents/LocalLLMTSP/GEMMA4_BENCHMARK_NOTE.md, which records 100.85s / 10.38s for the 5-city Arizona Clustered instance (real ~9.7× speedup, not 14×).
  4. The original "Stability Tax" narrative attributed OCP trips to a 31B spillage mechanism. The actual CRASH-20260502-01 was on gemma4:26b fully VRAM-resident at 332W. The cited mechanism was wrong on both ends.
  5. The "12.3s" 26B-orchestrated figure has no source in any TSP run; the closest match in the repository is 12.394s on a SCHED scheduling fixture in ORCHESTRATION_001/results.json — an unrelated task surface.

This revision (a) replaces the fabricated table with the real gemma4:26b Arizona ladder, (b) adds a fresh gemma4:31b Arizona ladder run captured with full GPU telemetry on 2026-05-04 (117 min wall-clock; results in docs/domain_runs/TSP_3090/), (c) corrects the Stability Tax to match the real CRASH-20260502-01, and (d) sharpens the architectural conclusion: the two-model comparison shows scaling the local model up is an anti-dividend on this task class, which strengthens rather than weakens the original working claim.

Published as part of the Bulkhead τ release line. Paper inventory: /papers/.