Local Model Results

Arizona Ladder

Direct local-model solving vs. orchestrated solver-backed path on fixed Arizona fixtures.

The simplest evidence: hold the workload fixed, vary only the model, and compare the direct path against the orchestrated path.

Compact Arizona Summary

Instance Cities gemma4:26b gemma4:e4b gemma3:27b qwen2.5:14b llama3.1:8b llama3.2:3b Orch correct?
tsp-003 6 exact tie exact tie gap 4.19 - - - yes
tsp-004 8 exact tie gap 2.11 exact tie duplicate city missing city duplicate city yes
tsp-005 10 exact tie gap 3.82 gap 5.43 gap 18.42 gap 20.85 duplicate Phoenix yes

gemma4:26b is the first local model to achieve exact ties across the full Arizona ladder. gemma4:e4b cracked the greedy trap (tsp-003) that gemma3:27b could not. The orchestrated path remained correct across every model and every instance.

Gemma 4 Results

Two gemma4 MoE models were added to the ladder. Both ran on an RTX 3090 via Ollama with default quantization.

gemma4:e4b — Arizona Ladder

11 GB loaded · ~4B active parameters

Instance Cities Direct valid? Direct gap Orch exact? Direct time Orch time
tsp-001 4 yes exact tie yes 15.210s 10.038s
tsp-002 5 yes gap 1.769079 yes 28.597s 9.829s
tsp-003 6 yes exact tie yes 29.426s 10.002s
tsp-004 8 yes gap 2.10799 yes 54.150s 6.356s
tsp-005 10 yes gap 3.819021 yes 32.654s 10.292s

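Zero structural failures. Solved the greedy trap (tsp-003) exactly, where gemma3:27b had a gap of 4.185. Quality elsewhere is mixed against gemma3:27b: worse on tsp-002 (gap 1.769 vs 0.724) and tsp-004 (gap 2.108 vs exact tie), better on tsp-005 (gap 3.819 vs 5.433). Collapsed at world-100 with 9 missing cities, the same count as gemma3:27b.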

gemma4:26b — Arizona Ladder

17 GB loaded · ~26B total / 4B active (larger gate) · first local model to achieve 5/5 exact ties

Instance Cities Direct valid? Direct gap Orch exact? Direct time Orch time
tsp-001 4 yes exact tie yes 7.059s 10.172s
tsp-002 5 yes exact tie yes 100.852s 10.384s
tsp-003 6 yes exact tie yes 123.816s 11.315s
tsp-004 8 yes exact tie yes 181.311s 10.575s
tsp-005 10 yes exact tie yes 292.824s 12.430s

Perfect direct quality across all five Arizona fixtures, the first local model to achieve this. The cost: direct solve time grows from 7s (4 cities) to 293s (10 cities), roughly a 41x increase. At world-100, it dropped only 2 cities (Hanoi, Busan), the closest any local model has come to structural validity at that scale. The orchestrated path stays flat at 10–16s regardless of instance size (10.2s to 12.4s on the five fixtures above).

8-City Arizona Comparison

Fixture: tsp-004 · az_large · 8 cities · brute-force exact solver

Model Direct valid? Direct outcome Orch exact? Direct time Orch time
gemma3:27b yes exact tie yes 20.394s 14.926s
qwen2.5:14b no duplicate-city route yes 28.359s 5.852s
llama3.1:8b no missing-city route yes 8.998s 3.474s
llama3.2:3b no duplicate-city route yes 5.918s 1.843s

At 8 cities, direct local-model solving already varies from exact to structurally invalid. The orchestrated path stayed exact across all tested models.
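For readers unfamiliar with what a brute-force exact solver does on fixtures this small, here is a minimal sketch. It is not the project's solver: the function names, the Euclidean distance metric, and the unit-square coordinates are all assumptions made for illustration, and the real Arizona fixture data is not reproduced here.

```python
from itertools import permutations
from math import dist


def tour_length(route, coords):
    """Length of the closed tour that visits `route` in order and returns to the start."""
    return sum(
        dist(coords[a], coords[b])
        for a, b in zip(route, route[1:] + route[:1])
    )


def brute_force_tsp(coords):
    """Check every ordering of the non-start cities and keep the shortest tour."""
    cities = list(coords)
    start, rest = cities[0], cities[1:]
    best_route, best_length = None, float("inf")
    for perm in permutations(rest):
        route = [start, *perm]
        length = tour_length(route, coords)
        if length < best_length:  # strict: the first optimal ordering found is kept
            best_route, best_length = route, length
    return best_route, best_length


# Hypothetical coordinates (corners of a unit square), not real fixture data.
square = {"A": (0, 0), "B": (1, 1), "C": (1, 0), "D": (0, 1)}
route, length = brute_force_tsp(square)
print(route, length)
```

With the start city fixed, this enumerates (n-1)! orderings: 5,040 for 8 cities and 362,880 for 10, which is why exhaustive search is still practical at these sizes.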

10-City Arizona Comparison

Fixture: tsp-005 · az_large · 10 cities · brute-force exact solver

Model Direct valid? Direct outcome Orch exact? Direct time Orch time
gemma3:27b yes gap 5.432581 yes 25.065s 12.990s
qwen2.5:14b yes gap 18.416634 yes 14.665s 8.215s
llama3.1:8b yes gap 20.846102 yes 9.132s 4.806s
llama3.2:3b no duplicate Phoenix yes 5.636s 3.244s

At 10 cities, every direct model in this comparison is suboptimal, with gaps from 5.43 to 20.85 among the valid routes, and the weakest model becomes structurally invalid. The orchestrated path stayed exact across all tested models.

gemma3:27b — Full Arizona Fixture Ladder

gemma3:27b stayed structurally valid on every Arizona case but drifted from optimal on three of the five (tsp-002, tsp-003, tsp-005).

Arizona case Instance Cities Direct valid? Direct outcome Orch exact?
Easy perimeter tsp-001 4 yes exact tie yes
Clustered tsp-002 5 yes gap 0.724103 yes
Greedy trap tsp-003 6 yes gap 4.185364 yes
Larger fixture tsp-004 8 yes exact tie yes
Larger fixture tsp-005 10 yes gap 5.432581 yes

The direct path can stay structurally valid while still drifting away from optimal. The orchestrated path stayed exact across the whole fixture ladder.
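The "exact tie" and "gap" outcomes in the tables can be read as a comparison of tour lengths. The sketch below assumes the gap is the direct route's length minus the optimal length in the same distance units. The source does not define the gap, so that definition, the tolerance, the function names, and the unit-square coordinates are all hypothetical.

```python
from math import dist, isclose


def tour_length(route, coords):
    """Length of the closed tour that visits `route` in order and returns to the start."""
    return sum(
        dist(coords[a], coords[b])
        for a, b in zip(route, route[1:] + route[:1])
    )


def direct_outcome(route, optimal_length, coords, tol=1e-6):
    """Label a structurally valid route as an exact tie or report its gap.

    Assumption: gap = route length - optimal length (absolute, not a percentage).
    """
    length = tour_length(route, coords)
    if isclose(length, optimal_length, abs_tol=tol):
        return "exact tie"
    return f"gap {length - optimal_length:.6f}"


# Hypothetical coordinates (corners of a unit square); the optimal tour length is 4.0.
square = {"A": (0, 0), "B": (1, 1), "C": (1, 0), "D": (0, 1)}
print(direct_outcome(["A", "C", "B", "D"], 4.0, square))  # walks the perimeter
print(direct_outcome(["A", "B", "C", "D"], 4.0, square))  # crosses both diagonals
```

Both routes visit every city exactly once, so both are structurally valid. Only the second is suboptimal, which is the distinction the table draws.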

llama3.2:3b — Full Arizona Fixture Ladder

The smallest tested model shows the opposite edge: structural failure begins at 5 cities, the first step up from the easiest fixture.

Arizona case Instance Cities Direct valid? Direct outcome Orch exact?
Easy perimeter tsp-001 4 yes exact tie yes
Clustered tsp-002 5 no missing-city route yes
Greedy trap tsp-003 6 no missing-city route yes
Larger fixture tsp-004 8 no duplicate Phoenix yes
Larger fixture tsp-005 10 no duplicate Phoenix yes

The smallest model can still succeed on the very smallest case. The orchestrated path stayed exact across the whole fixture ladder regardless.
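The structural failures named in the tables (missing-city and duplicate-city routes) can be detected without any distance data. A minimal sketch follows, assuming a route is a list of city names that does not repeat the start city at the end. The function name, the labels, and the city list other than Phoenix are hypothetical, not the actual fixture contents.

```python
from collections import Counter


def classify_route(route, cities):
    """Return (is_valid, label) for a proposed route over the expected `cities`.

    Duplicates are reported first, since a duplicated city usually
    implies that some other city was dropped as well.
    """
    expected = set(cities)
    seen = set(route)
    counts = Counter(route)

    duplicated = sorted(city for city, n in counts.items() if n > 1)
    missing = sorted(expected - seen)
    unknown = sorted(seen - expected)

    if duplicated:
        return False, "duplicate " + ", ".join(duplicated)
    if missing:
        return False, "missing " + ", ".join(missing)
    if unknown:
        return False, "unknown " + ", ".join(unknown)
    return True, "valid"


# Hypothetical 4-city instance for illustration only.
cities = ["Phoenix", "Tucson", "Flagstaff", "Yuma"]
print(classify_route(["Phoenix", "Tucson", "Flagstaff", "Yuma"], cities))
print(classify_route(["Phoenix", "Tucson", "Phoenix", "Yuma"], cities))
print(classify_route(["Phoenix", "Tucson", "Yuma"], cities))
```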

Model-by-Model Reading

gemma3:27b

Strongest direct model in the fixed 8- and 10-city comparisons. Still not a reliable reason to make the model the solver: even this best case is exact at 8 cities and off by 5.43 at 10.

qwen2.5:14b

Can stay structurally valid while still drifting far from the optimal route. Structural validity and solution quality are separate concerns.

llama3.1:8b

Structurally invalid on the 8-city slice (missing-city route). Stayed valid on the 10-city slice, but with a gap of 20.85, the largest among the valid direct routes there.

llama3.2:3b

Fast, but becomes structurally invalid on both fixed comparison slices. Speed advantage does not offset correctness failure.

Phoenix Point

When the model is asked to directly solve the optimization problem, quality varies materially by model and workload.

When the model is used to interpret, route, and explain around a deterministic solver, correctness stays stable.
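The division of labor described here can be sketched as a pipeline in which the model handles only the text on either side of the solver. This is an illustrative outline under stated assumptions, not the project's orchestrator: `orchestrate`, `solve_exact`, and the two stand-in functions are hypothetical names, and the stand-ins replace what would be real model calls.

```python
from itertools import permutations
from math import dist


def solve_exact(coords):
    """Deterministic exhaustive search; the result never depends on the model."""
    start, *rest = coords

    def length(route):
        return sum(
            dist(coords[a], coords[b])
            for a, b in zip(route, route[1:] + route[:1])
        )

    best = min(((start, *p) for p in permutations(rest)), key=length)
    return list(best), length(best)


def orchestrate(request, interpret, explain):
    """Model interprets the request, solver computes the route, model explains it."""
    coords = interpret(request)          # model role: free text -> structured instance
    route, length = solve_exact(coords)  # solver role: exact route
    return explain(route, length)        # model role: structured result -> prose


# Stand-ins for model calls (assumptions for this sketch).
def fake_interpret(request):
    return {"A": (0, 0), "B": (1, 1), "C": (1, 0), "D": (0, 1)}


def fake_explain(route, length):
    return f"Route {' -> '.join(route)}, length {length:.2f}"


print(orchestrate("Plan the shortest loop through A, B, C and D.", fake_interpret, fake_explain))
```

In this shape a weaker model can still misread the request or explain the result poorly, but it cannot emit a route with a missing or duplicated city, because the route itself comes from the solver.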

One-sentence version: the LLM is useful — just not as the solver.