Direct local-model solving vs. orchestrated solver-backed path on fixed Arizona fixtures.
The simplest evidence: hold the workload fixed, vary only the model, and compare the direct path against the orchestrated path.
| Instance | Cities | gemma4:26b | gemma4:e4b | gemma3:27b | qwen2.5:14b | llama3.1:8b | llama3.2:3b | Orch correct? |
|---|---|---|---|---|---|---|---|---|
| tsp-003 | 6 | exact tie | exact tie | gap 4.19 | — | — | — | yes |
| tsp-004 | 8 | exact tie | gap 2.11 | exact tie | duplicate city | missing city | duplicate city | yes |
| tsp-005 | 10 | exact tie | gap 3.82 | gap 5.43 | gap 18.42 | gap 20.85 | duplicate Phoenix | yes |
gemma4:26b is the first local model to achieve exact ties across the full Arizona ladder. gemma4:e4b cracked the greedy trap (tsp-003) that gemma3:27b could not. The orchestrated path remained correct across every model and every instance.
Two gemma4 MoE models were added to the ladder — both run on an RTX 3090 via Ollama with default quantization.
gemma4:e4b · 11 GB loaded · ~4B active parameters
| Instance | Cities | Direct valid? | Direct gap | Orch exact? | Direct time | Orch time |
|---|---|---|---|---|---|---|
| tsp-001 | 4 | yes | exact tie | yes | 15.210s | 10.038s |
| tsp-002 | 5 | yes | gap 1.769079 | yes | 28.597s | 9.829s |
| tsp-003 | 6 | yes | exact tie | yes | 29.426s | 10.002s |
| tsp-004 | 8 | yes | gap 2.107990 | yes | 54.150s | 6.356s |
| tsp-005 | 10 | yes | gap 3.819021 | yes | 32.654s | 10.292s |
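gemma4:e4b had zero structural failures on the Arizona fixtures. It solved the greedy trap (tsp-003) exactly, where gemma3:27b had a gap of 4.185. Elsewhere its quality against gemma3:27b was mixed: worse on tsp-002 and tsp-004, better on tsp-005. At world-100 it collapsed with 9 missing cities, the same count as gemma3:27b.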
gemma4:26b · 17 GB loaded · ~26B total / 4B active (larger gate) · first local model to achieve 5/5 exact ties
| Instance | Cities | Direct valid? | Direct gap | Orch exact? | Direct time | Orch time |
|---|---|---|---|---|---|---|
| tsp-001 | 4 | yes | exact tie | yes | 7.059s | 10.172s |
| tsp-002 | 5 | yes | exact tie | yes | 100.852s | 10.384s |
| tsp-003 | 6 | yes | exact tie | yes | 123.816s | 11.315s |
| tsp-004 | 8 | yes | exact tie | yes | 181.311s | 10.575s |
| tsp-005 | 10 | yes | exact tie | yes | 292.824s | 12.430s |
gemma4:26b reached perfect direct quality across all five Arizona fixtures, the first local model to do so. The cost is time: direct solve time grows from 7s (4 cities) to 293s (10 cities). At world-100 it dropped only 2 cities (Hanoi, Busan), the closest any local model has come to structural validity at that scale. The orchestrated path stays roughly flat at 10–16s regardless of instance size (10.2s to 12.4s on the five fixtures above).
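To make the time cost concrete, the short sketch below computes the direct-to-orchestrated time ratio for each row of the gemma4:26b table. The timings are copied from that table; the variable names are illustrative only.

```python
# Timings copied from the gemma4:26b table above:
# (instance, cities, direct seconds, orchestrated seconds).
rows = [
    ("tsp-001", 4, 7.059, 10.172),
    ("tsp-002", 5, 100.852, 10.384),
    ("tsp-003", 6, 123.816, 11.315),
    ("tsp-004", 8, 181.311, 10.575),
    ("tsp-005", 10, 292.824, 12.430),
]

# A ratio above 1 means the direct path was slower than the orchestrated path.
ratios = {inst: direct / orch for inst, _, direct, orch in rows}

for inst, cities, direct, orch in rows:
    print(f"{inst} ({cities} cities): direct is {ratios[inst]:.1f}x the orchestrated time")
```

On these numbers the direct path is faster only on the 4-city case; from 5 cities upward it is slower, and the ratio grows with instance size.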
Fixture: tsp-004 · az_large · 8 cities · brute-force exact solver
| Model | Direct valid? | Direct outcome | Orch exact? | Direct time | Orch time |
|---|---|---|---|---|---|
| gemma3:27b | yes | exact tie | yes | 20.394s | 14.926s |
| qwen2.5:14b | no | duplicate-city route | yes | 28.359s | 5.852s |
| llama3.1:8b | no | missing-city route | yes | 8.998s | 3.474s |
| llama3.2:3b | no | duplicate-city route | yes | 5.918s | 1.843s |
At 8 cities, direct local-model solving already varies from exact to structurally invalid. The orchestrated path stayed exact across all tested models.
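The structural outcomes in the table (duplicate-city route, missing-city route) can be checked mechanically before any distance is computed. The sketch below is one plausible way to do that, not the checker used for these results. It assumes a route is a list of city names with each city visited once and the start city not repeated at the end; apart from Phoenix, the city names are placeholders.

```python
from collections import Counter

def classify_route(route, cities):
    """Classify a proposed route against the fixture's city list.

    Returns (label, offending_cities). Assumes the route does not
    repeat the start city at the end (an illustrative convention).
    """
    counts = Counter(route)
    duplicates = sorted(c for c, n in counts.items() if n > 1)
    missing = sorted(set(cities) - set(route))
    unknown = sorted(set(route) - set(cities))
    if duplicates:
        return "duplicate-city route", duplicates
    if missing:
        return "missing-city route", missing
    if unknown:
        return "unknown-city route", unknown
    return "valid", []

# Placeholder fixture: only "Phoenix" appears in the results above.
cities = ["Phoenix", "Tucson", "Flagstaff", "Yuma"]

print(classify_route(["Phoenix", "Tucson", "Flagstaff", "Yuma"], cities))
print(classify_route(["Phoenix", "Tucson", "Phoenix", "Yuma"], cities))
print(classify_route(["Phoenix", "Tucson", "Yuma"], cities))
```

A route that fails this check has no meaningful gap, so the validity check belongs before any length comparison.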
Fixture: tsp-005 · az_large · 10 cities · brute-force exact solver
| Model | Direct valid? | Direct outcome | Orch exact? | Direct time | Orch time |
|---|---|---|---|---|---|
| gemma3:27b | yes | gap 5.432581 | yes | 25.065s | 12.990s |
| qwen2.5:14b | yes | gap 18.416634 | yes | 14.665s | 8.215s |
| llama3.1:8b | yes | gap 20.846102 | yes | 9.132s | 4.806s |
| llama3.2:3b | no | duplicate Phoenix | yes | 5.636s | 3.244s |
At 10 cities, even the stronger direct models remain suboptimal. The weakest model becomes structurally invalid. The orchestrated path stayed exact across all tested models.
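The fixtures above name a brute-force exact solver as the reference. A minimal sketch of that technique follows, assuming straight-line distances between 2D coordinates; the actual fixtures may use a different distance measure, and the function names and coordinates here are illustrative.

```python
from itertools import permutations
from math import dist

def tour_length(order, coords):
    """Length of the closed tour that visits `order` and returns to the start."""
    n = len(order)
    return sum(dist(coords[order[i]], coords[order[(i + 1) % n]]) for i in range(n))

def brute_force_tsp(coords):
    """Try every ordering of the non-start cities and keep the shortest tour.

    Fixing the start city removes rotations of the same tour, leaving
    (n - 1)! candidates: 362,880 for a 10-city fixture.
    """
    names = list(coords)
    start, rest = names[0], names[1:]
    best_order, best_len = None, float("inf")
    for perm in permutations(rest):
        order = (start, *perm)
        length = tour_length(order, coords)
        if length < best_len:
            best_order, best_len = order, length
    return list(best_order), best_len

# Illustrative coordinates: four corners of a unit square.
square = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 1)}
route, length = brute_force_tsp(square)
print(route, length)
```

Exhaustive search is only practical at small sizes, but at 8 to 10 cities it is cheap and its answer is deterministic, which is what makes it a usable reference.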
gemma3:27b, the stronger direct model in this comparison, stayed structurally valid on every Arizona case but still drifted from optimal on three of the five instances.
| Arizona case | Instance | Cities | Direct valid? | Direct outcome | Orch exact? |
|---|---|---|---|---|---|
| Easy perimeter | tsp-001 | 4 | yes | exact tie | yes |
| Clustered | tsp-002 | 5 | yes | gap 0.724103 | yes |
| Greedy trap | tsp-003 | 6 | yes | gap 4.185364 | yes |
| Larger fixture | tsp-004 | 8 | yes | exact tie | yes |
| Larger fixture | tsp-005 | 10 | yes | gap 5.432581 | yes |
The direct path can stay structurally valid while still drifting away from optimal. The orchestrated path stayed exact across the whole fixture ladder.
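Once a route has passed the structural check, quality is a separate comparison against the exact optimum. The sketch below assumes the reported gap is the absolute difference between the direct tour length and the optimal length, which the tables do not state explicitly. The tolerance and the function name are also assumptions.

```python
def describe_outcome(direct_length, optimal_length, tol=1e-6):
    """Label a structurally valid route as an exact tie or a gap.

    Assumes gap = direct_length - optimal_length (absolute, not percent).
    """
    gap = direct_length - optimal_length
    if gap < -tol:
        # A valid tour cannot beat the exact optimum; this signals a bad input.
        raise ValueError("direct route is shorter than the exact optimum")
    if abs(gap) <= tol:
        return "exact tie"
    return f"gap {gap:.6f}"

# Hypothetical lengths, chosen only to exercise both branches.
print(describe_outcome(10.0, 10.0))
print(describe_outcome(12.5, 10.0))
```

Keeping this step apart from the validity check mirrors the two columns in the tables: "Direct valid?" answers one question, "Direct outcome" the other.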
llama3.2:3b, the smallest tested model, shows the opposite edge: structural failure begins at 5 cities, as soon as the fixture grows past the easiest case.
| Arizona case | Instance | Cities | Direct valid? | Direct outcome | Orch exact? |
|---|---|---|---|---|---|
| Easy perimeter | tsp-001 | 4 | yes | exact tie | yes |
| Clustered | tsp-002 | 5 | no | missing-city route | yes |
| Greedy trap | tsp-003 | 6 | no | missing-city route | yes |
| Larger fixture | tsp-004 | 8 | no | duplicate Phoenix | yes |
| Larger fixture | tsp-005 | 10 | no | duplicate Phoenix | yes |
The smallest model can still succeed on the very smallest case. The orchestrated path stayed exact across the whole fixture ladder regardless.
Strongest direct model in the Arizona slice. Still not a reliable reason to make the model the solver — even a best-case result carries workload sensitivity.
Can stay structurally valid while still drifting far from the optimal route. Structural validity and solution quality are separate concerns.
Stayed structurally valid on the 10-city slice but direct quality degraded materially as the workload grew.
Fast, but becomes structurally invalid on both fixed comparison slices. Speed advantage does not offset correctness failure.
When the model is asked to directly solve the optimization problem, quality varies materially by model and workload.
When the model is used to interpret, route, and explain around a deterministic solver, correctness stays stable.
One-sentence version: the LLM is useful — just not as the solver.
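The division of labor described above can be sketched as three steps: interpret the request, hand the optimization to a deterministic solver, and explain the result. This is a minimal illustration, not the orchestrator behind these results. The two model steps are replaced by plain-Python stand-ins, the coordinates are made-up grid positions rather than real geography, and all function names are hypothetical.

```python
from itertools import permutations
from math import dist

def interpret(request, coords):
    """Stand-in for the LLM step: pick out known city names in the request."""
    return [name for name in coords if name.lower() in request.lower()]

def solve_exact(names, coords):
    """Deterministic step: exhaustive search over tours starting at names[0]."""
    def length(order):
        return sum(dist(coords[a], coords[b]) for a, b in zip(order, order[1:] + order[:1]))
    start, rest = names[0], names[1:]
    best = min(([start, *p] for p in permutations(rest)), key=length)
    return best, length(best)

def explain(route, total):
    """Stand-in for the LLM step: wording only; the numbers come from the solver."""
    return f"Visit {' -> '.join(route)} and return to {route[0]} (length {total:.2f})."

def orchestrate(request, coords):
    names = interpret(request, coords)
    route, total = solve_exact(names, coords)
    return explain(route, total)

# Made-up grid positions, not real coordinates.
coords = {"Phoenix": (0, 0), "Tucson": (1, 0), "Flagstaff": (1, 1), "Yuma": (0, 1)}
answer = orchestrate("Plan a loop through Phoenix, Yuma, Tucson and Flagstaff.", coords)
print(answer)
```

In this shape the model never produces the route itself, so a weaker model can change the wording of the answer but not its correctness, provided the interpretation step extracts the right cities.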