Local models: structural collapse

Local models drop cities and produce invalid tours. This failure is not gradual; it is a hard break in route validity.
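Structural validity is mechanically checkable: a tour is valid only if it visits every city exactly once. A minimal sketch (function and variable names are illustrative, not the case study's actual harness):

```python
def is_valid_tour(tour, n_cities):
    """A tour is structurally valid only if it is a permutation
    of all city indices: no city dropped, no city repeated."""
    return sorted(tour) == list(range(n_cities))

# A structurally collapsed output drops or repeats cities:
good = [0, 2, 1, 3]   # permutation of all 4 cities -> valid
bad = [0, 2, 2, 3]    # city 1 dropped, city 2 repeated -> invalid

print(is_valid_tour(good, 4))  # True
print(is_valid_tour(bad, 4))   # False
```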
Codex Frontier vs. Claude Frontier on the same TSP ladder — two rounds, bounded operator-side comparison.
This is a bounded case study, not a formal benchmark. Timing and comparability conditions are not identical to the local Ollama runs. The frontier rows are preserved as operator-side workflow evidence, not clean cross-table rows.
The local-model results already show the Phoenix conclusion: at 100 world cities, every tested local model collapsed structurally while the orchestrated path remained correct.
The frontier-anchor notes add an important qualification. Stronger models push the first degradation point farther out. At world scale:
- structural collapse (local models): drop cities and produce invalid tours. This failure is not gradual; it is a hard break in route validity.
- quality drift (frontier models): remain structurally valid at world scale but diverge materially from the deterministic heuristic baseline.
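The drift in the second mode is measured as a gap against the deterministic heuristic baseline. A minimal sketch of that computation; the baseline length used here is an invented illustrative number, not a figure from the runs:

```python
def percent_gap(model_length_mi, baseline_length_mi):
    """Percent by which a model-produced route exceeds the
    deterministic heuristic baseline route length."""
    return 100.0 * (model_length_mi - baseline_length_mi) / baseline_length_mi

# Illustrative numbers only, in the shape of the world-scale rows:
baseline = 66_000.0            # hypothetical heuristic route length (mi)
model = baseline + 4_996.3     # model route comes back ~4,996 mi longer
print(f"{percent_gap(model, baseline):.1f}% gap")
```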
The failure mode shifts. The architectural conclusion survives both regimes.
Two bounded frontier assessments run independently on the same fixture set. Neither run is a clean shared-table row — timing and methodology were not matched to the Ollama local runs.
| Instance | Cities | Codex Frontier | Claude Frontier | Orch correct? |
|---|---|---|---|---|
| tsp-004 | 8 | gap 0.0 (exact) | gap 0.0 (exact) | yes |
| tsp-005 | 10 | gap 0.0 (exact) | gap 0.0 (exact) | yes |
| tsp-006 | 20 | gap 0.0 (exact) | gap 0.0 (contaminated run) | yes |
| tsp-007 | 50 | not attempted | 1,484.2 mi gap (12.8%) | yes |
| tsp-008 | 100 | 8,761.4 mi vs heuristic | 4,996.3 mi vs heuristic (7.5%) | yes |
Both frontier models stayed structurally valid across the full tested ladder. Claude Frontier produced the better world-scale direct route quality in the observed run. Neither run changed the orchestrated-path conclusion.
Task: assess the fixed TSP ladder for a frontier anchor and preserve results. Common ladder target: tsp-004, tsp-005, tsp-006, optional tsp-007.
| Agent | Wall-clock | 5h session usage | Weekly usage | User interventions |
|---|---|---|---|---|
| Codex Frontier | ~2 min | 7% | 2% | none |
| Claude Frontier | ~5 min | 10% | 1% | one intervention |
Operator-observed session percentages, not provider token logs. Useful operational signal, not exact cross-provider token accounting.
Same general pattern: both agents were asked to extend the ladder to the 100-city world rung.
| Agent | Wall-clock | Weekly usage | User interventions |
|---|---|---|---|
| Codex Frontier | under 2 min | 4% | none |
| Claude Frontier | ~3 min | 8% | one approval |
The pattern holds across both rounds: Codex Frontier was faster and lower friction, while Claude Frontier took longer and needed intervention both times. Same interpretation applies: useful workflow signal, not exact token accounting.
Codex Frontier: workflow efficiency win. Faster in both rounds, lower observed weekly budget usage, zero interventions required. Better fit for burst-oriented or lower-supervision workflows.
Claude Frontier: route quality win. Produced the better world-scale direct route in the observed run (~7.5% gap vs heuristic, vs ~14.5% for Codex). Worth the overhead when final route quality matters more than throughput.
This is a case-study workflow claim, not a universal model ranking. Two rounds on one bounded task domain.
Despite different workflow profiles, both frontier assessments converged on the same high-level outcomes:
- The TSP lesson stayed stable even with frontier capability in the loop.
- Stronger direct solving can delay or reduce obvious failure, but it does not remove the architectural decision: the professional architecture question remains separate from raw direct capability.
The Phoenix line still holds:
- Correctness should live in the solver, not in the model.
- Stronger models delay failure; they do not eliminate the need for solver-backed architecture.
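That division of labor can be sketched as a guard around any model-proposed route: the deterministic side owns validity and always has its own heuristic to fall back on. The names here (`orchestrated_route`, `nearest_neighbor`) are illustrative assumptions, not the case study's actual orchestrator:

```python
import math

def nearest_neighbor(coords):
    """Deterministic fallback heuristic: always produces a valid tour,
    regardless of what the model proposed."""
    unvisited = set(range(1, len(coords)))
    tour = [0]
    while unvisited:
        last = coords[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, coords[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def orchestrated_route(coords, model_tour):
    """Accept the model's proposal only if it is a valid tour.
    Correctness comes from the solver side, never from the model."""
    if sorted(model_tour) == list(range(len(coords))):
        return model_tour
    return nearest_neighbor(coords)
```

A collapsed proposal (dropped or repeated cities) is silently replaced by the heuristic tour, so the orchestrated path stays correct even when the model's direct output breaks.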
This page should be used as: bounded operator-side workflow evidence from two frontier assessment rounds.

This page should not be used as: a formal benchmark or a universal model ranking.