Local models: structural collapse

Local models drop cities and produce invalid tours. This failure is not gradual; it is a hard break in route validity.
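Structural validity is mechanically checkable: a tour is valid only if it visits every city exactly once. A minimal sketch (function and variable names are illustrative, not the case study's actual harness):

```python
def is_valid_tour(tour, n_cities):
    """A tour is structurally valid only if it is a permutation
    of all city indices: no city dropped, no city repeated."""
    return sorted(tour) == list(range(n_cities))

# A structurally collapsed output drops or repeats cities:
good = [0, 2, 1, 3]   # permutation of all 4 cities -> valid
bad = [0, 2, 2, 3]    # city 1 dropped, city 2 repeated -> invalid

print(is_valid_tour(good, 4))  # True
print(is_valid_tour(bad, 4))   # False
```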
Codex Frontier vs. Claude Frontier on the same TSP ladder — two rounds, bounded operator-side comparison.
This is a bounded case study, not a formal benchmark. Timing and comparability conditions are not identical to the local Ollama runs. The frontier rows are preserved as operator-side workflow evidence, not clean cross-table rows.
The local-model results already show the Phoenix conclusion: at 100 world cities, every tested local model collapsed structurally while the orchestrated path remained correct.
The frontier-anchor notes add an important qualification. Stronger models push the first degradation point farther out. At world scale:
- structural collapse (local models): drop cities and produce invalid tours. This failure is not gradual; it is a hard break in route validity.
- quality drift (frontier models): remain structurally valid at world scale but diverge materially from the deterministic heuristic baseline.
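The drift in the second mode is measured as a gap against the deterministic heuristic baseline. A minimal sketch of that computation; the baseline length used here is an invented illustrative number, not a figure from the runs:

```python
def percent_gap(model_length_mi, baseline_length_mi):
    """Percent by which a model-produced route exceeds the
    deterministic heuristic baseline route length."""
    return 100.0 * (model_length_mi - baseline_length_mi) / baseline_length_mi

# Illustrative numbers only, in the shape of the world-scale rows:
baseline = 66_000.0            # hypothetical heuristic route length (mi)
model = baseline + 4_996.3     # model route comes back ~4,996 mi longer
print(f"{percent_gap(model, baseline):.1f}% gap")
```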
The failure mode shifts. The architectural conclusion survives both regimes.
Two bounded frontier assessments run independently on the same fixture set. Neither run is a clean shared-table row — timing and methodology were not matched to the Ollama local runs.
| Instance | Cities | Codex Frontier | Claude Frontier | Orch correct? |
|---|---|---|---|---|
| tsp-004 | 8 | gap 0.0 (exact) | gap 0.0 (exact) | yes |
| tsp-005 | 10 | gap 0.0 (exact) | gap 0.0 (exact) | yes |
| tsp-006 | 20 | gap 0.0 (exact) | gap 0.0 (contaminated run) | yes |
| tsp-007 | 50 | not attempted | 1,484.2 mi gap (12.8%) | yes |
| tsp-008 | 100 | 8,761.4 mi vs heuristic | 4,996.3 mi vs heuristic (7.5%) | yes |
Both frontier models stayed structurally valid across the full tested ladder. Claude Frontier produced the better world-scale direct route quality in the observed run. Neither run changed the orchestrated-path conclusion.
Task: assess the fixed TSP ladder for a frontier anchor and preserve results. Common ladder target: tsp-004, tsp-005, tsp-006, optional tsp-007.
| Agent | Wall-clock | 5h session usage | Weekly usage | User interventions |
|---|---|---|---|---|
| Codex Frontier | ~2 min | 7% | 2% | none |
| Claude Frontier | ~5 min | 10% | 1% | one intervention |
Operator-observed session percentages, not provider token logs. Useful operational signal, not exact cross-provider token accounting.
Same general pattern: both agents were asked to extend the ladder to the 100-city world rung.
| Agent | Wall-clock | Weekly usage | User interventions |
|---|---|---|---|
| Codex Frontier | under 2 min | 4% | none |
| Claude Frontier | ~3 min | 8% | one approval |
The pattern holds across both rounds: Codex Frontier was faster and lower friction, while Claude Frontier took longer and needed intervention both times. Same interpretation applies: useful workflow signal, not exact token accounting.
Codex Frontier: workflow efficiency win. Faster in both rounds, lower observed weekly budget usage, zero interventions required. Better fit for burst-oriented or lower-supervision workflows.
Claude Frontier: route quality win. Produced the better world-scale direct route in the observed run (~7.5% gap vs heuristic, vs ~14.5% for Codex). Worth the overhead when final route quality matters more than throughput.
This is a case-study workflow claim, not a universal model ranking. Two rounds on one bounded task domain.
Despite different workflow profiles, both frontier assessments converged on the same high-level outcomes:
- The TSP lesson stayed stable even with frontier capability in the loop.
- Stronger direct solving can delay or reduce obvious failure, but it does not remove the architectural decision: the professional architecture question remains separate from raw direct capability.
The Phoenix line still holds:
- Correctness should live in the solver, not in the model.
- Stronger models delay failure; they do not eliminate the need for solver-backed architecture.
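That division of labor can be sketched as a guard around any model-proposed route: the deterministic side owns validity and always has its own heuristic to fall back on. The names here (`orchestrated_route`, `nearest_neighbor`) are illustrative assumptions, not the case study's actual orchestrator:

```python
import math

def nearest_neighbor(coords):
    """Deterministic fallback heuristic: always produces a valid tour,
    regardless of what the model proposed."""
    unvisited = set(range(1, len(coords)))
    tour = [0]
    while unvisited:
        last = coords[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, coords[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def orchestrated_route(coords, model_tour):
    """Accept the model's proposal only if it is a valid tour.
    Correctness comes from the solver side, never from the model."""
    if sorted(model_tour) == list(range(len(coords))):
        return model_tour
    return nearest_neighbor(coords)
```

A collapsed proposal (dropped or repeated cities) is silently replaced by the heuristic tour, so the orchestrated path stays correct even when the model's direct output breaks.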
This page should be used as: bounded operator-side workflow evidence from two frontier assessment rounds.

This page should not be used as: a formal benchmark or a universal model ranking.