Frontier Anchor Case Study

Codex Frontier vs. Claude Frontier on the same TSP ladder — two rounds, bounded operator-side comparison.

This is a bounded case study, not a formal benchmark. Timing and comparability conditions are not identical to the local Ollama runs. The frontier rows are preserved as operator-side workflow evidence, not clean cross-table rows.

What The Frontier Adds

The local-model results already show the Phoenix conclusion: at 100 world cities, every tested local model collapsed structurally while the orchestrated path remained correct.

The frontier-anchor notes add an important qualification. Stronger models push the first degradation point farther out. At world scale:

Local models: structural collapse. They drop cities and produce invalid tours. This failure is not gradual; it is a hard break in route validity.

Frontier models: quality drift. They remain structurally valid at world scale but diverge materially from the deterministic heuristic baseline.

The failure mode shifts. The architectural conclusion survives both regimes.
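The two regimes can be separated mechanically. A minimal sketch, with hypothetical city names and a hypothetical `is_valid_tour` helper, of the structural check that distinguishes collapse from drift:

```python
def is_valid_tour(tour, cities):
    # Structural validity: every required city appears exactly once,
    # no city is dropped, and no city is invented.
    return sorted(tour) == sorted(cities)

cities = ["tokyo", "delhi", "shanghai", "cairo", "mexico city"]

# Structural collapse: "shanghai" dropped, "tokyo" duplicated -> invalid tour.
collapsed = ["tokyo", "delhi", "tokyo", "cairo", "mexico city"]

# Quality drift: a valid permutation, possibly far longer than the heuristic route.
drifted = ["cairo", "shanghai", "tokyo", "mexico city", "delhi"]

assert not is_valid_tour(collapsed, cities)
assert is_valid_tour(drifted, cities)
```

Collapse fails this check outright; drift passes it and only shows up once route length is compared against a baseline.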

Frontier Ladder Results

Two bounded frontier assessments run independently on the same fixture set. Neither run is a clean shared-table row — timing and methodology were not matched to the Ollama local runs.

| Instance | Cities | Codex Frontier | Claude Frontier | Orch correct? |
|---|---|---|---|---|
| tsp-004 | 8 | gap 0.0 (exact) | gap 0.0 (exact) | yes |
| tsp-005 | 10 | gap 0.0 (exact) | gap 0.0 (exact) | yes |
| tsp-006 | 20 | gap 0.0 (exact) | gap 0.0 (contaminated run) | yes |
| tsp-007 | 50 | not attempted | 1,484.2 mi gap (12.8%) | yes |
| tsp-008 | 100 | 8,761.4 mi vs heuristic | 4,996.3 mi vs heuristic (7.5%) | yes |

Both frontier models stayed structurally valid across the full tested ladder. Claude Frontier produced the better world-scale direct route quality in the observed run. Neither run changed the orchestrated-path conclusion.
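The gap figures above follow the usual excess-over-baseline convention. A small sketch of that arithmetic, using hypothetical route lengths rather than the recorded run values:

```python
def gap_vs_heuristic(route_mi, heuristic_mi):
    # Absolute and percentage gap of a direct model route above the
    # deterministic heuristic baseline.
    gap_mi = route_mi - heuristic_mi
    return gap_mi, 100.0 * gap_mi / heuristic_mi

# Hypothetical lengths, for illustration only (not the recorded run values).
gap_mi, gap_pct = gap_vs_heuristic(route_mi=13_079.5, heuristic_mi=11_595.3)
# gap_mi is about 1,484.2 mi, gap_pct about 12.8%
```

A zero gap means the direct route matched the baseline exactly; a positive gap quantifies the drift without saying anything about structural validity.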

Round 1 — Operator Workflow Comparison

Task: assess the fixed TSP ladder for a frontier anchor and preserve results. Common ladder target: tsp-004, tsp-005, tsp-006, optional tsp-007.

| Agent | Wall-clock | 5h session usage | Weekly usage | User interventions |
|---|---|---|---|---|
| Codex Frontier | ~2 min | 7% | 2% | none |
| Claude Frontier | ~5 min | 10% | 1% | one intervention |

Operator-observed session percentages, not provider token logs. Useful operational signal, not exact cross-provider token accounting.

Round 2 — World-Scale Extension

Same general pattern: both agents were asked to extend the ladder to the 100-city world rung.

| Agent | Wall-clock | Weekly usage | User interventions |
|---|---|---|---|
| Codex Frontier | under 2 min | 4% | none |
| Claude Frontier | ~3 min | 8% | one approval |

The pattern holds across both rounds: Codex Frontier was faster and lower friction, while Claude Frontier took longer and needed an intervention both times. The same interpretation applies: useful workflow signal, not exact token accounting.

Two-Round Case Study Tradeoff

Codex Frontier

Workflow efficiency win: faster in both rounds, lower observed weekly budget usage, zero interventions required. Better fit for burst-oriented or lower-supervision workflows.

Claude Frontier

Route quality win: produced the better world-scale direct route in the observed run (~7.5% gap vs the heuristic, compared with ~14.5% for Codex). Worth the overhead when final route quality matters more than throughput.

This is a case-study workflow claim, not a universal model ranking. Two rounds on one bounded task domain.

What Both Runs Agreed On

Despite different workflow profiles, both frontier assessments converged on the same high-level outcomes: structural validity across the full tested ladder, and no change to the orchestrated-path conclusion.

TSP Interpretation

The TSP lesson stayed stable even with frontier capability in the loop.

Stronger direct solving can delay or reduce obvious failure. That does not remove the architectural decision. The professional architecture question remains separate from raw direct capability.

The Phoenix line still holds:

correctness should live in the solver, not in the model

Stronger models delay failure; they do not eliminate the need for solver-backed architecture.
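As a sketch of what "correctness lives in the solver" means in practice, a deterministic nearest-neighbor heuristic (an assumed stand-in for the actual orchestrated solver, not taken from this case study) produces a structurally valid tour by construction, regardless of model capability:

```python
import math

def nearest_neighbor_tour(coords):
    # Deterministic greedy heuristic: start at city 0 and always move to
    # the nearest unvisited city. Validity is guaranteed by construction,
    # because every index enters the tour exactly once.
    unvisited = set(range(1, len(coords)))
    tour = [0]
    while unvisited:
        last = coords[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, coords[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

tour = nearest_neighbor_tour([(0, 0), (1, 0), (2, 0), (0, 5)])
assert sorted(tour) == [0, 1, 2, 3]  # no city dropped or duplicated, ever
```

Any model in the loop can propose inputs or interpret outputs, but route validity never depends on it; tour quality, not tour validity, is the only axis left to drift.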

Boundary

This page should be used as: bounded, operator-side workflow evidence for the frontier anchor on the TSP ladder.

This page should not be used as: a formal benchmark, a clean cross-table comparison with the local Ollama runs, or a universal model ranking.