Project Phoenix | Tau-Bench

The Problem

State-of-the-art function-calling agents still fail a large share of realistic tool-use tasks when measured end-to-end.

Tau-Bench highlights that reliability and policy adherence are system design problems, not just model-size problems.

Reference: taubench.com/#home

Tau-Bench Core Principles

SOP-grounded planning: break complex tasks into explicit, reproducible procedures.
Policy-first execution: enforce business or domain rules as explicit constraints.
Stateful evaluation: assess multi-turn tool workflows, not isolated prompts.
Consistency over retries: track repeatability with pass^k (the chance that all k independent trials of the same task succeed), not one lucky run.
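
The pass^k idea can be sketched as follows. This is a minimal illustration, assuming the usual combinatorial estimator: for a task run n times with c successes, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names and the `results` layout are hypothetical and are not taken from the Tau-Bench or Phoenix code.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials of one task ALL succeed.

    Assumed estimator: C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` maps a (hypothetical) task id to its list of trial outcomes.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)
```

With k = 1 this reduces to the plain success rate. As k grows, a task that succeeds only some of the time contributes less and less, which is why the metric rewards repeatability rather than a single lucky run.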

Tau-Bench Challenges -> Phoenix Solutions

Tau-Bench Challenge	Phoenix Solution
Long-context reasoning and planning	SOP-driven architecture: Hypothesize -> Plan -> Execute, with preview and checkpoints
Accurate adherence to complex policies	`rules.py` constraints, deterministic tool contracts, and explicit refusal conditions
Consistency at scale (pass^k)	Write-Then-Verify mandate, golden-file baselines, and regression validation suites
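
To make the second and third rows concrete, here is a minimal sketch of how policy constraints, explicit refusals, and a write-then-verify step might fit together around one tool call. Everything in it is an assumption for illustration: the `Rule` type, the three example rules, the in-memory `db` dictionary, and `cancel_order` are hypothetical and do not reflect the actual contents of `rules.py`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    allows: Callable[[dict, dict], bool]  # (db, request) -> True if permitted
    refusal: str                          # message returned on violation


# Hypothetical policy rules, checked in order before any write happens.
RULES = [
    Rule("order_exists",
         lambda db, req: req["order_id"] in db,
         "Unknown order."),
    Rule("user_owns_order",
         lambda db, req: db[req["order_id"]]["owner"] == req["user_id"],
         "Only the order owner may cancel it."),
    Rule("order_is_pending",
         lambda db, req: db[req["order_id"]]["status"] == "pending",
         "Only pending orders can be cancelled."),
]


def first_violation(db: dict, req: dict) -> Optional[Rule]:
    """Return the first rule the request breaks, or None if all pass."""
    for rule in RULES:
        if not rule.allows(db, req):
            return rule
    return None


def cancel_order(db: dict, req: dict) -> dict:
    """Policy-first tool: refuse explicitly, otherwise write then verify."""
    rule = first_violation(db, req)
    if rule is not None:
        # Explicit refusal: no state is touched when a rule fails.
        return {"ok": False, "refused_by": rule.name, "message": rule.refusal}

    db[req["order_id"]]["status"] = "cancelled"        # write
    if db[req["order_id"]]["status"] != "cancelled":   # verify by reading back
        raise RuntimeError("write could not be verified")
    return {"ok": True, "order_id": req["order_id"], "status": "cancelled"}
```

Because every refusal names the rule that triggered it, the same request should produce the same outcome on every run, which is the property a golden-file baseline or regression suite would then check.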

Construction Method Alignment

Stage I: Manual design of schema, APIs, and policies -> tools and rules design.
Stage II: Automatic data generation -> initial database creation.
Stage III: Manual task annotation -> conversational probing and Q/A sets.