Tau-Bench Principles
Tool-Agent-User Interaction Benchmark for Reliable Analytics
What is Tau-Bench?
Tau-Bench (Tool-Agent-User) is a benchmark framework for evaluating AI agents on their ability to interact with users, call tools and APIs, and follow domain-specific rules. TennisAgent applies Tau-Bench principles so that analyses of your tennis practice data are deterministic, reliable, and verifiable.
Core Principles
Pass^k Reliability
A query succeeds only if it produces correct results in all k attempts. TennisAgent tools are deterministic: given the same underlying data, "analyze my best session" returns identical results every time.
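One common way to estimate pass^k for a single task is to run n trials, count the c that succeeded, and compute C(c, k) / C(n, k), the fraction of k-sized subsets of trials in which every attempt passed. The sketch below assumes that estimator; the function name `pass_hat_k` is illustrative and not part of TennisAgent.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k attempts at one task all succeed.

    Assumes the estimator C(c, k) / C(n, k), where n is the number of
    recorded trials and c is the number of those trials that passed.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0
    # whenever there are fewer passing trials than k.
    return comb(num_successes, k) / comb(num_trials, k)


# A fully deterministic tool passes every trial, so pass^k stays at 1.0.
deterministic_score = pass_hat_k(num_trials=8, num_successes=8, k=4)

# A tool that fails 2 of 8 trials looks fine at k=1 but degrades as k grows.
flaky_k1 = pass_hat_k(num_trials=8, num_successes=6, k=1)
flaky_k4 = pass_hat_k(num_trials=8, num_successes=6, k=4)
```

Under this estimator only a tool that passes every recorded trial keeps a score of 1.0 as k increases, which is why determinism matters for this metric.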
Domain Policy Following
The agent follows 21 rules governing data integrity, query processing, and tool execution. Tennis-specific constraints ensure valid sensor readings and proper session linking.
User Simulation
Tau-Bench tests agents against dynamic user behavior: incomplete information, changed requests, and compound queries. TennisAgent handles mid-request corrections such as "show my best session, actually make it last week's" gracefully.
Database State Evaluation
Success is measured by comparing the expected and actual database states after a query runs. The outcome is binary pass/fail, with no partial credit for "close enough" analytics.
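A state comparison of this kind can be sketched by fingerprinting every table in a SQLite database and comparing the fingerprints. This is a minimal illustration, not TennisAgent's evaluator: the `sessions` table, its columns, the `"zepp"` device value, and the helper names are all assumptions made for the example.

```python
import hashlib
import sqlite3


def state_fingerprint(conn: sqlite3.Connection) -> str:
    """Hash all user tables so that row order does not affect the result."""
    digest = hashlib.sha256()
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    for table in tables:
        digest.update(table.encode())
        # Sort by repr so rows with mixed types or NULLs still order stably.
        rows = sorted(conn.execute(f'SELECT * FROM "{table}"').fetchall(), key=repr)
        for row in rows:
            digest.update(repr(row).encode())
    return digest.hexdigest()


def make_db(rows):
    """Build a hypothetical in-memory database for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (id TEXT, device TEXT)")
    conn.executemany("INSERT INTO sessions VALUES (?, ?)", rows)
    return conn


expected = make_db([("s1", "watch"), ("s2", "zepp")])
same_rows = make_db([("s2", "zepp"), ("s1", "watch")])  # same data, other order
drifted = make_db([("s1", "watch"), ("s2", "watch")])   # one value differs

passed = state_fingerprint(expected) == state_fingerprint(same_rows)
failed = state_fingerprint(expected) != state_fingerprint(drifted)
```

Comparing a single hash keeps the verdict binary: any differing value in any table changes the fingerprint, so there is no room for partial credit.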
Structured Tool Use
All 100+ tools have typed parameters and explicit return schemas. For example, detect_swings(session_id: string, threshold: number) returns a SwingResult, so its output is never ambiguous.
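In Python, a typed tool along these lines might look like the sketch below. Only the name `detect_swings`, its two parameters, and the `SwingResult` return type come from the text above; the fields of `SwingResult`, the `_FAKE_READINGS` data, and the counting logic are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwingResult:
    # Hypothetical fields; the real schema is not shown in this document.
    session_id: str
    threshold: float
    swing_count: int


# Stand-in sensor magnitudes keyed by session id (illustrative data only).
_FAKE_READINGS = {"s1": [0.2, 1.8, 2.4, 0.9, 3.1]}


def detect_swings(session_id: str, threshold: float) -> SwingResult:
    """Count readings at or above the threshold and return a typed result."""
    if not isinstance(session_id, str) or not session_id:
        raise TypeError("session_id must be a non-empty string")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise TypeError("threshold must be a number")
    readings = _FAKE_READINGS[session_id]
    swing_count = sum(1 for value in readings if value >= threshold)
    return SwingResult(session_id, float(threshold), swing_count)


first = detect_swings("s1", 1.5)
second = detect_swings("s1", 1.5)
```

Because the result is a frozen dataclass, two calls with the same inputs compare equal field by field, which makes repeated runs straightforward to check.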
Compound Request Handling
Multi-part queries must be fully resolved. "Analyze my best session AND compare it to last week" requires both operations; partial completion counts as failure.
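One way to enforce full resolution is to track which required sub-operations have finished and report success only when none are missing. The class and step names below are hypothetical and are not taken from PlanBuilder or any other TennisAgent component.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRequest:
    """Tracks the sub-operations a multi-part query must complete."""

    required: frozenset
    completed: set = field(default_factory=set)

    def mark_done(self, step: str) -> None:
        if step not in self.required:
            raise ValueError(f"unexpected step: {step!r}")
        self.completed.add(step)

    def missing(self) -> set:
        return set(self.required) - self.completed

    def resolved(self) -> bool:
        # All-or-nothing: a single missing step means the request failed.
        return not self.missing()


request = CompoundRequest(
    required=frozenset({"analyze_best_session", "compare_to_last_week"})
)
request.mark_done("analyze_best_session")
resolved_after_one = request.resolved()
still_missing = request.missing()

request.mark_done("compare_to_last_week")
resolved_after_both = request.resolved()
```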
Common Failure Modes
Tau-Bench identifies three key failure modes that TennisAgent is designed to avoid:
Wrong Argument
Using incorrect parameters in tool calls. Example: passing "AppleWatch" when the database expects "watch" as the device filter.
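Validating arguments against a closed set of values before a tool runs is one way to catch this failure early. In the sketch below, `"watch"` comes from the example above; the `"zepp"` value, the `DeviceFilter` enum, and `parse_device_filter` are assumptions made for illustration.

```python
from enum import Enum


class DeviceFilter(Enum):
    WATCH = "watch"  # value named in the example above
    ZEPP = "zepp"    # assumed value, for illustration only


def parse_device_filter(raw: str) -> DeviceFilter:
    """Reject any device filter the database would not recognize."""
    try:
        return DeviceFilter(raw)
    except ValueError:
        allowed = ", ".join(sorted(member.value for member in DeviceFilter))
        raise ValueError(
            f"unknown device filter {raw!r}; expected one of: {allowed}"
        ) from None


valid = parse_device_filter("watch")

try:
    parse_device_filter("AppleWatch")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

Failing loudly with the list of accepted values keeps the bad argument from reaching the query, where it would otherwise match no rows without raising any error.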
Wrong Decision
Logical errors despite correct information. Example: returning Zepp data when the user requested an Apple Watch session.
Partial Resolution
Completing only part of a compound request. Example: finding the best session but forgetting to visualize it as requested.
Tau-Bench Implementation
| Component | Tau-Bench Description | TennisAgent Implementation |
|---|---|---|
| Database | Structured data storage | SQLite: tennis_watch.db, ztennis.db, BabPopExt.db |
| APIs | Typed function interfaces | 100+ Python tools in ToolRegistry |
| Policies | Natural language rules | 21 domain rules for tennis analytics |
| Tasks | User interaction scenarios | Pattern-matched query templates in PlanBuilder |
| Evaluation | State comparison | Deterministic output matching via ExecutionContext |