What is Tau-Bench?

Tau-Bench (Tool-Agent-User) is a benchmark framework for evaluating how well AI agents interact with users, call tools and APIs, and follow domain-specific rules. TennisAgent applies Tau-Bench principles so that analysis of your tennis practice data is deterministic, reliable, and verifiable.

Core Principles

01

Pass^k Reliability

A query succeeds only if it produces correct results in all k attempts. TennisAgent tools are deterministic: "analyze my best session" returns identical results every time, given the same underlying data.
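The all-k-attempts criterion can be made concrete with a small estimator. The sketch below assumes each task was run n times with c correct runs and uses the combinatorial ratio C(c, k) / C(n, k), which is the usual way a pass^k score is estimated. The function names are illustrative and are not part of TennisAgent.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k attempts at one task all succeed.

    Computes C(c, k) / C(n, k), where n is the number of recorded trials
    and c is how many of those trials were correct.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # math.comb returns 0 when k > num_successes, so the score drops to 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)
```

A fully deterministic tool scores 1.0 for every k, while a single wrong run out of eight already halves the score at k = 4 (35 / 70 = 0.5).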

02

Domain Policy Following

The agent follows 21 rules governing data integrity, query processing, and tool execution. Tennis-specific constraints ensure valid sensor readings and proper session linking.

03

User Simulation

Tau-Bench tests agents against dynamic user behavior: incomplete information, changed requests, and compound queries. TennisAgent handles a mid-request correction such as "show my best session, actually make it last week's" gracefully.

04

Database State Evaluation

Success is measured by comparing the expected database state with the actual state after a query runs. The result is binary pass/fail, with no partial credit for "close enough" analytics.
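One way to implement a binary state comparison is to fingerprint the database contents and compare fingerprints. The following is a minimal sketch using Python's built-in sqlite3 module. The sessions table, its columns, and the helper names are assumptions made for illustration, not TennisAgent's actual schema or evaluation code.

```python
import hashlib
import sqlite3


def state_fingerprint(conn: sqlite3.Connection) -> str:
    """Hash every user table's rows in a stable, order-independent way."""
    digest = hashlib.sha256()
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        digest.update(table.encode())
        rows = conn.execute(f'SELECT * FROM "{table}"').fetchall()
        # Sort rows so that insertion order does not affect the result.
        for row in sorted(rows, key=repr):
            digest.update(repr(row).encode())
    return digest.hexdigest()


def make_db(rows: list[tuple[str, str, int]]) -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical sessions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (id TEXT, device TEXT, swings INTEGER)")
    conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)", rows)
    return conn


def passes(expected: sqlite3.Connection, actual: sqlite3.Connection) -> bool:
    """Binary verdict: the states match exactly or the check fails."""
    return state_fingerprint(expected) == state_fingerprint(actual)
```

Under this scheme, a swing count that is off by one fails just as decisively as a missing row, which matches the "no partial credit" rule.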

05

Structured Tool Use

All 100+ tools have typed parameters and explicit return schemas. For example, detect_swings(session_id: string, threshold: number) returns a SwingResult, so outputs are never ambiguous.
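To illustrate what a typed tool with an explicit return schema might look like in Python, here is a hedged sketch. Only the names detect_swings, session_id, threshold, and SwingResult come from the text above. The fields of SwingResult, the sample data, and the rising-edge detection logic are assumptions for illustration and may differ from the real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwingResult:
    """Explicit return schema (field names are hypothetical)."""
    session_id: str
    threshold: float
    swing_count: int
    start_indices: tuple[int, ...]


# Hypothetical sample data standing in for real sensor magnitudes.
_SAMPLE_MAGNITUDES = {"demo": [0.2, 1.8, 0.4, 2.5, 2.6, 0.3, 1.9]}


def detect_swings(session_id: str, threshold: float) -> SwingResult:
    """Count a swing each time the signal rises to or above the threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if session_id not in _SAMPLE_MAGNITUDES:
        raise KeyError(f"unknown session_id: {session_id!r}")
    starts = []
    above = False
    for index, value in enumerate(_SAMPLE_MAGNITUDES[session_id]):
        if value >= threshold and not above:
            starts.append(index)  # rising edge marks the start of a swing
        above = value >= threshold
    return SwingResult(session_id, threshold, len(starts), tuple(starts))
```

Because the function has no randomness and no hidden state, calling it twice with the same arguments returns equal SwingResult values.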

06

Compound Request Handling

Multi-part queries must be fully resolved. "Analyze my best session AND compare it to last week" requires both operations; partial completion counts as failure.
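The all-or-nothing rule for compound requests can be sketched as a small runner that records which parts failed and reports success only when none did. The class and function names here are hypothetical and do not describe TennisAgent's PlanBuilder or ExecutionContext.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CompoundResult:
    """Outputs of each sub-request plus the names of any that failed."""
    outputs: dict[str, object] = field(default_factory=dict)
    failed: list[str] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # Partial completion is failure: every step must have succeeded.
        return not self.failed


def run_compound(steps: dict[str, Callable[[], object]]) -> CompoundResult:
    """Run every named step and collect outputs and failures."""
    result = CompoundResult()
    for name, step in steps.items():
        try:
            result.outputs[name] = step()
        except Exception:
            result.failed.append(name)
    return result
```

Running every step even after a failure is a design choice in this sketch: it lets the caller report exactly which parts of the request were left unresolved.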

Common Failure Modes

Tau-Bench identifies three key failure modes that TennisAgent is designed to avoid:

Wrong Argument

Using incorrect parameters in a tool call. Example: passing "AppleWatch" when the database expects "watch" as the device filter.
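A common guard against wrong-argument errors is to validate filter values against a closed set before any query runs. In the sketch below, only the "watch" value comes from the example above. The Device enum, the "zepp" value, and parse_device are assumptions for illustration.

```python
from enum import Enum


class Device(str, Enum):
    """Allowed device filter values ("zepp" is an assumed second value)."""
    WATCH = "watch"
    ZEPP = "zepp"


def parse_device(raw: str) -> Device:
    """Reject any device filter the database would not recognize."""
    try:
        return Device(raw)
    except ValueError:
        allowed = ", ".join(d.value for d in Device)
        raise ValueError(
            f"unknown device filter {raw!r}; expected one of: {allowed}"
        ) from None
```

With this check in place, "AppleWatch" fails loudly at the tool boundary instead of silently matching zero rows.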

Wrong Decision

Making a logical error despite having correct information. Example: returning Zepp data when the user requested an Apple Watch session.

Partial Resolution

Completing only part of a compound request. Example: finding the best session but forgetting to also visualize it as requested.

Tau-Bench Implementation

Component | Tau-Bench Description | TennisAgent Implementation
Database | Structured data storage | SQLite: tennis_watch.db, ztennis.db, BabPopExt.db
APIs | Typed function interfaces | 100+ Python tools in ToolRegistry
Policies | Natural language rules | 21 domain rules for tennis analytics
Tasks | User interaction scenarios | Pattern-matched query templates in PlanBuilder
Evaluation | State comparison | Deterministic output matching via ExecutionContext

Why This Matters for Your Tennis Data

When you ask "how many swings did I hit yesterday?", you need a reliable answer. Tau-Bench principles ensure TennisAgent returns the same accurate count whether you ask once or ten times. No hallucinated statistics, no inconsistent swing counts, no mysterious changes between queries.