~*~ Tau-Bench Principles ~*~



Tool-Agent-User Interaction Benchmark

Benchmarking Reliable Agent Systems

What is Tau-Bench?

Tau-Bench (Tool-Agent-User) is a benchmark framework for evaluating AI agents on their ability to interact with users, use tools/APIs, and follow domain-specific rules. ParableAgent is designed using Tau-Bench principles to ensure consistent, reliable, and verifiable results.


Core Principles

📊

1. Pass^k Reliability

A task only succeeds if the agent completes it correctly in ALL k attempts. This strict metric ensures consistent performance, not lucky one-offs. ParableAgent targets 100% reliability across repeated queries.

> pass^5("mercy parables")

// Must succeed 5/5 times!
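The pass^k rule above can be sketched in a few lines of Python. This is an illustrative implementation, not ParableAgent's actual harness; the function names are assumptions.

```python
def pass_at_k(attempts: list[bool]) -> bool:
    """pass^k: the task counts as solved only if EVERY one of the
    k attempts succeeded -- a single failure fails the whole task."""
    return len(attempts) > 0 and all(attempts)

def pass_hat_k_rate(results_per_task: dict[str, list[bool]]) -> float:
    """Fraction of tasks whose k attempts all succeeded."""
    solved = sum(pass_at_k(r) for r in results_per_task.values())
    return solved / len(results_per_task)

# Five independent runs of the same query -- one miss fails pass^5:
print(pass_at_k([True, True, True, False, True]))  # False
print(pass_at_k([True] * 5))                       # True
```

Note how much stricter this is than an average success rate: 4/5 correct answers scores 80% on pass^1 but 0% on pass^5.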

📝

2. Domain Policy Following

Agents must follow domain-specific rules written in natural language. ParableAgent adheres to 21 rules governing data integrity, query processing, and response formatting.

> check_policy("valid_theme")

// Rule: Theme must be in 14 defined themes
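A natural-language rule like "theme must be one of the 14 defined themes" compiles down to a simple membership check. The theme names below are hypothetical stand-ins for the real list, which is not given here.

```python
# Hypothetical subset of the 14 defined themes (illustrative names only).
VALID_THEMES = {"mercy", "forgiveness", "kingdom", "prayer"}

def check_policy_valid_theme(theme: str) -> bool:
    """Rule: a queried theme must be one of the defined themes.
    Input is normalized so casing and whitespace don't cause failures."""
    return theme.lower().strip() in VALID_THEMES

print(check_policy_valid_theme("Mercy"))     # normalized match -> True
print(check_policy_valid_theme("merciful"))  # not a defined theme -> False
```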

🗣

3. User Simulation

Tau-Bench uses dynamic user simulation to test agent responses. Users may have incomplete information, change their minds, or make compound requests. The agent must handle all scenarios gracefully.

> user: "Actually, show me Luke's parables"

// Agent must adapt to changes

🗃

4. Database State Evaluation

Success is measured by comparing database states: the expected final state versus the actual state after the agent's actions. This provides deterministic, verifiable evaluation with no ambiguity.

> verify_state(expected, actual)

// Binary: PASS or FAIL
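State comparison reduces to a deep equality check over structured data. The schema below (a dict of themes to parable titles) is an illustrative assumption about how such a state might look.

```python
def verify_state(expected: dict, actual: dict) -> bool:
    """Binary evaluation: deep equality of expected vs. actual DB state.
    Python's == compares nested dicts/lists by value, so any divergence
    anywhere in the structure yields FAIL."""
    return expected == actual

expected = {"parables": {"mercy": ["The Prodigal Son", "The Lost Sheep"]}}
actual   = {"parables": {"mercy": ["The Prodigal Son", "The Lost Sheep"]}}

print("PASS" if verify_state(expected, actual) else "FAIL")  # PASS
```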

🔧

5. Structured Tool Use

Agents interact via defined APIs with typed parameters. ParableAgent's 21 tools each have explicit inputs and outputs - no ambiguous function calls or undefined behaviors.

> search_by_theme(theme: string)

// Returns: Parable[] (typed!)
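A typed tool in this style can be sketched with a dataclass and an explicit signature. The `Parable` fields and the tiny in-memory database are illustrative assumptions; only `search_by_theme(theme: string)` comes from the text above.

```python
from dataclasses import dataclass

@dataclass
class Parable:
    title: str
    gospel: str
    theme: str

# Tiny in-memory stand-in for the parable database (illustrative records).
DB = [
    Parable("The Lost Sheep", "Luke", "mercy"),
    Parable("The Sower", "Mark", "kingdom"),
]

def search_by_theme(theme: str) -> list[Parable]:
    """Typed tool: explicit input (theme: str), explicit output
    (list[Parable]) -- no ambiguous arguments or return shapes."""
    return [p for p in DB if p.theme == theme]

print([p.title for p in search_by_theme("mercy")])  # ['The Lost Sheep']
```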

🎯

6. Compound Request Handling

Users often make multiple requests at once. A robust agent must fully resolve ALL parts - partial completion is a failure. "Show mercy parables AND count them" requires both actions.

> "List AND compare the lost parables"

// Must do BOTH operations!
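One way to picture compound handling: split the request into sub-tasks and resolve every one, so that a missing result is detectable as partial resolution. The keyword-based splitting below is a naive illustrative sketch, not ParableAgent's actual parser.

```python
def handle_compound(request: str, parables: list[str]) -> dict:
    """Resolve ALL clauses of a compound request. A caller can treat
    any expected-but-missing key as a partial-resolution failure."""
    results = {}
    text = request.lower()
    if "show" in text or "list" in text:
        results["list"] = parables
    if "count" in text:
        results["count"] = len(parables)
    return results

mercy = ["The Prodigal Son", "The Lost Sheep", "The Lost Coin"]
out = handle_compound("Show mercy parables AND count them", mercy)
print(out)  # both 'list' and 'count' must be present
```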


Common Agent Failure Modes

Tau-Bench identifies these key challenges for agents:

Wrong Argument

Agent uses incorrect parameters when calling tools. Example: searching for theme "merciful" when the database expects "mercy".

Wrong Decision

Agent makes logical errors despite having correct information. Example: showing Matthew parables when user asked for Mark.

½

Partial Resolution

Agent only completes part of a compound request. Example: listing parables but forgetting to also compare them as requested.
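The wrong-argument failure mode above ("merciful" vs. "mercy") can be guarded against by normalizing and validating tool arguments before execution. The alias map below is a hypothetical illustration of that guard.

```python
VALID_THEMES = {"mercy", "forgiveness", "kingdom"}

# Hypothetical alias map: near-miss arguments mapped to canonical values.
ALIASES = {"merciful": "mercy", "forgiving": "forgiveness"}

def normalize_theme(raw: str) -> str:
    """Canonicalize a theme argument, and fail loudly instead of
    silently calling a tool with a bad argument."""
    theme = ALIASES.get(raw.lower().strip(), raw.lower().strip())
    if theme not in VALID_THEMES:
        raise ValueError(f"unknown theme: {raw!r}")
    return theme

print(normalize_theme("merciful"))  # -> mercy
```

Rejecting an unknown argument up front turns a silent wrong-argument failure into an explicit, recoverable error.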


Tau-Bench Evaluation Approach

Component  | Description                 | ParableAgent Implementation
-----------|-----------------------------|-----------------------------
Database   | Structured data storage     | 39 parables in JSON format
APIs       | Typed function interfaces   | 21 Python tools
Policies   | Natural language rules      | 21 domain rules in Markdown
Tasks      | User interaction scenarios  | 21 defined task templates
Evaluation | State comparison            | Deterministic output matching

For biblical scholarship, reliability is paramount. Tau-Bench principles ensure ParableAgent is not just accurate once, but consistently accurate every time. The pass^k metric means you can trust the same query to return the same results, every time - essential for serious study and teaching.