Tool-Agent-User Interaction Benchmark
Benchmarking Reliable Agent Systems
What is Tau-Bench?
Tau-Bench (Tool-Agent-User) is a benchmark framework for evaluating AI agents on their ability to interact with users, call tools/APIs, and follow domain-specific rules. ParableAgent is designed around Tau-Bench principles to ensure consistent, reliable, and verifiable results.
Core Principles
1. Pass^k Reliability
A task only succeeds if the agent completes it correctly in ALL k attempts. This strict metric ensures consistent performance, not lucky one-offs. ParableAgent targets 100% reliability across repeated queries.
> pass^5("mercy parables")
// Must succeed 5/5 times!
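The pass^k rule can be sketched in a few lines of Python. The function name and the attempt harness below are illustrative, not the Tau-Bench implementation:

```python
# Hypothetical sketch of the pass^k metric: a task passes only if
# every one of k independent attempts succeeds.
def pass_k(run_task, k: int) -> bool:
    """Return True only if all k attempts of the task succeed."""
    return all(run_task() for _ in range(k))

# A flaky agent that succeeds 4 out of 5 times still fails pass^5.
attempts = iter([True, True, True, False, True])
print(pass_k(lambda: next(attempts), 5))  # False: one failure sinks the task
```

Note that a single failure is enough: `all()` short-circuits, so a 99%-reliable agent still scores zero on a strict pass^k task.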
2. Domain Policy Following
Agents must follow domain-specific rules written in natural language. ParableAgent adheres to 21 rules governing data integrity, query processing, and response formatting.
> check_policy("valid_theme")
// Rule: Theme must be in 14 defined themes
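A natural-language rule like the one above ultimately compiles down to a concrete check. This is a minimal sketch; the theme set here is a hypothetical subset, since the real ParableAgent policy defines 14 themes:

```python
# Illustrative policy check; VALID_THEMES is a made-up subset for demonstration,
# not the actual 14-theme list from the ParableAgent policy document.
VALID_THEMES = {"mercy", "forgiveness", "kingdom"}

def check_policy_valid_theme(theme: str) -> bool:
    """Rule: a query theme must be one of the defined themes."""
    return theme.lower() in VALID_THEMES

print(check_policy_valid_theme("mercy"))     # True
print(check_policy_valid_theme("merciful"))  # False: not a defined theme
```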
3. User Simulation
Tau-Bench uses dynamic user simulation to test agent responses. Users may have incomplete information, change their minds, or make compound requests. The agent must handle all scenarios gracefully.
> user: "Actually, show me Luke's parables"
// Agent must adapt to changes
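A scripted user simulator for the change-of-mind scenario might look like the sketch below. All class and method names are illustrative, not the Tau-Bench API; the point is that the agent must act on the *latest* stated intent:

```python
# Minimal sketch of a scripted user simulator whose requests change mid-conversation.
class SimulatedUser:
    def __init__(self, turns):
        self.turns = iter(turns)

    def next_message(self):
        # Return the next user turn, or None when the conversation ends.
        return next(self.turns, None)

user = SimulatedUser([
    "Show me Matthew's parables",
    "Actually, show me Luke's parables",  # the user changes their mind
])

latest_request = None
while (msg := user.next_message()) is not None:
    latest_request = msg  # agent tracks the most recent intent

print(latest_request)  # "Actually, show me Luke's parables"
```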
4. Database State Evaluation
Success is measured by comparing database states - the expected final state vs actual state after agent actions. This provides deterministic, verifiable evaluation with no ambiguity.
> verify_state(expected, actual)
// Binary: PASS or FAIL
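State comparison is simple enough to sketch directly. Assuming each state is a plain dictionary (the field names below are hypothetical), the check is a deterministic equality test:

```python
# Sketch of database-state evaluation: compare the expected final state
# to the actual state after the agent's actions. Binary, no ambiguity.
def verify_state(expected: dict, actual: dict) -> str:
    return "PASS" if expected == actual else "FAIL"

expected = {"favorites": ["Prodigal Son"], "last_query": "mercy"}
actual   = {"favorites": ["Prodigal Son"], "last_query": "mercy"}
print(verify_state(expected, actual))  # PASS
```

Because the comparison is over final state rather than over the conversation transcript, two very different dialogues that reach the same database state both count as success.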
5. Structured Tool Use
Agents interact via defined APIs with typed parameters. ParableAgent's 21 tools each have explicit inputs and outputs - no ambiguous function calls or undefined behaviors.
> search_by_theme(theme: string)
// Returns: Parable[] (typed!)
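A typed tool of this shape can be sketched with dataclasses. The `Parable` fields and the tiny in-memory database are illustrative assumptions, not the actual ParableAgent schema:

```python
# Sketch of a typed tool interface: explicit input type, explicit return type.
from dataclasses import dataclass

@dataclass
class Parable:
    title: str    # field names are hypothetical, not the real schema
    gospel: str
    theme: str

DB = [
    Parable("The Prodigal Son", "Luke", "mercy"),
    Parable("The Lost Sheep", "Luke", "mercy"),
    Parable("The Sower", "Mark", "kingdom"),
]

def search_by_theme(theme: str) -> list[Parable]:
    """Typed input, typed output: no ambiguous calls or undefined behavior."""
    return [p for p in DB if p.theme == theme]

print([p.title for p in search_by_theme("mercy")])
# ['The Prodigal Son', 'The Lost Sheep']
```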
6. Compound Request Handling
Users often make multiple requests at once. A robust agent must fully resolve ALL parts - partial completion is a failure. "Show mercy parables AND count them" requires both actions.
> "List AND compare the lost parables"
// Must do BOTH operations!
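The all-or-nothing rule for compound requests can be sketched as follows. The dispatch table and the way sub-requests are represented are assumptions for illustration:

```python
# Sketch of compound-request handling: every sub-request must be resolved,
# or the whole task fails. Partial completion counts as failure.
def handle_compound(parts, dispatch) -> bool:
    done = [dispatch[action]() for action in parts]
    return all(done)

dispatch = {
    "list": lambda: True,    # pretend the listing succeeded
    "count": lambda: False,  # pretend the count was skipped
}
print(handle_compound(["list", "count"], dispatch))  # False: only partial
```

This mirrors the pass/fail semantics of state evaluation: doing half the work scores the same as doing none of it.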
Common Agent Failure Modes
Tau-Bench identifies these key challenges for agents:
Wrong Argument
Agent uses incorrect parameters when calling tools. Example: searching for theme "merciful" when the database expects "mercy".
Wrong Decision
Agent makes logical errors despite having correct information. Example: showing Matthew parables when user asked for Mark.
Partial Resolution
Agent only completes part of a compound request. Example: listing parables but forgetting to also compare them as requested.
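The "wrong argument" failure mode in particular can be mitigated by normalizing near-miss inputs to a canonical value before calling the tool. This is a hypothetical guard using fuzzy matching from the standard library, not part of Tau-Bench or ParableAgent:

```python
# Illustrative guard: map a near-miss theme like "merciful" to the
# canonical "mercy" before the tool call. The theme list is made up.
import difflib

THEMES = ["mercy", "forgiveness", "kingdom"]

def normalize_theme(raw: str):
    """Return the closest canonical theme, or None if nothing is close enough."""
    match = difflib.get_close_matches(raw.lower(), THEMES, n=1, cutoff=0.6)
    return match[0] if match else None

print(normalize_theme("merciful"))  # "mercy"
print(normalize_theme("zzzz"))      # None: reject rather than guess wildly
```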
Tau-Bench Evaluation Approach
| Component | Description | ParableAgent Implementation |
|---|---|---|
| Database | Structured data storage | 39 parables in JSON format |
| APIs | Typed function interfaces | 21 Python tools |
| Policies | Natural language rules | 21 domain rules in Markdown |
| Tasks | User interaction scenarios | 21 defined task templates |
| Evaluation | State comparison | Deterministic output matching |
Why Tau-Bench Matters
For biblical scholarship, reliability is paramount. Tau-Bench principles ensure ParableAgent is not just accurate once, but consistently accurate every time. The pass^k metric means you can trust the same query to always return the same results - essential for serious study and teaching.