Project Phoenix Meta-Domain

Benchmark Platform

Hypothesis Testing Across 5 Providers

The Core Hypothesis

"Local LLMs perform better with abstracted meta-tools (14 domain tools) than direct access to 649+ individual tools."

This hypothesis is tested through systematic benchmarking across:

  • 5 LLM providers: Ollama, Anthropic, OpenAI, DeepSeek, Google
  • 2 modes: Meta-tool (14 tools) vs Direct (649+ tools)
  • 150+ queries: Cross-domain validation set

Sprint 4 Baseline Results

  • Routing Accuracy: 80.6%
  • Execution Accuracy: 57.3%
  • Test Queries: 103
  • Correct Routes: 83

Per-Domain Results

| Domain       | Queries | Correct | Accuracy | Status     |
|--------------|---------|---------|----------|------------|
| TennisAgent  | 15      | 15      | 100%     | Perfect    |
| YieldModel   | 10      | 10      | 100%     | Perfect    |
| QCiAgent     | 10      | 10      | 100%     | Perfect    |
| PPR_Agent    | 8       | 8       | 100%     | Perfect    |
| Portfolio    | 12      | 11      | 91.7%    | High       |
| Optiver      | 10      | 8       | 80%      | Good       |
| ParableAgent | 10      | 8       | 80%      | Good       |
| AI_WQ        | 8       | 6       | 75%      | Moderate   |
| Stan         | 10      | 4       | 40%      | Needs work |
| WQ           | 10      | 3       | 25%      | Needs work |
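The per-domain figures can be cross-checked against the overall baseline with simple ratio arithmetic; a minimal sketch using the (queries, correct) pairs from the table above:

```python
# (queries, correct) pairs copied from the per-domain table above.
domains = {
    "TennisAgent": (15, 15), "YieldModel": (10, 10), "QCiAgent": (10, 10),
    "PPR_Agent": (8, 8), "Portfolio": (12, 11), "Optiver": (10, 8),
    "ParableAgent": (10, 8), "AI_WQ": (8, 6), "Stan": (10, 4), "WQ": (10, 3),
}

# Per-domain accuracy, one line per domain.
for name, (queries, correct) in domains.items():
    print(f"{name:12s} {correct:2d}/{queries:2d} = {correct / queries:.1%}")

# Totals reproduce the Sprint 4 baseline: 83/103 = 80.6% routing accuracy.
total_correct = sum(c for _, c in domains.values())
total_queries = sum(q for q, _ in domains.values())
print(f"Overall: {total_correct}/{total_queries} = {total_correct / total_queries:.1%}")
```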

LLM Provider Configuration

```python
# benchmark/config.py - Provider Configuration
PROVIDERS = {
    "ollama": {
        "base_url": "http://localhost:11434",
        "models": ["qwen2.5:14b", "llama3.1:8b", "mistral:7b"],
        "mode": "meta",  # Recommended for local
    },
    "anthropic": {
        "models": ["claude-3-5-sonnet-20241022", "claude-3-haiku"],
        "mode": "direct",  # Can handle 649+ tools
    },
    "openai": {
        "models": ["gpt-4o", "gpt-4o-mini"],
        "mode": "direct",
    },
    "deepseek": {
        "models": ["deepseek-chat", "deepseek-coder"],
        "mode": "meta",
    },
    "google": {
        "models": ["gemini-1.5-pro", "gemini-1.5-flash"],
        "mode": "direct",
    },
}
```

Running Benchmarks

```shell
# CLI Benchmark Commands

# Meta-tool mode with Ollama
python benchmark.py queries/cross_domain.csv -p ollama --mode meta

# Direct mode with Anthropic
python benchmark.py queries/cross_domain.csv -p anthropic --mode direct

# With LLM routing fallback
python benchmark.py queries/cross_domain.csv -p ollama --mode meta --llm-routing

# Analyze results
python analysis.py results/ollama_meta_20250115.csv
```
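The flag surface implied by those commands can be sketched with `argparse`; this is an assumed reconstruction of `benchmark.py`'s argument parsing, not its actual source (only the flag names come from the examples above):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser matching the CLI examples above."""
    p = argparse.ArgumentParser(prog="benchmark.py")
    p.add_argument("queries", help="CSV file of benchmark queries")
    p.add_argument("-p", "--provider",
                   choices=["ollama", "anthropic", "openai", "deepseek", "google"],
                   required=True)
    p.add_argument("--mode", choices=["meta", "direct"], default="meta")
    p.add_argument("--llm-routing", action="store_true",
                   help="fall back to LLM routing below the confidence threshold")
    return p

args = build_parser().parse_args(
    ["queries/cross_domain.csv", "-p", "ollama", "--mode", "meta", "--llm-routing"]
)
print(args.provider, args.mode, args.llm_routing)  # ollama meta True
```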

Query Categories

| Category     | Queries | Example                                     |
|--------------|---------|---------------------------------------------|
| Clear Domain | 60      | "analyze my forehand consistency"           |
| Ambiguous    | 25      | "show me my performance trends"             |
| Cross-Domain | 18      | "compare my tennis and investment returns"  |
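The validation set is a query CSV (`queries/cross_domain.csv` in the commands above); a minimal parsing sketch, assuming `query` and `category` columns — the column names are an assumption, and the sample rows are the examples from the table:

```python
import csv
import io
from collections import Counter

# In-memory stand-in for queries/cross_domain.csv; column names assumed.
SAMPLE = """query,category
analyze my forehand consistency,Clear Domain
show me my performance trends,Ambiguous
compare my tennis and investment returns,Cross-Domain
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = Counter(row["category"] for row in rows)
print(counts)
```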

Hypothesis Matrix

Comparing meta-tool vs direct mode across providers.

| Provider               | Meta-Tool | Direct | Winner    |
|------------------------|-----------|--------|-----------|
| Ollama (qwen2.5:14b)   | 78%       | 45%    | Meta-Tool |
| Anthropic (Claude 3.5) | 82%       | 89%    | Direct    |
| OpenAI (GPT-4o)        | 80%       | 86%    | Direct    |
| DeepSeek               | 75%       | 52%    | Meta-Tool |

Hypothesis confirmed: Local LLMs benefit from meta-tool abstraction; cloud LLMs can handle direct mode.
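The Winner column follows mechanically from comparing the two accuracy columns; a small check using the figures from the matrix above:

```python
# (meta, direct) accuracies from the hypothesis matrix above.
matrix = {
    "Ollama (qwen2.5:14b)": (0.78, 0.45),
    "Anthropic (Claude 3.5)": (0.82, 0.89),
    "OpenAI (GPT-4o)": (0.80, 0.86),
    "DeepSeek": (0.75, 0.52),
}

for provider, (meta, direct) in matrix.items():
    winner = "Meta-Tool" if meta > direct else "Direct"
    print(f"{provider}: {winner} by {abs(meta - direct):.0%}")
```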

Failure Analysis

Common Routing Failures

  • WQ vs Stan overlap: Statistical methods appear in both
  • AI_WQ vs generic ML: "deep learning" too broad
  • Portfolio vs Optiver: Finance terms overlap

Improvement Strategies

  • Add negative patterns (what domain is NOT)
  • Increase LLM routing threshold to 0.6
  • Add domain-specific keywords to patterns
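The strategies above can be sketched together, assuming a pattern-based router: each domain gets positive and negative keyword patterns, and any score below the 0.6 threshold defers to the LLM router. All pattern contents and the scoring function are illustrative assumptions, not the project's actual router:

```python
# Illustrative patterns for the two weakest domains from the table above;
# "negative" patterns say what a domain is NOT, per the first strategy.
DOMAIN_PATTERNS = {
    "Stan": {"positive": ["bayesian", "mcmc", "posterior"],
             "negative": ["alpha", "worldquant"]},
    "WQ":   {"positive": ["alpha", "worldquant", "simulate"],
             "negative": ["posterior"]},
}

LLM_ROUTING_THRESHOLD = 0.6  # below this, defer to the LLM router

def score(query: str, patterns: dict) -> float:
    """Fraction of positive keywords hit, penalized by negative hits."""
    q = query.lower()
    hits = sum(kw in q for kw in patterns["positive"])
    misses = sum(kw in q for kw in patterns["negative"])
    return max(0.0, (hits - misses) / max(len(patterns["positive"]), 1))

def route(query: str) -> str:
    """Pick the best-scoring domain, or fall back below the threshold."""
    best, best_score = max(
        ((d, score(query, p)) for d, p in DOMAIN_PATTERNS.items()),
        key=lambda t: t[1],
    )
    return best if best_score >= LLM_ROUTING_THRESHOLD else "llm_fallback"

print(route("run mcmc and check the posterior"))  # Stan
print(route("show me my performance trends"))     # llm_fallback
```

This directly addresses the WQ-vs-Stan overlap from the failure analysis: "posterior" is a negative pattern for WQ, so statistical queries no longer score into the alpha-research domain.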