The Core Hypothesis
"Local LLMs perform better with abstracted meta-tools (14 domain tools) than with direct access to 649+ individual tools."
This hypothesis is tested through systematic benchmarking across:
- 5 LLM providers: Ollama, Anthropic, OpenAI, DeepSeek, Google
- 2 modes: Meta-tool (14 tools) vs Direct (649+ tools)
- 150+ queries: Cross-domain validation set
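The meta-tool abstraction can be illustrated with a minimal sketch (hypothetical names; the real tool registry is not shown here): a single domain-level entry point dispatches to many fine-grained tools, so the model sees 14 callables instead of 649.

```python
# Minimal sketch of the meta-tool idea (hypothetical tool names).
# Instead of exposing every fine-grained tool to the model,
# one domain-level meta-tool accepts an action string and dispatches.

def forehand_stats(player):          # one of many fine-grained tools
    return {"player": player, "consistency": 0.87}

def serve_speed(player):
    return {"player": player, "avg_kmh": 182}

TENNIS_ACTIONS = {
    "forehand_stats": forehand_stats,
    "serve_speed": serve_speed,
}

def tennis_agent(action, **kwargs):
    """Meta-tool: the only tennis callable the LLM sees."""
    try:
        return TENNIS_ACTIONS[action](**kwargs)
    except KeyError:
        return {"error": f"unknown action {action!r}",
                "available": sorted(TENNIS_ACTIONS)}

print(tennis_agent("forehand_stats", player="alice"))
```

The error branch matters: listing the available actions lets a local model self-correct on the next turn instead of hallucinating a tool name.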
Sprint 4 Baseline Results
Per-Domain Results
| Domain | Queries | Correct | Accuracy | Status |
|---|---|---|---|---|
| TennisAgent | 15 | 15 | 100% | Perfect |
| YieldModel | 10 | 10 | 100% | Perfect |
| QCiAgent | 10 | 10 | 100% | Perfect |
| PPR_Agent | 8 | 8 | 100% | Perfect |
| Portfolio | 12 | 11 | 91.7% | High |
| Optiver | 10 | 8 | 80% | Good |
| ParableAgent | 10 | 8 | 80% | Good |
| AI_WQ | 8 | 6 | 75% | Moderate |
| Stan | 10 | 4 | 40% | Needs work |
| WQ | 10 | 3 | 25% | Needs work |
LLM Provider Configuration
```python
# benchmark/config.py - Provider Configuration
PROVIDERS = {
    "ollama": {
        "base_url": "http://localhost:11434",
        "models": ["qwen2.5:14b", "llama3.1:8b", "mistral:7b"],
        "mode": "meta",  # Recommended for local models
    },
    "anthropic": {
        "models": ["claude-3-5-sonnet-20241022", "claude-3-haiku"],
        "mode": "direct",  # Can handle 649+ tools
    },
    "openai": {
        "models": ["gpt-4o", "gpt-4o-mini"],
        "mode": "direct",
    },
    "deepseek": {
        "models": ["deepseek-chat", "deepseek-coder"],
        "mode": "meta",
    },
    "google": {
        "models": ["gemini-1.5-pro", "gemini-1.5-flash"],
        "mode": "direct",
    },
}
```
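A sketch of how such a config might be consumed (the `resolve_mode` helper is an assumption, not code from the repo): each provider carries a recommended mode, with the CLI `--mode` flag acting as an explicit override.

```python
# Hypothetical helper: pick the tool-exposure mode for a provider.
PROVIDERS = {
    "ollama": {"models": ["qwen2.5:14b"], "mode": "meta"},
    "anthropic": {"models": ["claude-3-5-sonnet-20241022"], "mode": "direct"},
}

def resolve_mode(provider, override=None):
    """Return the mode for a provider, honoring a --mode override
    and otherwise falling back to the provider's recommended mode."""
    if override in ("meta", "direct"):
        return override
    return PROVIDERS[provider]["mode"]

print(resolve_mode("ollama"))            # meta (recommended default)
print(resolve_mode("ollama", "direct"))  # direct (explicit override)
```

Keeping the recommendation in config rather than hard-coding it lets the hypothesis matrix be re-run in both modes for every provider.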
Running Benchmarks
```bash
# CLI Benchmark Commands

# Meta-tool mode with Ollama
python benchmark.py queries/cross_domain.csv -p ollama --mode meta

# Direct mode with Anthropic
python benchmark.py queries/cross_domain.csv -p anthropic --mode direct

# With LLM routing fallback
python benchmark.py queries/cross_domain.csv -p ollama --mode meta --llm-routing

# Analyze results
python analysis.py results/ollama_meta_20250115.csv
```
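The analysis step boils down to grouping per-query rows by domain and computing accuracy. A minimal stand-in for what `analysis.py` does (the CSV columns `domain,correct` are an assumption about the results format):

```python
import csv
import io
from collections import defaultdict

# Hypothetical results CSV: one row per benchmark query,
# with correct = 1 if the query was routed and answered correctly.
raw = """domain,correct
TennisAgent,1
TennisAgent,1
WQ,0
WQ,1
"""

tally = defaultdict(lambda: [0, 0])   # domain -> [correct, total]
for row in csv.DictReader(io.StringIO(raw)):
    tally[row["domain"]][0] += int(row["correct"])
    tally[row["domain"]][1] += 1

for domain, (ok, total) in sorted(tally.items()):
    print(f"{domain}: {ok}/{total} = {ok / total:.0%}")
```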
Query Categories
| Category | Queries | Example |
|---|---|---|
| Clear Domain | 60 | "analyze my forehand consistency" |
| Ambiguous | 25 | "show me my performance trends" |
| Cross-Domain | 18 | "compare my tennis and investment returns" |
Hypothesis Matrix
Comparing meta-tool vs direct mode across providers.
| Provider | Meta-Tool | Direct | Winner |
|---|---|---|---|
| Ollama (qwen2.5:14b) | 78% | 45% | Meta-Tool |
| Anthropic (Claude 3.5) | 82% | 89% | Direct |
| OpenAI (GPT-4o) | 80% | 86% | Direct |
| DeepSeek | 75% | 52% | Meta-Tool |
Hypothesis confirmed: local LLMs benefit substantially from meta-tool abstraction (+33 pts for Ollama, +23 pts for DeepSeek), while cloud LLMs handle direct mode well and score slightly higher with it.
Failure Analysis
Common Routing Failures
- WQ vs Stan overlap: Statistical methods appear in both
- AI_WQ vs generic ML: "deep learning" too broad
- Portfolio vs Optiver: Finance terms overlap
Improvement Strategies
- Add negative patterns (what domain is NOT)
- Increase LLM routing threshold to 0.6
- Add domain-specific keywords to patterns
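The first two strategies can be sketched together (the pattern structure and scoring are assumptions for illustration; the real router's patterns are not shown here): each domain keeps positive and negative keyword sets, and only a score at or above the confidence threshold avoids the LLM-routing fallback.

```python
# Sketch of keyword routing with negative patterns and a
# confidence threshold (hypothetical patterns and scoring).
DOMAIN_PATTERNS = {
    "WQ":   {"positive": {"alpha", "worldquant", "simulate"},
             "negative": {"bayesian", "mcmc"}},      # what WQ is NOT
    "Stan": {"positive": {"bayesian", "mcmc", "posterior"},
             "negative": {"alpha"}},
}
LLM_ROUTING_THRESHOLD = 0.6  # below this, fall back to LLM routing

def route(query):
    words = set(query.lower().split())
    scores = {}
    for domain, pats in DOMAIN_PATTERNS.items():
        hits = len(words & pats["positive"]) - len(words & pats["negative"])
        scores[domain] = max(hits, 0) / max(len(pats["positive"]), 1)
    best = max(scores, key=scores.get)
    if scores[best] < LLM_ROUTING_THRESHOLD:
        return ("llm_fallback", scores[best])
    return (best, scores[best])

print(route("run a bayesian mcmc posterior check"))
```

Negative patterns directly target the WQ-vs-Stan overlap: a query mentioning "bayesian" is penalized for WQ even if it also matches a shared statistical keyword.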