The Core Hypothesis
"Local LLMs perform better with abstracted meta-tools (14 domain tools) than with direct access to 649+ individual tools."
This hypothesis is tested through systematic benchmarking across:
- 5 LLM providers: Ollama, Anthropic, OpenAI, DeepSeek, Google
- 2 modes: Meta-tool (14 tools) vs Direct (649+ tools)
- 150+ queries: Cross-domain validation set
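The meta-tool abstraction can be illustrated with a minimal sketch (hypothetical names; the real tool registry is not shown here): a single domain-level entry point dispatches to many fine-grained tools, so the model sees 14 callables instead of 649.

```python
# Minimal sketch of the meta-tool idea (hypothetical tool names).
# Instead of exposing every fine-grained tool to the model,
# one domain-level meta-tool accepts an action string and dispatches.

def forehand_stats(player):          # one of many fine-grained tools
    return {"player": player, "consistency": 0.87}

def serve_speed(player):
    return {"player": player, "avg_kmh": 182}

TENNIS_ACTIONS = {
    "forehand_stats": forehand_stats,
    "serve_speed": serve_speed,
}

def tennis_agent(action, **kwargs):
    """Meta-tool: the only tennis callable the LLM sees."""
    try:
        return TENNIS_ACTIONS[action](**kwargs)
    except KeyError:
        return {"error": f"unknown action {action!r}",
                "available": sorted(TENNIS_ACTIONS)}

print(tennis_agent("forehand_stats", player="alice"))
```

The error branch matters: listing the available actions lets a local model self-correct on the next turn instead of hallucinating a tool name.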
Sprint 4 Baseline Results
Per-Domain Results
| Domain | Queries | Correct | Accuracy | Status |
|---|---|---|---|---|
| TennisAgent | 15 | 15 | 100% | Perfect |
| YieldModel | 10 | 10 | 100% | Perfect |
| QCiAgent | 10 | 10 | 100% | Perfect |
| PPR_Agent | 8 | 8 | 100% | Perfect |
| Portfolio | 12 | 11 | 91.7% | High |
| Optiver | 10 | 8 | 80% | Good |
| ParableAgent | 10 | 8 | 80% | Good |
| AI_WQ | 8 | 6 | 75% | Moderate |
| Stan | 10 | 4 | 40% | Needs work |
| WQ | 10 | 3 | 25% | Needs work |
LLM Provider Configuration
```python
# benchmark/config.py - Provider Configuration
PROVIDERS = {
    "ollama": {
        "base_url": "http://localhost:11434",
        "models": ["qwen2.5:14b", "llama3.1:8b", "mistral:7b"],
        "mode": "meta",  # Recommended for local models
    },
    "anthropic": {
        "models": ["claude-3-5-sonnet-20241022", "claude-3-haiku"],
        "mode": "direct",  # Can handle 649+ tools
    },
    "openai": {
        "models": ["gpt-4o", "gpt-4o-mini"],
        "mode": "direct",
    },
    "deepseek": {
        "models": ["deepseek-chat", "deepseek-coder"],
        "mode": "meta",
    },
    "google": {
        "models": ["gemini-1.5-pro", "gemini-1.5-flash"],
        "mode": "direct",
    },
}
```
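A sketch of how such a config might be consumed (the `resolve_mode` helper is an assumption, not code from the repo): each provider carries a recommended mode, with the CLI `--mode` flag acting as an explicit override.

```python
# Hypothetical helper: pick the tool-exposure mode for a provider.
PROVIDERS = {
    "ollama": {"models": ["qwen2.5:14b"], "mode": "meta"},
    "anthropic": {"models": ["claude-3-5-sonnet-20241022"], "mode": "direct"},
}

def resolve_mode(provider, override=None):
    """Return the mode for a provider, honoring a --mode override
    and otherwise falling back to the provider's recommended mode."""
    if override in ("meta", "direct"):
        return override
    return PROVIDERS[provider]["mode"]

print(resolve_mode("ollama"))            # meta (recommended default)
print(resolve_mode("ollama", "direct"))  # direct (explicit override)
```

Keeping the recommendation in config rather than hard-coding it lets the hypothesis matrix be re-run in both modes for every provider.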
Running Benchmarks
```bash
# CLI Benchmark Commands

# Meta-tool mode with Ollama
python benchmark.py queries/cross_domain.csv -p ollama --mode meta

# Direct mode with Anthropic
python benchmark.py queries/cross_domain.csv -p anthropic --mode direct

# With LLM routing fallback
python benchmark.py queries/cross_domain.csv -p ollama --mode meta --llm-routing

# Analyze results
python analysis.py results/ollama_meta_20250115.csv
```
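The analysis step boils down to grouping per-query rows by domain and computing accuracy. A minimal stand-in for what `analysis.py` does (the CSV columns `domain,correct` are an assumption about the results format):

```python
import csv
import io
from collections import defaultdict

# Hypothetical results CSV: one row per benchmark query,
# with correct = 1 if the query was routed and answered correctly.
raw = """domain,correct
TennisAgent,1
TennisAgent,1
WQ,0
WQ,1
"""

tally = defaultdict(lambda: [0, 0])   # domain -> [correct, total]
for row in csv.DictReader(io.StringIO(raw)):
    tally[row["domain"]][0] += int(row["correct"])
    tally[row["domain"]][1] += 1

for domain, (ok, total) in sorted(tally.items()):
    print(f"{domain}: {ok}/{total} = {ok / total:.0%}")
```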
Query Categories
| Category | Queries | Example |
|---|---|---|
| Clear Domain | 60 | "analyze my forehand consistency" |
| Ambiguous | 25 | "show me my performance trends" |
| Cross-Domain | 18 | "compare my tennis and investment returns" |
Hypothesis Matrix
Comparing meta-tool vs direct mode across providers.
| Provider | Meta-Tool | Direct | Winner |
|---|---|---|---|
| Ollama (qwen2.5:14b) | 78% | 45% | Meta-Tool |
| Anthropic (Claude 3.5) | 82% | 89% | Direct |
| OpenAI (GPT-4o) | 80% | 86% | Direct |
| DeepSeek | 75% | 52% | Meta-Tool |
Hypothesis confirmed: local LLMs benefit substantially from meta-tool abstraction (+33 pts for Ollama, +23 pts for DeepSeek), while cloud LLMs handle direct mode well and score slightly higher with it.
Failure Analysis
Common Routing Failures
- WQ vs Stan overlap: Statistical methods appear in both
- AI_WQ vs generic ML: "deep learning" too broad
- Portfolio vs Optiver: Finance terms overlap
Improvement Strategies
- Add negative patterns (what domain is NOT)
- Increase LLM routing threshold to 0.6
- Add domain-specific keywords to patterns
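The first two strategies can be sketched together (the pattern structure and scoring are assumptions for illustration; the real router's patterns are not shown here): each domain keeps positive and negative keyword sets, and only a score at or above the confidence threshold avoids the LLM-routing fallback.

```python
# Sketch of keyword routing with negative patterns and a
# confidence threshold (hypothetical patterns and scoring).
DOMAIN_PATTERNS = {
    "WQ":   {"positive": {"alpha", "worldquant", "simulate"},
             "negative": {"bayesian", "mcmc"}},      # what WQ is NOT
    "Stan": {"positive": {"bayesian", "mcmc", "posterior"},
             "negative": {"alpha"}},
}
LLM_ROUTING_THRESHOLD = 0.6  # below this, fall back to LLM routing

def route(query):
    words = set(query.lower().split())
    scores = {}
    for domain, pats in DOMAIN_PATTERNS.items():
        hits = len(words & pats["positive"]) - len(words & pats["negative"])
        scores[domain] = max(hits, 0) / max(len(pats["positive"]), 1)
    best = max(scores, key=scores.get)
    if scores[best] < LLM_ROUTING_THRESHOLD:
        return ("llm_fallback", scores[best])
    return (best, scores[best])

print(route("run a bayesian mcmc posterior check"))
```

Negative patterns directly target the WQ-vs-Stan overlap: a query mentioning "bayesian" is penalized for WQ even if it also matches a shared statistical keyword.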