Summary
The Agent-in-the-Loop (AIL) testing framework has been implemented in PR #63, but test pass rates don't yet meet the defined targets. This issue tracks the remaining work to improve agent performance.
Current vs Target Pass Rates
| Tier | Current | Target | Gap |
| --- | --- | --- | --- |
| Tier 1 | 67% (2/3) | >90% | -23% |
| Tier 2 | 67% (2/3) | >75% | -8% |
| Tier 3 | 100% (3/3) | >60% | ✅ Met |
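For reference, the gap figures above can be reproduced with a small helper. This is only a sketch: the `TierResult` shape and the `gapToTarget` name are hypothetical and not part of the AIL framework, and it assumes targets are strict thresholds (">90%" means the rounded rate must exceed 90).

```typescript
// Hypothetical record for one tier; field names are illustrative only.
interface TierResult {
  tier: string;
  passed: number;
  total: number;
  targetPct: number; // strict threshold, e.g. 90 for ">90%"
}

// Computes the rounded pass rate, the gap in percentage points, and
// whether the strict target is met.
function gapToTarget(r: TierResult): { currentPct: number; gapPct: number; met: boolean } {
  if (r.total <= 0) {
    throw new Error("total must be positive");
  }
  const currentPct = Math.round((r.passed / r.total) * 100);
  return {
    currentPct,
    gapPct: currentPct - r.targetPct,
    met: currentPct > r.targetPct,
  };
}
```

With the numbers from the table, `gapToTarget({ tier: "Tier 1", passed: 2, total: 3, targetPct: 90 })` gives a current rate of 67 and a gap of -23.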
Identified Issues
1. Search Ranking for Specific Content
Problem: The Art of War terrain/ground content exists in the database but hybrid search doesn't rank it highly enough for the agent to find it consistently.
Evidence: Direct database queries show 9+ chunks with terrain/ground content, but agent searches fail to surface them.
Potential Solutions:
- Investigate vector embedding quality for this content
- Review BM25 tokenization for chapter titles
- Consider boosting exact phrase matches
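As a starting point for the exact-phrase idea, here is a minimal sketch of a re-ranking step applied after hybrid search. The `SearchHit` shape, the `boostExactPhrase` name, and the default bonus of 0.2 are all assumptions for illustration; the actual types and scoring in `conceptual-hybrid-search-service.ts` may look different.

```typescript
// Hypothetical shape of a hybrid search hit; the real service's types may differ.
interface SearchHit {
  id: string;
  text: string;
  score: number; // combined vector + BM25 score, higher is better
}

// Adds a fixed bonus to hits whose text contains the phrase verbatim
// (case-insensitive), then sorts by descending score.
// The input array and its objects are not mutated.
function boostExactPhrase(hits: SearchHit[], phrase: string, bonus: number = 0.2): SearchHit[] {
  const needle = phrase.trim().toLowerCase();
  if (needle.length === 0) {
    return [...hits];
  }
  return hits
    .map((hit) => ({
      ...hit,
      score: hit.text.toLowerCase().includes(needle) ? hit.score + bonus : hit.score,
    }))
    .sort((a, b) => b.score - a.score);
}
```

A fixed additive bonus is the simplest option; whether it should instead scale with the base score is something to check against the terrain/ground queries that currently fail.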
2. LLM Non-Determinism in Aggregate Tests
Problem: Aggregate tests re-run all scenarios independently, and LLM variance causes different results between runs, making pass rates inconsistent.
Potential Solutions:
- Cache individual test results for aggregate calculations
- Lower the sampling temperature further (currently 0.1)
- Run multiple iterations and use majority vote
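The caching and majority-vote ideas could be combined roughly as follows: each scenario is run a few times, the outcomes are recorded, and the aggregate is computed from those records instead of re-invoking the agent. The `ScenarioRuns` type and both function names are hypothetical, and ties are treated as failures here as an assumption.

```typescript
// Hypothetical record of repeated runs for one scenario.
interface ScenarioRuns {
  name: string;
  runs: boolean[]; // true = that run passed
}

// True when strictly more than half of the runs passed (a tie counts as a fail).
function majorityPassed(runs: boolean[]): boolean {
  if (runs.length === 0) {
    throw new Error("at least one run is required");
  }
  const passes = runs.filter((r) => r).length;
  return passes * 2 > runs.length;
}

// Aggregate pass rate from recorded results, so the aggregate test
// reads cached outcomes rather than re-running the LLM.
function aggregatePassRate(scenarios: ScenarioRuns[]): number {
  if (scenarios.length === 0) {
    return 0;
  }
  const passed = scenarios.filter((s) => majorityPassed(s.runs)).length;
  return passed / scenarios.length;
}
```

Using an odd number of iterations per scenario avoids ties entirely.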
3. Residual Search Narration
Problem: Despite explicit rules, the agent occasionally outputs search narration ("Let me try searching...") instead of synthesizing answers.
Evidence: Art of War formations test answer: 'Let me try searching for "six kinds" or "nine kinds" more directly:...'
Potential Solutions:
- Strengthen agent-quick-rules.md prohibitions
- Add post-processing to detect/reject narration
- Consider fine-tuned model or few-shot examples
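For the post-processing option, a narration check might look something like the sketch below. The pattern list and the `looksLikeSearchNarration` name are illustrative assumptions; a real list would need tuning against logged agent answers to keep false positives down.

```typescript
// Illustrative patterns only: answers that open by announcing a search
// rather than stating a result.
const NARRATION_PATTERNS: RegExp[] = [
  /^\s*let me (try|search|look)\b/i,
  /^\s*i(?:'ll| will) (try|search|look)\b/i,
];

// Returns true when the answer appears to narrate a search step
// instead of synthesizing a response.
function looksLikeSearchNarration(answer: string): boolean {
  return NARRATION_PATTERNS.some((pattern) => pattern.test(answer));
}
```

The failing formations answer quoted above matches the first pattern, so a test harness could reject it and ask the agent to retry.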
4. Model Cost vs Performance Tradeoff
Observation: Claude Sonnet 4 performs better but costs more. Claude Haiku 4.5 is more cost-effective but less reliable at following complex instructions.
Potential Solutions:
- Use Sonnet for complex Tier 3 tasks, Haiku for simple Tier 1
- Investigate other models (GPT-4o, Gemini)
- Document cost/performance tradeoffs for users
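The tier-based split could be expressed as a small routing function. This is a sketch under assumptions: `"sonnet"` and `"haiku"` are placeholder labels rather than real API model identifiers, the function is not taken from `src/__tests__/ail/config.ts`, and sending Tier 2 to Haiku by default is a guess since the proposal only covers Tier 1 and Tier 3.

```typescript
type Tier = 1 | 2 | 3;

// Placeholder labels, not real API model IDs.
type ModelChoice = "sonnet" | "haiku";

// Routes tiers at or above `sonnetFromTier` to the stronger model.
// The default of 3 sends only Tier 3 to Sonnet (an assumption for Tier 2).
function modelForTier(tier: Tier, sonnetFromTier: Tier = 3): ModelChoice {
  return tier >= sonnetFromTier ? "sonnet" : "haiku";
}
```

Making the threshold a parameter keeps it easy to try Sonnet on Tier 2 as well, for example `modelForTier(2, 2)`.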
Related
Files to Investigate
src/infrastructure/search/conceptual-hybrid-search-service.ts - Hybrid search ranking
prompts/agent-quick-rules.md - Agent behavior rules
src/__tests__/ail/config.ts - Model configuration
Success Criteria