A RAG-based, AI-powered question-answering agent that reads Word documents (.docx) and answers questions from their content, with built-in hallucination detection and faithfulness monitoring.
- 📄 Document Ingestion: Load and process multiple Word documents
- 🔍 Semantic Search: Find relevant content using vector embeddings
- 🤖 Flexible LLM Backend: Use OpenAI API (cloud) or local models (llama-cpp)
- ✅ Faithfulness Monitoring: Detect and flag potential hallucinations
- 📊 Citation Tracking: Answers include source citations
- 🌐 API Service: REST API for integration with other agents
# 1. Install dependencies
pip install -r requirements.txt
# 2. Run with OpenAI (cloud)
export OPENAI_API_KEY=sk-your-key-here
python main.py --cloud --docs ./your_documents/
# OR run with local LLM (no API key needed)
python main.py --local --docs ./your_documents/

- Python 3.9 or higher
- 8GB+ RAM (16GB+ recommended for local LLM)
- NVIDIA GPU with CUDA (optional, for faster local inference)
python -m venv venv
# Linux/Mac
source venv/bin/activate
# Windows
venv\Scripts\activate

pip install -r requirements.txt

# For NVIDIA GPUs with CUDA 12.1
pip uninstall llama-cpp-python -y
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# For CUDA 11.8
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu118

Ask questions about your documents in an interactive session:
# Using OpenAI (recommended for best quality)
python main.py --cloud --docs ./documents/
# Using local LLM on CPU
python main.py --local --docs ./documents/
# Using local LLM on GPU (faster)
python main.py --local --gpu-layers -1 --docs ./documents/

| Command | Description |
|---|---|
| `<question>` | Ask a question about the documents |
| `debug <question>` | Debug retrieval for a question |
| `search <text>` | Search for text in all chunks |
| `browse` | Interactive chunk browser |
| `status` | Show agent status |
| `add <path>` | Add more documents |
| `validate` | Run faithfulness validation tests |
| `quit` | Exit the program |

$ python main.py --cloud --docs ./documents/
📁 Found 5 .docx files in: ./documents/
📚 Ingesting 5 documents...
✓ Processed: report.docx (45 chunks)
✓ Processed: manual.docx (120 chunks)
...
✓ Added 342 chunks to vector store
Question: What was the revenue in 2024?
📝 Answer:
According to the annual report, the company revenue in 2024 was $5.2 million [Source 1],
representing a 15% increase from the previous year [Source 1].
📊 Faithfulness: 0.92
Citation: 0.85
NLI: 1.00
Claims: 0.90
Confidence: 0.95
⏱️ Time: 1250ms
Run as an API service for other agents to consume:
# Start server
python main.py --mode server --cloud --port 8000
# Or with local LLM
python main.py --mode server --local --gpu-layers -1 --port 8000

The API will be available at http://localhost:8000.
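
A calling agent usually needs to know when the server has finished starting, especially in local mode where the model may still be loading. The sketch below polls the `/health` endpoint documented in this README until it reports `"healthy"`. The function names, retry count, and delay are illustrative assumptions, not part of the project.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(base_url, timeout=5.0):
    """GET {base_url}/health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(base_url="http://localhost:8000", attempts=30,
                       delay_s=2.0, fetch=fetch_health, sleep=time.sleep):
    """Poll /health until it reports "healthy"; return True on success.

    `fetch` and `sleep` are injectable so the loop can be tested
    without a running server (hypothetical helper, not project code).
    """
    for attempt in range(attempts):
        try:
            if fetch(base_url).get("status") == "healthy":
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not accepting connections yet, or body was not JSON
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Calling `wait_until_healthy()` before the first `/query` request avoids connection errors during startup; a `False` return means the server never became healthy within the retry budget.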
| Argument | Description | Default |
|---|---|---|
| `--mode` | Run mode: `interactive` or `server` | `interactive` |
| `--cloud` | Use OpenAI API | - |
| `--local` | Use local LLM (llama-cpp) | - |
| `--docs` | Path to documents (directory, pattern, or file) | - |
| `--gpu-layers` | GPU layers for local LLM (0=CPU, -1=all) | `0` |
| `--model-repo` | HuggingFace repo for local model | See `config.py` |
| `--model-file` | GGUF filename for local model | See `config.py` |
| `--port` | Server port | `8000` |
| `--host` | Server host | `0.0.0.0` |

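
To make the argument table concrete, here is a minimal `argparse` sketch that mirrors the flags and defaults listed above. It is not the project's actual `main.py`; in particular, treating `--cloud` and `--local` as mutually exclusive is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Return a parser mirroring the README's argument table (illustrative)."""
    parser = argparse.ArgumentParser(description="Document Q&A agent (CLI sketch)")
    parser.add_argument("--mode", choices=["interactive", "server"],
                        default="interactive", help="Run mode")
    # Assumption: only one backend may be selected per run.
    backend = parser.add_mutually_exclusive_group()
    backend.add_argument("--cloud", action="store_true", help="Use OpenAI API")
    backend.add_argument("--local", action="store_true", help="Use local LLM (llama-cpp)")
    parser.add_argument("--docs", default=None,
                        help="Path to documents (directory, pattern, or file)")
    parser.add_argument("--gpu-layers", type=int, default=0,
                        help="GPU layers for local LLM (0=CPU, -1=all)")
    parser.add_argument("--model-repo", default=None, help="HuggingFace repo for local model")
    parser.add_argument("--model-file", default=None, help="GGUF filename for local model")
    parser.add_argument("--port", type=int, default=8000, help="Server port")
    parser.add_argument("--host", default="0.0.0.0", help="Server host")
    return parser
```

Parsing `["--local", "--gpu-layers", "-1"]` with this parser yields `gpu_layers == -1`, which matches the "all layers on GPU" convention in the table.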
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key (required for cloud mode) |
| `OPENAI_MODEL` | OpenAI model name (default: `gpt-5.2`) |
| `LLM_BACKEND` | Default backend: `openai` or `local` |
| `LOCAL_MODEL_REPO` | HuggingFace repo ID for local model |
| `LOCAL_MODEL_FILE` | GGUF filename |
| `GPU_LAYERS` | Default GPU layers |
| `EMBEDDING_MODEL` | Sentence transformer model for embeddings |

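
The variables above could be read into a single settings object along the following lines. This is a sketch, not the project's `config.py`: the `Settings` class and `load_settings` function are hypothetical, and the `GPU_LAYERS` default of `0` is assumed to match the `--gpu-layers` default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    openai_model: str
    llm_backend: Optional[str]
    local_model_repo: Optional[str]
    local_model_file: Optional[str]
    gpu_layers: int
    embedding_model: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    backend = env.get("LLM_BACKEND")
    if backend not in (None, "openai", "local"):
        raise ValueError(f"LLM_BACKEND must be 'openai' or 'local', got {backend!r}")
    if backend == "openai" and not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is required for cloud mode")
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY"),
        openai_model=env.get("OPENAI_MODEL", "gpt-5.2"),  # default from the table
        llm_backend=backend,
        local_model_repo=env.get("LOCAL_MODEL_REPO"),
        local_model_file=env.get("LOCAL_MODEL_FILE"),
        gpu_layers=int(env.get("GPU_LAYERS", "0")),  # assumed default: CPU only
        embedding_model=env.get("EMBEDDING_MODEL"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, and failing early on a missing API key gives a clearer error than a rejected request later.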
Health check endpoint.
curl http://localhost:8000/health

Response:
{"status": "healthy", "llm": "Cloud (OpenAI: gpt-5.2)"}

Get agent status and statistics.
curl http://localhost:8000/status

Response:
{
  "status": "ready",
  "llm_backend": "cloud",
  "llm_description": "Cloud (OpenAI: gpt-5.2)",
  "ingested_documents": 5,
  "total_chunks": 342
}

Ingest documents into the agent.
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"file_paths": ["/path/to/doc1.docx", "/path/to/doc2.docx"]}'

Response:
{
  "success": true,
  "documents_ingested": 2,
  "details": [
    {"filename": "doc1.docx", "chunks": 45},
    {"filename": "doc2.docx", "chunks": 67}
  ]
}

Query the documents.
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What was the revenue in 2024?", "require_high_confidence": true}'

Response:
{
  "question": "What was the revenue in 2024?",
  "answer": "The company revenue in 2024 was $5.2 million [Source 1]...",
  "sources": [...],
  "faithfulness": {
    "overall_score": 0.92,
    "citation_coverage": 0.85,
    "nli_entailment_score": 1.0,
    "claim_support_ratio": 0.9,
    "confidence_score": 0.95,
    "warnings": []
  },
  "abstained": false,
  "processing_time_ms": 1250.5
}

Simple query returning just the answer.
curl -X POST "http://localhost:8000/query/simple?question=What%20is%20the%20revenue"

Response:
{
  "answer": "The company revenue was $5.2 million.",
  "confidence": 0.92,
  "is_reliable": true,
  "source_count": 3
}

import requests


class DocumentQAClient:
    """Minimal client for the Document Q&A REST API."""

    def __init__(self, base_url="http://localhost:8000", timeout=60):
        self.base_url = base_url
        self.timeout = timeout  # seconds; answer generation can be slow

    def query(self, question, require_high_confidence=False):
        # POST the question to /query and return the parsed JSON response.
        response = requests.post(
            f"{self.base_url}/query",
            json={
                "question": question,
                "require_high_confidence": require_high_confidence,
            },
            timeout=self.timeout,
        )
        response.raise_for_status()  # surface HTTP errors instead of parsing an error body
        return response.json()


# Usage
client = DocumentQAClient()
result = client.query("What was the revenue in 2024?")
if result["faithfulness"]["overall_score"] >= 0.7:
    print("Answer:", result["answer"])
else:
    print("Low confidence answer - verify manually")

The agent monitors answer faithfulness using multiple strategies:
| Metric | Description | Weight |
|---|---|---|
| Citation Coverage | Are claims properly cited with [Source N]? | 20% |
| NLI Entailment | Does the source text logically support the answer? | 30% |
| Claim Support | Are individual claims verified against sources? | 30% |
| Confidence | Model's self-assessed confidence | 20% |
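
One plausible reading of the weights above is a plain weighted sum of the four metrics. The sketch below implements that reading using the field names from the `/query` response; the README does not state the exact formula, so treat this as an approximation. The real aggregation lives in `core/faithfulness.py` and may apply rounding or penalties.

```python
# Weights taken from the table above; keys follow the /query response fields.
WEIGHTS = {
    "citation_coverage": 0.20,
    "nli_entailment_score": 0.30,
    "claim_support_ratio": 0.30,
    "confidence_score": 0.20,
}


def overall_faithfulness(metrics):
    """Combine the four metrics (each in [0, 1]) into one weighted score.

    Assumption: the overall score is a simple weighted sum.
    """
    missing = WEIGHTS.keys() - metrics.keys()
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
```

For example, metric values of 0.5, 1.0, 0.8, and 0.9 (in table order) combine to 0.1 + 0.3 + 0.24 + 0.18 = 0.82.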
| Score | Interpretation |
|---|---|
| ≥ 0.8 | ✅ High confidence - answer is reliable |
| 0.6 - 0.8 | |
| < 0.6 | ❌ Low confidence - likely issues |
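
A client can map a score onto these bands with a small helper. The band names `"high"`, `"medium"`, and `"low"` are labels chosen for this sketch (the table gives no wording for the middle band), and the function name is hypothetical. The thresholds come from the table, with 0.8 counted as high because that row reads "≥ 0.8".

```python
def interpret_score(score):
    """Map an overall faithfulness score in [0, 1] to a band label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.8:
        return "high"    # answer is reliable
    if score >= 0.6:
        return "medium"  # label assumed for the 0.6 - 0.8 band
    return "low"         # likely issues
```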
The agent may generate warnings such as:
- "Low citation coverage - many claims are not cited"
- "NLI check found potentially unsupported statements"
- "Some claims could not be verified against sources"
- "Low model confidence in the answer"
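
On the consuming side, the warnings, the `abstained` flag, and the overall score from a `/query` response can be folded into one decision. The sketch below is one possible policy rather than project code: `summarize_result` and its status labels are hypothetical, and the 0.7 cutoff is borrowed from the client example in this README.

```python
def summarize_result(result, min_score=0.7):
    """Reduce a /query response dict to a status, score, and warning list."""
    faith = result.get("faithfulness", {})
    score = faith.get("overall_score", 0.0)
    warnings = list(faith.get("warnings", []))
    if result.get("abstained"):
        status = "abstained"      # the agent declined to answer
    elif score < min_score or warnings:
        status = "needs_review"   # low score or at least one warning
    else:
        status = "ok"
    return {"status": status, "score": score, "warnings": warnings}
```

Treating any warning as grounds for review is deliberately conservative; a caller that tolerates more risk could check only the score.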
document_qa_agent/
├── main.py # Entry point
├── config.py # Configuration
├── requirements.txt # Dependencies
├── models/
│ └── schemas.py # Pydantic data models
├── core/
│ ├── document_processor.py # Document loading & chunking
│ ├── vector_store.py # Embeddings & search
│ ├── llm_client.py # LLM abstraction (OpenAI/local)
│ ├── generator.py # Answer generation
│ ├── faithfulness.py # Hallucination detection
│ └── query_rewriter.py # Query expansion
├── agent/
│ └── qa_agent.py # Main agent orchestration
├── service/
│ └── api.py # FastAPI REST service
└── tests/
└── test_faithfulness.py # Validation tests
# Set the environment variable
export OPENAI_API_KEY=sk-your-key-here
# Or use local mode instead
python main.py --local --docs ./documents/

The embedding model changed. Clear the vector store:
rm -rf ./chroma_db

Reduce GPU layers:
# Try fewer layers
python main.py --local --gpu-layers 20 --docs ./documents/
# Or use CPU only
python main.py --local --gpu-layers 0 --docs ./documents/

This project uses llama-cpp-python, not Ollama. Use the `--local` flag:
python main.py --local --docs ./documents/

Models are cached in ./models/. The first run downloads the model (~4-8GB).
Subsequent runs use the cache.
Qwen3 has a "thinking mode" that can interfere with JSON output. Use Qwen2.5 instead:
python main.py --local \
  --model-repo Qwen/Qwen2.5-14B-Instruct-GGUF \
  --model-file qwen2.5-14b-instruct-q4_k_m.gguf \
  --docs ./documents/

# Run faithfulness validation
python -m tests.test_faithfulness
# Or in interactive mode
python main.py --cloud --docs ./documents/
> validate

- Built with LangChain patterns
- Vector store: ChromaDB
- Embeddings: Sentence Transformers
- Local LLM: llama-cpp-python
---
## `QUICKSTART.md` (Optional - One Page Version)
```markdown
# Quick Start Guide
## 1. Install
```bash
pip install -r requirements.txt
export OPENAI_API_KEY=sk-your-key-here
python main.py --cloud --docs ./your_documents/
# CPU
python main.py --local --docs ./your_documents/
# GPU (faster)
python main.py --local --gpu-layers -1 --docs ./your_documents/

Question: What is the main topic of the documents?
📝 Answer:
The documents primarily discuss...
📊 Faithfulness: 0.89
- Type a question to ask
- `debug <question>` - Debug search
- `search <text>` - Find text in documents
- `status` - Show stats
- `quit` - Exit

python main.py --mode server --cloud --port 8000

Then query:
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What is the revenue?"}'
---
## File: `.env.example`
```bash
# Copy this to .env and fill in your values
# OpenAI API (for cloud mode)
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-5.2
# Local LLM settings (for local mode)
LOCAL_MODEL_REPO=Qwen/Qwen2.5-14B-Instruct-GGUF
LOCAL_MODEL_FILE=qwen2.5-14b-instruct-q4_k_m.gguf
GPU_LAYERS=-1
# Embedding model
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v2-moe