RAG Document AI Agent

A retrieval-augmented generation (RAG) question-answering agent that reads Word documents (.docx) and answers questions grounded in their content, with built-in hallucination detection and faithfulness monitoring.

Features

📄 Document Ingestion: Load and process multiple Word documents
🔍 Semantic Search: Find relevant content using vector embeddings
🤖 Flexible LLM Backend: Use OpenAI API (cloud) or local models (llama-cpp)
✅ Faithfulness Monitoring: Detect and flag potential hallucinations
📊 Citation Tracking: Answers include source citations
🌐 API Service: REST API for integration with other agents

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run with OpenAI (cloud)
export OPENAI_API_KEY=sk-your-key-here
python main.py --cloud --docs ./your_documents/

# OR run with local LLM (no API key needed)
python main.py --local --docs ./your_documents/

Installation

Prerequisites

Python 3.9 or higher
8GB+ RAM (16GB+ recommended for local LLM)
NVIDIA GPU with CUDA (optional, for faster local inference)

Step 1: Create Virtual Environment

python -m venv venv

# Linux/Mac
source venv/bin/activate

# Windows
venv\Scripts\activate

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3 (Optional): Enable GPU Acceleration for Local LLM

# For NVIDIA GPUs with CUDA 12.1
pip uninstall llama-cpp-python -y
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

# For CUDA 11.8
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu118

Usage

Interactive Mode

Ask questions about your documents in an interactive session:

# Using OpenAI (recommended for best quality)
python main.py --cloud --docs ./documents/

# Using local LLM on CPU
python main.py --local --docs ./documents/

# Using local LLM on GPU (faster)
python main.py --local --gpu-layers -1 --docs ./documents/

Interactive Commands

| Command | Description |
| --- | --- |
| `<question>` | Ask a question about the documents |
| `debug <question>` | Debug retrieval for a question |
| `search <text>` | Search for text in all chunks |
| `browse` | Interactive chunk browser |
| `status` | Show agent status |
| `add <path>` | Add more documents |
| `validate` | Run faithfulness validation tests |
| `quit` | Exit the program |

Example Session

$ python main.py --cloud --docs ./documents/

📁 Found 5 .docx files in: ./documents/
📚 Ingesting 5 documents...
✓ Processed: report.docx (45 chunks)
✓ Processed: manual.docx (120 chunks)
...
✓ Added 342 chunks to vector store

Question: What was the revenue in 2024?

📝 Answer:
According to the annual report, the company revenue in 2024 was $5.2 million [Source 1], 
representing a 15% increase from the previous year [Source 1].

📊 Faithfulness: 0.92
   Citation: 0.85
   NLI: 1.00
   Claims: 0.90
   Confidence: 0.95

⏱️ Time: 1250ms

Server Mode

Run as an API service for other agents to consume:

# Start server
python main.py --mode server --cloud --port 8000

# Or with local LLM
python main.py --mode server --local --gpu-layers -1 --port 8000

The API will be available at http://localhost:8000.

Configuration

Command Line Arguments

| Argument | Description | Default |
| --- | --- | --- |
| `--mode` | Run mode: `interactive` or `server` | `interactive` |
| `--cloud` | Use OpenAI API | - |
| `--local` | Use local LLM (llama-cpp) | - |
| `--docs` | Path to documents (directory, pattern, or file) | - |
| `--gpu-layers` | GPU layers for local LLM (0=CPU, -1=all) | `0` |
| `--model-repo` | HuggingFace repo for local model | See config.py |
| `--model-file` | GGUF filename for local model | See config.py |
| `--port` | Server port | `8000` |
| `--host` | Server host | `0.0.0.0` |
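
The flags above could be declared with `argparse` roughly as follows. This is a sketch inferred from the table, not the project's actual `main.py`: the `build_parser` name is hypothetical, and treating `--cloud` and `--local` as mutually exclusive is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="RAG Document AI Agent")
    parser.add_argument("--mode", choices=["interactive", "server"], default="interactive")
    # Assumption: only one backend can be selected per run.
    backend = parser.add_mutually_exclusive_group()
    backend.add_argument("--cloud", action="store_true", help="Use OpenAI API")
    backend.add_argument("--local", action="store_true", help="Use local LLM (llama-cpp)")
    parser.add_argument("--docs", help="Directory, glob pattern, or single file")
    parser.add_argument("--gpu-layers", type=int, default=0, help="0=CPU, -1=all layers")
    parser.add_argument("--model-repo", help="HuggingFace repo for local model")
    parser.add_argument("--model-file", help="GGUF filename for local model")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--host", default="0.0.0.0")
    return parser


args = build_parser().parse_args(["--local", "--gpu-layers", "-1", "--docs", "./documents/"])
print(args.mode, args.local, args.gpu_layers, args.docs)
# interactive True -1 ./documents/
```

Declaring `--gpu-layers` with `type=int` lets `-1` be parsed as a value rather than mistaken for another flag.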

Environment Variables

| Variable | Description |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI API key (required for cloud mode) |
| `OPENAI_MODEL` | OpenAI model name (default: `gpt-5.2`) |
| `LLM_BACKEND` | Default backend: `openai` or `local` |
| `LOCAL_MODEL_REPO` | HuggingFace repo ID for local model |
| `LOCAL_MODEL_FILE` | GGUF filename |
| `GPU_LAYERS` | Default GPU layers |
| `EMBEDDING_MODEL` | Sentence transformer model for embeddings |
API Reference

Endpoints

`GET /health`

Health check endpoint.

curl http://localhost:8000/health

Response:

{"status": "healthy", "llm": "Cloud (OpenAI: gpt-5.2)"}

`GET /status`

Get agent status and statistics.

curl http://localhost:8000/status

Response:

{
  "status": "ready",
  "llm_backend": "cloud",
  "llm_description": "Cloud (OpenAI: gpt-5.2)",
  "ingested_documents": 5,
  "total_chunks": 342
}

`POST /ingest`

Ingest documents into the agent.

curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"file_paths": ["/path/to/doc1.docx", "/path/to/doc2.docx"]}'

Response:

{
  "success": true,
  "documents_ingested": 2,
  "details": [
    {"filename": "doc1.docx", "chunks": 45},
    {"filename": "doc2.docx", "chunks": 67}
  ]
}
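
To build the `file_paths` list for a whole folder, a helper along these lines may be useful. It is a sketch only: `resolve_docx_paths` and `build_ingest_payload` are hypothetical names, and sending absolute paths assumes the server process can read the same filesystem as the caller.

```python
import glob
import json
import os
from typing import List


def resolve_docx_paths(target: str) -> List[str]:
    """Expand a directory, glob pattern, or single file into absolute .docx paths."""
    if os.path.isdir(target):
        matches = glob.glob(os.path.join(target, "*.docx"))
    elif os.path.isfile(target):
        matches = [target]
    else:
        matches = glob.glob(target)
    # Keep only Word documents and return them in a stable order.
    return sorted(os.path.abspath(p) for p in matches if p.lower().endswith(".docx"))


def build_ingest_payload(target: str) -> str:
    """Return a JSON body for POST /ingest using the documented `file_paths` key."""
    return json.dumps({"file_paths": resolve_docx_paths(target)})


print(build_ingest_payload("./documents/"))
```

The resulting string can be passed to `curl -d` or sent as the request body from any HTTP client.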

`POST /query`

Query the documents.

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What was the revenue in 2024?", "require_high_confidence": true}'

Response:

{
  "question": "What was the revenue in 2024?",
  "answer": "The company revenue in 2024 was $5.2 million [Source 1]...",
  "sources": [...],
  "faithfulness": {
    "overall_score": 0.92,
    "citation_coverage": 0.85,
    "nli_entailment_score": 1.0,
    "claim_support_ratio": 0.9,
    "confidence_score": 0.95,
    "warnings": []
  },
  "abstained": false,
  "processing_time_ms": 1250.5
}

`POST /query/simple`

Simple query returning just the answer.

curl -X POST "http://localhost:8000/query/simple?question=What%20is%20the%20revenue"

Response:

{
  "answer": "The company revenue was $5.2 million.",
  "confidence": 0.92,
  "is_reliable": true,
  "source_count": 3
}
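
Because this endpoint takes the question as a query parameter, the text must be URL-encoded. A minimal sketch using only the standard library is shown below; the `simple_query_url` helper is a hypothetical name, not part of the project.

```python
from urllib.parse import quote, urlencode


def simple_query_url(question: str, base_url: str = "http://localhost:8000") -> str:
    """Build the /query/simple URL with the question percent-encoded."""
    # quote_via=quote encodes spaces as %20, matching the curl example above.
    query = urlencode({"question": question}, quote_via=quote)
    return f"{base_url}/query/simple?{query}"


print(simple_query_url("What is the revenue"))
# http://localhost:8000/query/simple?question=What%20is%20the%20revenue
```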

Python Client Example

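import requests


class DocumentQAClient:
    """Minimal client for the agent's REST API."""

    def __init__(self, base_url="http://localhost:8000", timeout=60):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout  # seconds; local models can be slow to answer

    def query(self, question, require_high_confidence=False):
        """POST a question to /query and return the parsed JSON response."""
        response = requests.post(
            f"{self.base_url}/query",
            json={
                "question": question,
                "require_high_confidence": require_high_confidence,
            },
            timeout=self.timeout,
        )
        # Raise on HTTP errors instead of trying to parse an error body
        response.raise_for_status()
        return response.json()


# Usage
client = DocumentQAClient()
result = client.query("What was the revenue in 2024?")

# 0.7 is an example threshold; pick one that suits your use case
if result["faithfulness"]["overall_score"] >= 0.7:
    print("Answer:", result["answer"])
else:
    print("Low confidence answer - verify manually")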
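
A client may also want to look at the `abstained` flag and the `warnings` list returned by `/query`. The sketch below works on the response dictionary alone; `summarize_result` is a hypothetical helper, and reading `abstained` as "the agent declined to answer" is an assumption about that field's meaning.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any], min_score: float = 0.7) -> str:
    """Turn a /query response dict into a one-line verdict for display."""
    if result.get("abstained"):
        # Assumption: abstained=True means no usable answer was produced.
        return "Agent abstained from answering"
    faithfulness = result.get("faithfulness", {})
    score = faithfulness.get("overall_score", 0.0)
    warnings = faithfulness.get("warnings", [])
    if score < min_score:
        return f"Low confidence ({score:.2f}) - verify manually"
    note = f" ({len(warnings)} warning(s))" if warnings else ""
    return f"{result['answer']}{note}"


sample = {
    "answer": "The company revenue in 2024 was $5.2 million [Source 1]...",
    "faithfulness": {"overall_score": 0.92, "warnings": []},
    "abstained": False,
}
print(summarize_result(sample))
```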

Faithfulness Monitoring

The agent monitors answer faithfulness using multiple strategies:

| Metric | Description | Weight |
| --- | --- | --- |
| Citation Coverage | Are claims properly cited with [Source N]? | 20% |
| NLI Entailment | Does the source text logically support the answer? | 30% |
| Claim Support | Are individual claims verified against sources? | 30% |
| Confidence | Model's self-assessed confidence | 20% |
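
One plausible reading of these weights is a plain weighted average of the four component scores, sketched below with the field names from the `/query` response. This is an assumption about how `faithfulness.py` combines them: applied to the figures in the example session it gives 0.93 rather than the reported 0.92, so the real aggregation may round differently or apply extra adjustments.

```python
from typing import Dict

# Weights from the table above, keyed by the /query response field names.
WEIGHTS: Dict[str, float] = {
    "citation_coverage": 0.20,
    "nli_entailment_score": 0.30,
    "claim_support_ratio": 0.30,
    "confidence_score": 0.20,
}


def overall_score(metrics: Dict[str, float]) -> float:
    """Weighted average of the four faithfulness metrics (assumed formula)."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


example = {
    "citation_coverage": 0.85,
    "nli_entailment_score": 1.0,
    "claim_support_ratio": 0.9,
    "confidence_score": 0.95,
}
print(round(overall_score(example), 2))  # 0.93
```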

Interpreting Scores

| Score | Interpretation |
| --- | --- |
| ≥ 0.8 | ✅ High confidence - answer is reliable |
| 0.6 to < 0.8 | ⚠️ Moderate confidence - review recommended |
| < 0.6 | ❌ Low confidence - the answer likely contains unsupported claims |
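
The bands above can be applied in client code with a small helper such as this one. `interpret_score` is a hypothetical name, and placing a score of exactly 0.6 in the moderate band is an assumption, since the table does not say which band owns that boundary.

```python
def interpret_score(score: float) -> str:
    """Map an overall faithfulness score in [0, 1] to a confidence band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.8:
        return "high"
    if score >= 0.6:  # assumed: 0.6 itself counts as moderate
        return "moderate"
    return "low"


for value in (0.92, 0.7, 0.4):
    print(value, interpret_score(value))
```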

Warnings

The agent may generate warnings such as:

"Low citation coverage - many claims are not cited"
"NLI check found potentially unsupported statements"
"Some claims could not be verified against sources"
"Low model confidence in the answer"

Project Structure

document_qa_agent/
├── main.py                 # Entry point
├── config.py               # Configuration
├── requirements.txt        # Dependencies
├── models/
│   └── schemas.py          # Pydantic data models
├── core/
│   ├── document_processor.py  # Document loading & chunking
│   ├── vector_store.py        # Embeddings & search
│   ├── llm_client.py          # LLM abstraction (OpenAI/local)
│   ├── generator.py           # Answer generation
│   ├── faithfulness.py        # Hallucination detection
│   └── query_rewriter.py      # Query expansion
├── agent/
│   └── qa_agent.py         # Main agent orchestration
├── service/
│   └── api.py              # FastAPI REST service
└── tests/
    └── test_faithfulness.py   # Validation tests

Troubleshooting

"OPENAI_API_KEY not set"

# Set the environment variable
export OPENAI_API_KEY=sk-your-key-here

# Or use local mode instead
python main.py --local --docs ./documents/

"Collection expecting embedding with dimension of X, got Y"

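This error appears when the embedding model was changed after the vector store was created, so the stored vectors no longer match the new model's dimension. Delete the vector store and ingest your documents again: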

rm -rf ./chroma_db
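
On Windows, where `rm -rf` is not available in the default shell, the same cleanup can be done from Python. This is a sketch: `clear_vector_store` is a hypothetical helper, and `./chroma_db` is simply the path used in the command above.

```python
import shutil
from pathlib import Path


def clear_vector_store(path: str = "./chroma_db") -> bool:
    """Delete the vector store directory; return True if something was removed."""
    store = Path(path)
    if not store.is_dir():
        return False  # nothing to clear
    shutil.rmtree(store)
    return True


if clear_vector_store():
    print("Vector store removed; re-ingest your documents.")
else:
    print("No vector store found.")
```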

"CUDA out of memory"

Reduce GPU layers:

# Try fewer layers
python main.py --local --gpu-layers 20 --docs ./documents/

# Or use CPU only
python main.py --local --gpu-layers 0 --docs ./documents/

"Cannot connect to Ollama" (if using Ollama)

This project uses llama-cpp-python, not Ollama, so no Ollama server is needed. Use the `--local` flag:

python main.py --local --docs ./documents/

Local model downloads are slow

The first run downloads the model (roughly 4-8 GB) and caches it in ./models/. Subsequent runs load the model from that cache, so only the first start is slow.

JSON parsing errors with Qwen3

Qwen3 has a "thinking mode" that can interfere with JSON output. Use Qwen2.5 instead:

python main.py --local \
  --model-repo Qwen/Qwen2.5-14B-Instruct-GGUF \
  --model-file qwen2.5-14b-instruct-q4_k_m.gguf \
  --docs ./documents/
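
If you prefer to stay on Qwen3, a possible workaround is to strip the reasoning block before parsing. The sketch below assumes the model wraps its reasoning in `<think>...</think>` tags ahead of the JSON payload; `parse_json_reply` is a hypothetical helper and is not part of this project.

```python
import json
import re
from typing import Any

# Matches a reasoning block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_json_reply(raw: str) -> Any:
    """Remove <think> blocks, then parse the first {...} span as JSON."""
    cleaned = _THINK_BLOCK.sub("", raw).strip()
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(cleaned[start:end + 1])


reply = '<think>\nThe user wants {structured} output.\n</think>\n{"confidence": 0.9}'
print(parse_json_reply(reply))  # {'confidence': 0.9}
```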

Running Tests

# Run faithfulness validation
python -m tests.test_faithfulness

# Or in interactive mode
python main.py --cloud --docs ./documents/
> validate

Acknowledgments

Built with LangChain patterns
Vector store: ChromaDB
Embeddings: Sentence Transformers
Local LLM: llama-cpp-python


---

## `QUICKSTART.md` (Optional - One Page Version)

# Quick Start Guide

## 1. Install

```bash
pip install -r requirements.txt
```

## 2. Run

Option A: Cloud (OpenAI) - Best Quality

export OPENAI_API_KEY=sk-your-key-here
python main.py --cloud --docs ./your_documents/

Option B: Local - No API Key Needed

# CPU
python main.py --local --docs ./your_documents/

# GPU (faster)
python main.py --local --gpu-layers -1 --docs ./your_documents/

## 3. Ask Questions

Question: What is the main topic of the documents?

📝 Answer:
The documents primarily discuss...

📊 Faithfulness: 0.89

## Commands

Type a question to ask
debug <question> - Debug search
search <text> - Find text in documents
status - Show stats
quit - Exit

## API Mode

python main.py --mode server --cloud --port 8000

Then query:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the revenue?"}'


---

## File: `.env.example`

```bash
# Copy this to .env and fill in your values

# OpenAI API (for cloud mode)
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-5.2

# Local LLM settings (for local mode)
LOCAL_MODEL_REPO=Qwen/Qwen2.5-14B-Instruct-GGUF
LOCAL_MODEL_FILE=qwen2.5-14b-instruct-q4_k_m.gguf
GPU_LAYERS=-1

# Embedding model
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v2-moe
```
