A local-first Gradio Retrieval-Augmented Generation (RAG) assistant.
## Features

- Upload `.txt`/`.md` documents and build a local index.
- Configure chunking (`chunk_size`, `chunk_overlap`) at index time.
- Ask questions in a chat UI backed by retrieval + LLM generation.
- Every response is expected to include chunk citations like `[<chunk_id>]`.
- Abstention guardrail when evidence is missing or weak: "I don't have enough evidence in the uploaded documents."
- Friendly error message if LM Studio is unavailable.
## Prerequisites

- Python 3.10+
- LM Studio with the `lms` CLI available
- LM Studio local server enabled (OpenAI-compatible API, default port 1234)
## Installation

```shell
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .[dev]
```

## Model setup

Download the models:

```shell
# Embedding model (~639 MB, optimized for retrieval)
lms get jinaai/jina-embeddings-v5-text-small-retrieval-GGUF
# select: Q8_0

# Chat model (~9 GB, good balance of quality and VRAM usage)
lms get bartowski/Qwen2.5-14B-Instruct-GGUF
# select: Q4_K_M
```

Load the models and start the server:

```shell
lms load jina-embeddings-v5-text-small-retrieval
lms load qwen2.5-14b-instruct
lms server start
```

Verify both models are running:

```shell
lms ps
```

Note: the identifiers above (`jina-embeddings-v5-text-small-retrieval` and `qwen2.5-14b-instruct`) are what LM Studio assigns after loading. Use these in your `.env`.
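Before wiring up the app, you can sanity-check that the local server is reachable from Python. This is a sketch using the `requests` package; `lmstudio_available` is a hypothetical helper, not part of the app:

```python
import requests

def lmstudio_available(base_url: str = "http://localhost:1234/v1") -> bool:
    """Return True if the LM Studio server answers its OpenAI-compatible
    /models endpoint, False if it is down or unreachable."""
    try:
        return requests.get(f"{base_url}/models", timeout=2).ok
    except requests.RequestException:
        return False

if not lmstudio_available():
    print("LM Studio is not reachable -- run `lms server start` first.")
```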
## Configuration

- Copy the environment template:

  ```shell
  cp .env.example .env
  ```

- Set environment values in `.env`:

  ```
  LLM_BASE_URL=http://localhost:1234/v1
  LLM_API_KEY=lm-studio
  LLM_MODEL=qwen2.5-14b-instruct
  EMBEDDING_MODEL=all-MiniLM-L6-v2
  VECTOR_STORE_DIR=.rag_store
  ```

Note: `EMBEDDING_MODEL` is used by the local SentenceTransformers embedder (downloaded automatically from Hugging Face). It is independent of the LM Studio embedding model.
## Usage

Start the app:

```shell
python -m ui.app
```

- In the sidebar, set:
  - `chunk_size`
  - `chunk_overlap`
  - optional **Rebuild index** to clear existing vectors
- Upload `.txt` or `.md` files.
- Click **Index documents**.
- Set retrieval controls:
  - `top_k`: number of chunks to retrieve
  - `score_threshold`: minimum top score required to answer
- Ask questions in the chat panel.
- Expand **Retrieved chunks (last turn)** to inspect the evidence used.
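The two retrieval controls interact roughly like this. A minimal sketch using NumPy and cosine similarity; `retrieve` is illustrative, not the app's actual function:

```python
import numpy as np

def retrieve(query_vec, chunk_vecs, chunk_ids, top_k=4, score_threshold=0.3):
    """Rank chunks by cosine similarity against the query embedding.

    Returns the top_k (chunk_id, score) pairs, or an empty list when even
    the best score falls below score_threshold (the abstention case).
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(scores)[::-1][:top_k]
    if scores[order[0]] < score_threshold:
        return []  # signals "not enough evidence"
    return [(chunk_ids[i], float(scores[i])) for i in order]
```

Raising `score_threshold` makes the assistant abstain more often; raising `top_k` feeds more (possibly noisier) context to the LLM.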
## Behavior

- The model is prompted to answer only from retrieved context.
- Responses should cite chunk ids in square brackets, for example `[mydoc.md-0-a1b2c3d4e5f6]`.
- If no chunks are retrieved, or retrieval confidence is below `score_threshold`, the app abstains with: "I don't have enough evidence in the uploaded documents."
- If LM Studio is down or unreachable, the app returns a friendly warning in chat instead of crashing.
- The vector store persists locally at `VECTOR_STORE_DIR`.
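Putting the guardrails together, the answer path can be sketched as follows. `answer_or_abstain` is a hypothetical helper, assuming retrieval already produced `(chunk_id, score)` pairs:

```python
ABSTAIN = "I don't have enough evidence in the uploaded documents."

def answer_or_abstain(retrieved, generate):
    """Return the abstention message when retrieval found no usable evidence;
    otherwise call `generate` (any LLM-backed callable) with the evidence."""
    if not retrieved:
        return ABSTAIN
    return generate(retrieved)

# Example: a stub generator that cites the top chunk id in square brackets.
print(answer_or_abstain([("mydoc.md-0-a1b2c3d4e5f6", 0.82)],
                        lambda ev: f"Answer based on [{ev[0][0]}]"))
```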