Doclet is a local, privacy-focused RAG (Retrieval-Augmented Generation) assistant that lets you chat with your documents using a small language model running entirely on your CPU. No cloud, no API keys, no data leaving your machine.
Demo Video: Watch on LinkedIn | GitHub
- 100% Local & Private - All processing happens on your machine
- Multi-Format Support - Ingest Markdown, TXT, and PDF documents
- Smart Retrieval - ChromaDB vector database with semantic search
- Interactive Chat UI - Clean Streamlit interface with chat history
- CPU-Optimized - Runs on CPU using quantized Llama 3.2 1B model
- Selective Querying - Choose which documents to include in your queries
- Source Citations - Every answer includes references with relevance scores
- Incremental Ingestion - Only processes new or modified documents
- Zero Cost - No API subscriptions or cloud services needed
- Python 3.8 or higher
- 4GB+ RAM recommended
- ~2GB disk space for model and dependencies
- No GPU required!
- Note: Close resource-intensive applications (video editors, screen recorders) for optimal performance
- Clone the repository

      git clone https://github.com/Arman001/doclet.git
      cd doclet

- Set up the environment

      chmod +x setup_env.sh
      ./setup_env.sh

- Activate the virtual environment

      source venv/bin/activate

- Download the language model

      python setup_models.py

  This downloads the Llama 3.2 1B model (~1.1GB).

Start the application:

    streamlit run app.py

The application will open in your browser at http://localhost:8501
- Click "Browse files" in the sidebar
- Select your .md, .txt, or .pdf files
- Click "Save & Process" to ingest them
- Click "Load Llama 3.2 (1B)" in the main interface
- Wait for the model to initialize (~10-30 seconds)
- Select which documents you want to query (or use all)
- Type your question in the chat input
- View answers with source citations and relevance scores
- Select Documents: Choose specific documents from the sidebar
- Reset Database: Clear all indexed documents
- Clear Chat: Start a fresh conversation
doclet/
├── app.py               # Main Streamlit application
├── ingest.py            # Document ingestion pipeline
├── setup_models.py      # Model download script
├── setup_env.sh         # Environment setup script
├── requirements.txt     # Python dependencies
├── docs/                # Your documents go here
├── models/              # Downloaded LLM models
├── chroma_db/           # Vector database storage
└── ingested_files.json  # Tracking file for incremental updates
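One way the tracking file could support incremental ingestion is sketched below. This is an illustration under assumptions, not Doclet's actual `ingest.py`: the content-hash approach, the JSON layout (path mapped to digest), and the helper names `file_digest`, `find_changed_files`, and `mark_ingested` are all hypothetical.

```python
import hashlib
import json
from pathlib import Path

# Extensions listed in the README; everything else is skipped.
SUPPORTED = {".md", ".txt", ".pdf"}


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_tracking(tracking_file: Path) -> dict:
    """Load the path -> digest map (assumed layout), or {} on first run."""
    if tracking_file.exists():
        return json.loads(tracking_file.read_text())
    return {}


def find_changed_files(docs_dir: Path, tracking_file: Path) -> dict:
    """Map each new or modified document path to its current digest."""
    seen = load_tracking(tracking_file)
    changed = {}
    for path in sorted(docs_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            digest = file_digest(path)
            if seen.get(str(path)) != digest:
                changed[str(path)] = digest
    return changed


def mark_ingested(changed: dict, tracking_file: Path) -> None:
    """Record digests after the changed files have been embedded."""
    seen = load_tracking(tracking_file)
    seen.update(changed)
    tracking_file.write_text(json.dumps(seen, indent=2))
```

Hashing file contents rather than comparing modification times avoids re-embedding a file that was merely touched, at the cost of reading every document on each run.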
Key parameters can be adjusted in app.py:
MODEL_PATH = "models/Llama-3.2-1B-Instruct-Q4_K_M.gguf"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
RELEVANCE_THRESHOLD = 1.3 # Lower = stricter relevance
RETRIEVAL_K = 5 # Number of chunks to retrieve
MAX_CONTEXT_CHUNKS = 3 # Chunks sent to LLM
In ingest.py:
chunk_size = 500 # Characters per chunk
chunk_overlap = 50 # Overlap between chunks
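To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window splitter. It is only a sketch: `split_text` is a hypothetical helper, and the real `ingest.py` may use a library splitter that also respects paragraph or sentence boundaries.

```python
from typing import List


def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults, a 1,000-character document yields three chunks starting at offsets 0, 450, and 900, so the last 50 characters of each chunk reappear at the start of the next.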
- Document Ingestion
  - Documents are loaded and split into chunks
  - Each chunk is embedded using all-MiniLM-L6-v2
  - Embeddings are stored in ChromaDB for fast retrieval
- Query Processing
  - User question is embedded using the same model
  - Top K relevant chunks are retrieved via cosine similarity search
  - Chunks whose score exceeds the relevance threshold are filtered out (lower scores mean closer matches)
- Answer Generation
  - Top 3 most relevant chunks are formatted into a prompt
  - Llama 3.2 1B generates an answer based on the context
  - Response is cleaned to remove artifacts and hallucinations
  - Citations with relevance scores are displayed
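The filtering and prompt-building steps can be sketched as follows. This assumes the retriever returns `(text, source, score)` tuples where a lower score means a closer match, consistent with the "Lower = stricter" note on `RELEVANCE_THRESHOLD`. The helper names `select_context` and `build_prompt` and the prompt wording are hypothetical, not taken from `app.py`.

```python
from typing import List, Tuple

RELEVANCE_THRESHOLD = 1.3  # values from the README configuration
MAX_CONTEXT_CHUNKS = 3

ScoredChunk = Tuple[str, str, float]  # (text, source file, score)


def select_context(
    scored_chunks: List[ScoredChunk],
    threshold: float = RELEVANCE_THRESHOLD,
    max_chunks: int = MAX_CONTEXT_CHUNKS,
) -> List[ScoredChunk]:
    """Drop chunks scoring above the threshold, then keep the closest few."""
    kept = [chunk for chunk in scored_chunks if chunk[2] <= threshold]
    kept.sort(key=lambda chunk: chunk[2])  # closest match first
    return kept[:max_chunks]


def build_prompt(question: str, context: List[ScoredChunk]) -> str:
    """Format numbered context blocks, each labelled with source and score."""
    blocks = [
        f"[{i}] ({source}, score {score:.2f})\n{text}"
        for i, (text, source, score) in enumerate(context, start=1)
    ]
    return (
        "Answer using only the context below.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the blocks in the prompt makes it straightforward to show the same sources and scores as citations next to the generated answer.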
| System Load | Query Complexity | Response Time |
|---|---|---|
| Light (just Doclet) | Simple | 10-15 seconds |
| Light | Complex | 15-20 seconds |
| Heavy (OBS, browsers, etc.) | Simple | 20-30 seconds |
| Heavy | Complex | 30-45 seconds |
Factors affecting speed:
- CPU: Modern CPUs (2020+) perform better
- RAM availability: 8GB+ recommended
- System load: Close unnecessary applications
- Query complexity: Longer questions take more time
- Context size: More retrieved documents = slower
- Background processes: Screen recording software (OBS) significantly impacts performance
Performance Tips:
- Close OBS Studio and screen recorders before using Doclet
- Close browser tabs you're not using
- Reduce MAX_CONTEXT_CHUNKS to 2 for faster responses
- Use shorter, more specific questions
- Consider switching to the 3B model if you have 16GB+ RAM for better answer quality (a larger model is typically slower on CPU, so expect longer response times)
- Technical Documentation: Query your project docs, API references, or manuals
- Research Notes: Search through research papers and notes
- Knowledge Management: Build a personal knowledge base
- Code Documentation: Understand large codebases through documentation
- Learning: Create study materials and quiz yourself
- Corporate Policies: Query company handbooks and procedures privately
- Hallucination: Small models can occasionally generate plausible-sounding but incorrect information
- Context Window: Limited to ~2K tokens, which restricts complex multi-document reasoning
- Language: Optimized for English; other languages may have reduced accuracy
- Complex Queries: Best for straightforward factual questions rather than creative or abstract reasoning
- Speed: CPU inference takes 10-30 seconds per query depending on system load and complexity (faster without screen recording software running)
- Resource Usage: Performance degrades when running alongside resource-intensive applications (OBS, video editors, etc.)
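A rough budget shows how the default settings relate to the ~2K-token context window. The figures below are assumptions for illustration only: about 4 characters per token is a common heuristic for English text, a 2,048-token window stands in for "~2K", and the question and template sizes are guesses rather than values from `app.py`.

```python
CHARS_PER_TOKEN = 4    # rough heuristic for English text (assumption)
CONTEXT_WINDOW = 2048  # stands in for the "~2K tokens" limit (assumption)


def estimate_prompt_tokens(
    chunk_size: int = 500,      # characters per chunk, as in ingest.py
    max_chunks: int = 3,        # MAX_CONTEXT_CHUNKS
    question_chars: int = 200,  # guessed question length
    template_chars: int = 300,  # guessed prompt boilerplate length
) -> int:
    """Estimate prompt size in tokens from character counts."""
    total_chars = chunk_size * max_chunks + question_chars + template_chars
    return -(-total_chars // CHARS_PER_TOKEN)  # ceiling division


def remaining_for_answer(prompt_tokens: int) -> int:
    """Tokens left in the window for the generated answer."""
    return max(CONTEXT_WINDOW - prompt_tokens, 0)
```

Under these assumptions the default prompt comes to roughly 500 tokens, leaving about 1,500 for the answer. Raising `MAX_CONTEXT_CHUNKS` or `chunk_size` shrinks that headroom.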
- Ensure you ran python setup_models.py
- Check that models/Llama-3.2-1B-Instruct-Q4_K_M.gguf exists
- Verify you have enough RAM (4GB+ recommended)
- Try closing other applications to free memory
- Check the relevance threshold (try increasing to 1.5-2.0 in app.py)
- Ensure documents are selected in the sidebar
- Verify documents were successfully ingested (check logs)
- Try shorter, more specific questions
- Reduce MAX_CONTEXT_CHUNKS to 2
- Use smaller documents or split large ones
- Close other applications to free up RAM and CPU
- Consider using the GPU version if you have an NVIDIA GPU
- Lower the RELEVANCE_THRESHOLD to be more strict
- Reduce MAX_CONTEXT_CHUNKS for more focused context
- Rephrase questions to be more specific
- Check if the information actually exists in your documents
- Add support for DOCX and HTML files
- Implement conversation memory across sessions
- Add GPU acceleration support
- Multi-language document support
- Export chat history
- Fine-tune model on specific domains
- Add web scraping for URL ingestion
- Implement advanced filtering and search
- LangChain: RAG orchestration framework
- llama-cpp-python: CPU-optimized LLM inference
- ChromaDB: Vector database
- Streamlit: Web UI framework
- sentence-transformers: Embedding models
- PyTorch: ML framework (CPU-only)
See requirements.txt for the complete list.
Contributions are welcome! Here's how you can help:
- Hallucination reduction techniques
- Better prompt engineering
- Performance optimizations
- Additional file format support
- UI/UX enhancements
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Please read CONTRIBUTING.md for details on our code of conduct.
Found a bug or have a feature request? Please open an issue.
This project is licensed under the MIT License - see the LICENSE file for details.
- Meta AI for Llama 3.2
- LangChain for the RAG framework
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- Streamlit for the amazing UI framework
- GitHub Issues: Project Issues
- LinkedIn: Muhammad Saad
- Email: muhammad.saad.ar@gmail.com
If you find this project useful, please consider:
- Giving it a star on GitHub
- Sharing it with others who might benefit
- Contributing improvements
- Reporting bugs or suggesting features
Built with love for privacy-conscious developers
Last updated: December 2025