A lab to design, run, and compare Retrieval-Augmented Generation (RAG) experiments in a repeatable way.
Base repository structure:
- `datasets/`: versions of evaluation datasets and/or source corpora.
- `rag_core/`: core logic for chunking, retrieval, prompting, and shared contracts.
- `experiments/`: run definitions and experiment artifacts per configuration.
- `evaluation/`: metrics, scripts, and reports used to compare results.
- `ui/`: result visualization, dashboards, or inspection tools.
- `docs/`: guides, conceptual contracts, and design decisions.
Why this comes first: without a common skeleton, each experiment tends to invent its own structure, and A/B comparisons become ambiguous and expensive.
- `dataset_id`: stable identifier for a dataset (for example: `faq_en_v1`).
- `run_id`: unique execution identifier (for example: `2026-02-21_hybrid_bm25_k20`).
- `doc_id`: source document ID inside a dataset (for example: `doc_000123`).
- `chunk_id`: chunk ID derived from `doc_id` + chunking policy (for example: `doc_000123_c004`).
Practical rules:
- Avoid spaces and uppercase letters.
- Prefer `snake_case`.
- Keep `run_id` readable (date + strategy + key parameter); see the sketch after this list.
- Leave explicit slots (folders and contracts) to add retrievers, rerankers, and prompt strategies.
- Avoid premature abstraction until at least two real variants justify a shared layer.
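A minimal sketch of how a `run_id` could be composed following these conventions (the helper name is hypothetical, not part of `rag_core`):

```python
from datetime import date

def make_run_id(strategy: str, key_param: str, run_date: date | None = None) -> str:
    """Hypothetical helper: date + strategy + key parameter, snake_case, no spaces."""
    run_date = run_date or date.today()
    return f"{run_date.isoformat()}_{strategy}_{key_param}".lower().replace(" ", "_")

# e.g. make_run_id("hybrid_bm25", "k20") on 2026-02-21 -> "2026-02-21_hybrid_bm25_k20"
```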
An environment to iterate on RAG with experimental discipline: same data, controlled configuration changes, and systematic metric comparison.
- Identify RAG configurations that improve factual quality and traceability.
- Measure trade-offs between retrieval coverage, citation quality, and responsible abstention.
- Prepare or select a `dataset_id`.
- Define an experiment configuration (chunking + retrieval + prompt contract).
- Execute a run and record its `run_id`.
- Evaluate outputs with standardized metrics.
- Keep the dataset and question set fixed.
- Change one primary dimension at a time per comparison (for example chunk size or retriever).
- Report side-by-side metrics under the same evaluation protocol.
- recall@k: share of cases where relevant context appears in top-k retrieved items.
- citation precision: fraction of model citations that actually support the corresponding claim.
- abstention: ability to avoid answering when evidence is insufficient.
Quick interpretation:
- High recall with low citation precision may indicate grounding noise.
- High citation precision with low recall may indicate insufficient coverage.
- Good abstention reduces hallucinations, but too much abstention can hurt usefulness.
- Comparing runs built on different datasets without explicit labeling.
- Changing multiple variables in an A/B test and losing attribution.
- Measuring only textual correctness without citation quality checks.
- Ignoring “no answer” cases during performance analysis.
See details in docs/config_contract.md.
Minimum conceptual contract every configuration should declare:
- `chunking`: how content is split and referenced.
- `retrieval`: how evidence is retrieved (and with which parameters).
- `prompt_contract`: required response format and citation/abstention behavior.
Why define this early: if the contract is not fixed up front, experiments become hard to compare reliably even when they look similar.
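As a concrete illustration, a minimal sketch of what such a configuration declaration could look like (the field values below are hypothetical examples, not prescribed defaults; see docs/config_contract.md for the authoritative schema):

```python
# Hypothetical configuration snapshot; key names follow the contract above,
# exact schema details live in docs/config_contract.md.
experiment_config = {
    "dataset_id": "faq_en_v1",
    "chunking": {"strategy": "fixed_size", "chunk_size": 512, "overlap": 64},
    "retrieval": {"method": "vector", "top_k": 5},
    "prompt_contract": {
        "require_citations": True,       # every claim must cite chunk IDs
        "abstain_if_no_evidence": True,  # prefer "no answer" over unsupported claims
    },
}
```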
This repository now includes a minimal rag_core/ implementation that covers:
- Ingestion + normalization with `dataset_id`, `doc_id`, `version`, `section_id` metadata.
- Two chunking strategies: `fixed_size` and `by_headings`.
- Deterministic `chunk_id` generation.
- Persistent vector index versioned by (dataset + chunking config) under `experiments/indexes/`.
- Basic retriever with `top_k` and score traces in `experiments/traces/`.
- Context builder + response contract with citation IDs and a simple abstention policy.
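For intuition, deterministic `chunk_id` generation can be achieved by deriving the ID from the document ID, a fingerprint of the chunking configuration, and the chunk position, so re-running the same config over the same data yields identical IDs. A minimal sketch under that assumption (not necessarily the exact scheme used by `rag_core`):

```python
import hashlib
import json

def deterministic_chunk_id(doc_id: str, chunking_config: dict, chunk_index: int) -> str:
    """Stable chunk ID: same doc + same chunking config + same position -> same ID."""
    config_fingerprint = hashlib.sha1(
        json.dumps(chunking_config, sort_keys=True).encode("utf-8")
    ).hexdigest()[:8]
    return f"{doc_id}_{config_fingerprint}_c{chunk_index:03d}"

# The fingerprint changes only if the chunking config changes.
cfg = {"strategy": "by_headings", "max_chars": 1200}
print(deterministic_chunk_id("doc_000123", cfg, 4))
```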
Quick run example:
```bash
python -m rag_core.run_pipeline \
  --dataset-id dataset3_hierarchical_manual \
  --docs-path datasets/dataset3_hierarchical_manual/docs.json \
  --chunking-strategy by_headings \
  --question "How do I reset the unit?"
```

Install dependencies:

```bash
pip install -e .
```

Configure LLM connection variables:

```bash
cp .env.example .env
# then edit .env with your keys/provider/model
```

The CLI loads .env automatically and reports the configured provider/model.
Current implementation uses LangChain + OpenAI embeddings for vectorization and retrieval scoring.
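For orientation, a minimal sketch of how embedding-based retrieval scoring can be wired with LangChain's OpenAI embeddings; the actual `rag_core` wiring (index persistence, score tracing) is more involved:

```python
# Requires the langchain-openai package and an OPENAI_API_KEY in the environment.
from langchain_openai import OpenAIEmbeddings

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

embeddings = OpenAIEmbeddings()
chunk_texts = [
    "To reset the unit, hold the power button for 10 seconds.",
    "The warranty covers two years from purchase.",
]
chunk_vectors = embeddings.embed_documents(chunk_texts)
query_vector = embeddings.embed_query("How do I reset the unit?")

# Score every chunk against the query and keep the best-scoring one.
scores = sorted(
    ((cosine(query_vector, v), text) for v, text in zip(chunk_vectors, chunk_texts)),
    reverse=True,
)
print(scores[0])
```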
Run batch A/B experiments over a fixed question set and persist a reproducible run record:
```bash
python -m rag_core.run_experiments \
  --run-id 2026-02-22_chunking_ab \
  --dataset-id dataset1_regulations_versions \
  --docs-path datasets/dataset1_regulations_versions/docs.json \
  --questions-path datasets/dataset1_regulations_versions/questions.json \
  --config-a-chunking-strategy fixed_size \
  --config-b-chunking-strategy by_headings
```

This generates `experiments/run_records/<run_id>/run_record.json` including:
- config snapshot per variant (A/B),
- per-question traces and responses,
- aggregated metrics (`answered`, `abstained`, `citation_hit_rate`),
- `context_sent_to_llm` saved as chunk IDs (not full text) for lean reproducibility.
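A quick way to inspect a run record could look like this (illustrative only; the exact nesting may differ, and the generated file itself is the source of truth):

```python
import json
from pathlib import Path

# Adjust the run_id to one of your own runs.
record_path = Path("experiments/run_records/2026-02-22_chunking_ab/run_record.json")
record = json.loads(record_path.read_text(encoding="utf-8"))

# Expect config snapshots, per-question traces, and aggregated metrics per variant.
print(sorted(record.keys()))
```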
To ensure every run can be compared from day one, track this minimum KPI set:
- Evidence Recall@k (`evidence_recall_at_k`): fraction of questions where expected evidence appears in the retrieved top-k context.
- Citation Precision (`citation_precision`): average fraction of cited chunks that match expected evidence.
- Answer Correctness (simple) (`answer_correctness`): lightweight rubric scored as 0/1/2.
  - 0: incorrect or unsupported answer.
  - 1: partially correct (token overlap heuristic).
  - 2: correct (expected answer content matched).
- Abstention Correctness (`abstention_correctness`): whether the model abstained when it should (and answered when abstention was not required).
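To make these definitions concrete, a minimal sketch of how evidence recall@k and citation precision could be computed for a single question (the scripts in `evaluation/` remain the source of truth):

```python
def evidence_recall_at_k(retrieved_chunk_ids: list[str], expected_evidence: set[str], k: int) -> float:
    """1.0 if any expected evidence chunk appears in the top-k retrieved chunks, else 0.0."""
    return 1.0 if expected_evidence & set(retrieved_chunk_ids[:k]) else 0.0

def citation_precision(cited_chunk_ids: list[str], expected_evidence: set[str]) -> float:
    """Fraction of cited chunks that match expected evidence (0.0 when nothing is cited)."""
    if not cited_chunk_ids:
        return 0.0
    hits = sum(1 for c in cited_chunk_ids if c in expected_evidence)
    return hits / len(cited_chunk_ids)

# Per-question example; dataset-level KPIs average these over the question set.
retrieved = ["doc_000123_c004", "doc_000123_c005", "doc_000200_c001"]
expected = {"doc_000123_c004"}
print(evidence_recall_at_k(retrieved, expected, k=2))                          # 1.0
print(citation_precision(["doc_000123_c004", "doc_000200_c001"], expected))   # 0.5
```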
Why this matters: without shared metrics, teams cannot learn comparatively across iterations.
For quick failure inspection in ~30 seconds, there is a Gradio panel with four screens:
- A/B config + dataset selector
- Top-k chunks + scores + metadata
- Answer + citations + abstention
- Metrics + A/B delta + examples

Run it with:

```bash
python -m ui.gradio_lab_panel
```

Suggested usage: read and inspect first (understand why a case fails), then work on visual improvements.
Use the dataset builder CLI to generate a dataset folder from .txt/.md files or a source .json payload:
```bash
python -m rag_core.build_dataset \
  --dataset-id my_manual_dataset \
  --input-path path/to/source_docs \
  --out-dir datasets/my_manual_dataset
```

Validate the generated files:

```bash
python -m rag_core.validate_dataset \
  --dataset-id my_manual_dataset \
  --docs-path datasets/my_manual_dataset/docs.json \
  --questions-path datasets/my_manual_dataset/questions.json
```

For manual curation, run the Gradio dataset builder UI:

```bash
python -m ui.dataset_builder_app
```

In the UI you can:
- set `dataset_id` and the export path (`datasets/<dataset_id>`),
- upload multiple `.md`/`.txt` files or add a pasted document,
- preview docs and sections before export,
- add questions (`Q1`, `Q2`, ...) with optional expected evidence links,
- export and run built-in schema validation (PASS/FAIL).
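As a rough sketch, the exported files might look like the following; the field names mirror the IDs and metadata mentioned above, but the shapes here are assumptions, so rely on `rag_core.validate_dataset` for the authoritative schema:

```python
import json

# Hypothetical shapes for docs.json and questions.json, for orientation only.
docs = [
    {
        "doc_id": "doc_000001",
        "dataset_id": "my_manual_dataset",
        "version": "v1",
        "sections": [
            {"section_id": "s1", "title": "Reset procedure", "text": "Hold the power button ..."}
        ],
    }
]
questions = [
    {
        "question_id": "Q1",
        "question": "How do I reset the unit?",
        "expected_evidence": ["doc_000001"],  # optional evidence links
        "should_abstain": False,
    }
]
print(json.dumps(docs, indent=2))
print(json.dumps(questions, indent=2))
```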
Use the generated dataset with experiments:
```bash
python -m rag_core.run_experiments \
  --run-id 2026-03-01_my_dataset_ab \
  --dataset-id my_manual_dataset \
  --docs-path datasets/my_manual_dataset/docs.json \
  --questions-path datasets/my_manual_dataset/questions.json \
  --config-a-chunking-strategy fixed_size \
  --config-b-chunking-strategy by_headings
```

If you want a direct guide to running a full cycle (hypothesis -> A/B run -> metrics -> decision), see:
It includes copy/paste commands and a minimal template to document decisions per run_id.