A lab to design, run, and compare Retrieval-Augmented Generation (RAG) experiments in a repeatable way.
Base repository structure:
- `datasets/`: versions of evaluation datasets and/or source corpora.
- `rag_core/`: core logic for chunking, retrieval, prompting, and shared contracts.
- `experiments/`: run definitions and experiment artifacts per configuration.
- `evaluation/`: metrics, scripts, and reports used to compare results.
- `ui/`: result visualization, dashboards, or inspection tools.
- `docs/`: guides, conceptual contracts, and design decisions.
Why this comes first: without a common skeleton, each experiment tends to invent its own structure, and A/B comparisons become ambiguous and expensive.
- `dataset_id`: stable identifier for a dataset (for example: `faq_en_v1`).
- `run_id`: unique execution identifier (for example: `2026-02-21_hybrid_bm25_k20`).
- `doc_id`: source document ID inside a dataset (for example: `doc_000123`).
- `chunk_id`: chunk ID derived from `doc_id` + chunking policy (for example: `doc_000123_c004`).
Practical rules:
- Avoid spaces and uppercase letters.
- Prefer `snake_case`.
- Keep `run_id` readable (date + strategy + key parameter); see the sketch after this list.
- Leave explicit slots (folders and contracts) to add retrievers, rerankers, and prompt strategies.
- Avoid premature abstraction until at least two real variants justify a shared layer.
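A minimal sketch of how a `run_id` could be composed following these conventions (the helper name is hypothetical, not part of `rag_core`):

```python
from datetime import date

def make_run_id(strategy: str, key_param: str, run_date: date | None = None) -> str:
    """Hypothetical helper: date + strategy + key parameter, snake_case, no spaces."""
    run_date = run_date or date.today()
    return f"{run_date.isoformat()}_{strategy}_{key_param}".lower().replace(" ", "_")

# e.g. make_run_id("hybrid_bm25", "k20") on 2026-02-21 -> "2026-02-21_hybrid_bm25_k20"
```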
An environment to iterate on RAG with experimental discipline: same data, controlled configuration changes, and systematic metric comparison.
- Identify RAG configurations that improve factual quality and traceability.
- Measure trade-offs between retrieval coverage, citation quality, and responsible abstention.
- Prepare or select a `dataset_id`.
- Define an experiment configuration (chunking + retrieval + prompt contract).
- Execute a run and record its `run_id`.
- Evaluate outputs with standardized metrics.
- Keep the dataset and question set fixed.
- Change one primary dimension at a time per comparison (for example chunk size or retriever).
- Report side-by-side metrics under the same evaluation protocol.
- recall@k: share of cases where relevant context appears in top-k retrieved items.
- citation precision: fraction of model citations that actually support the corresponding claim.
- abstention: ability to avoid answering when evidence is insufficient.
Quick interpretation:
- High recall with low citation precision may indicate grounding noise.
- High citation precision with low recall may indicate insufficient coverage.
- Good abstention reduces hallucinations, but too much abstention can hurt usefulness.
- Comparing runs built on different datasets without explicit labeling.
- Changing multiple variables in an A/B test and losing attribution.
- Measuring only textual correctness without citation quality checks.
- Ignoring “no answer” cases during performance analysis.
See details in docs/config_contract.md.
Minimum conceptual contract every configuration should declare:
- `chunking`: how content is split and referenced.
- `retrieval`: how evidence is retrieved (and with which parameters).
- `prompt_contract`: required response format and citation/abstention behavior.
Why define this early: if the contract is not fixed up front, experiments become hard to compare reliably even when they look similar.
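As a concrete illustration, a minimal sketch of what such a configuration declaration could look like (the field values below are hypothetical examples, not prescribed defaults; see docs/config_contract.md for the authoritative schema):

```python
# Hypothetical configuration snapshot; key names follow the contract above,
# exact schema details live in docs/config_contract.md.
experiment_config = {
    "dataset_id": "faq_en_v1",
    "chunking": {"strategy": "fixed_size", "chunk_size": 512, "overlap": 64},
    "retrieval": {"method": "vector", "top_k": 5},
    "prompt_contract": {
        "require_citations": True,       # every claim must cite chunk IDs
        "abstain_if_no_evidence": True,  # prefer "no answer" over unsupported claims
    },
}
```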
This repository now includes a minimal rag_core/ implementation that covers:
- Ingestion + normalization with `dataset_id`, `doc_id`, `version`, `section_id` metadata.
- Two chunking strategies: `fixed_size` and `by_headings`.
- Deterministic `chunk_id` generation.
- Persistent vector index versioned by (dataset + chunking config) under `experiments/indexes/`.
- Basic retriever with `top_k` and score traces in `experiments/traces/`.
- Context builder + response contract with citation IDs and a simple abstention policy.
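For intuition, deterministic `chunk_id` generation can be achieved by deriving the ID from the document ID, a fingerprint of the chunking configuration, and the chunk position, so re-running the same config over the same data yields identical IDs. A minimal sketch under that assumption (not necessarily the exact scheme used by `rag_core`):

```python
import hashlib
import json

def deterministic_chunk_id(doc_id: str, chunking_config: dict, chunk_index: int) -> str:
    """Stable chunk ID: same doc + same chunking config + same position -> same ID."""
    config_fingerprint = hashlib.sha1(
        json.dumps(chunking_config, sort_keys=True).encode("utf-8")
    ).hexdigest()[:8]
    return f"{doc_id}_{config_fingerprint}_c{chunk_index:03d}"

# The fingerprint changes only if the chunking config changes.
cfg = {"strategy": "by_headings", "max_chars": 1200}
print(deterministic_chunk_id("doc_000123", cfg, 4))
```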
Quick run example:
```bash
python -m rag_core.run_pipeline \
  --dataset-id dataset3_hierarchical_manual \
  --docs-path datasets/dataset3_hierarchical_manual/docs.json \
  --chunking-strategy by_headings \
  --question "How do I reset the unit?"
```

Install dependencies:

```bash
pip install -e .
```

Configure LLM connection variables:

```bash
cp .env.example .env
# then edit .env with your keys/provider/model
```

The CLI loads .env automatically and reports the configured provider/model.
Current implementation uses LangChain + OpenAI embeddings for vectorization and retrieval scoring.
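For orientation, a minimal sketch of how embedding-based retrieval scoring can be wired with LangChain's OpenAI embeddings; the actual `rag_core` wiring (index persistence, score tracing) is more involved:

```python
# Requires the langchain-openai package and an OPENAI_API_KEY in the environment.
from langchain_openai import OpenAIEmbeddings

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

embeddings = OpenAIEmbeddings()
chunk_texts = [
    "To reset the unit, hold the power button for 10 seconds.",
    "The warranty covers two years from purchase.",
]
chunk_vectors = embeddings.embed_documents(chunk_texts)
query_vector = embeddings.embed_query("How do I reset the unit?")

# Score every chunk against the query and keep the best-scoring one.
scores = sorted(
    ((cosine(query_vector, v), text) for v, text in zip(chunk_vectors, chunk_texts)),
    reverse=True,
)
print(scores[0])
```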
Run batch A/B experiments over a fixed question set and persist a reproducible run record:
```bash
python -m rag_core.run_experiments \
  --run-id 2026-02-22_chunking_ab \
  --dataset-id dataset1_regulations_versions \
  --docs-path datasets/dataset1_regulations_versions/docs.json \
  --questions-path datasets/dataset1_regulations_versions/questions.json \
  --config-a-chunking-strategy fixed_size \
  --config-b-chunking-strategy by_headings
```

This generates `experiments/run_records/<run_id>/run_record.json` including:
- config snapshot per variant (A/B),
- per-question traces and responses,
- aggregated metrics (`answered`, `abstained`, `citation_hit_rate`),
- `context_sent_to_llm` saved as chunk IDs (not full text) for lean reproducibility.
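A quick way to inspect a run record could look like this (illustrative only; the exact nesting may differ, and the generated file itself is the source of truth):

```python
import json
from pathlib import Path

# Adjust the run_id to one of your own runs.
record_path = Path("experiments/run_records/2026-02-22_chunking_ab/run_record.json")
record = json.loads(record_path.read_text(encoding="utf-8"))

# Expect config snapshots, per-question traces, and aggregated metrics per variant.
print(sorted(record.keys()))
```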
To ensure every run can be compared from day one, track this minimum KPI set:
- Evidence Recall@k (`evidence_recall_at_k`): fraction of questions where expected evidence appears in the retrieved top-k context.
- Citation Precision (`citation_precision`): average fraction of cited chunks that match expected evidence.
- Answer Correctness (simple) (`answer_correctness`): lightweight rubric scored as 0/1/2.
  - 0: incorrect or unsupported answer.
  - 1: partially correct (token overlap heuristic).
  - 2: correct (expected answer content matched).
- Abstention Correctness (`abstention_correctness`): whether the model abstained when it should (and answered when abstention was not required).
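To make these definitions concrete, a minimal sketch of how evidence recall@k and citation precision could be computed for a single question (the scripts in `evaluation/` remain the source of truth):

```python
def evidence_recall_at_k(retrieved_chunk_ids: list[str], expected_evidence: set[str], k: int) -> float:
    """1.0 if any expected evidence chunk appears in the top-k retrieved chunks, else 0.0."""
    return 1.0 if expected_evidence & set(retrieved_chunk_ids[:k]) else 0.0

def citation_precision(cited_chunk_ids: list[str], expected_evidence: set[str]) -> float:
    """Fraction of cited chunks that match expected evidence (0.0 when nothing is cited)."""
    if not cited_chunk_ids:
        return 0.0
    hits = sum(1 for c in cited_chunk_ids if c in expected_evidence)
    return hits / len(cited_chunk_ids)

# Per-question example; dataset-level KPIs average these over the question set.
retrieved = ["doc_000123_c004", "doc_000123_c005", "doc_000200_c001"]
expected = {"doc_000123_c004"}
print(evidence_recall_at_k(retrieved, expected, k=2))                          # 1.0
print(citation_precision(["doc_000123_c004", "doc_000200_c001"], expected))   # 0.5
```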
Why this matters: without shared metrics, teams cannot learn comparatively across iterations.
For quick failure inspection in ~30 seconds, there is a Gradio panel with four screens:
- A/B config + dataset selector
- Top-k chunks + scores + metadata
- Answer + citations + abstention
- Metrics + A/B delta + examples

Run it with:

```bash
python -m ui.gradio_lab_panel
```

Suggested usage: read and inspect first (understand why a case fails), then work on visual improvements.
Use the dataset builder CLI to generate a dataset folder from .txt/.md files or a source .json payload:
```bash
python -m rag_core.build_dataset \
  --dataset-id my_manual_dataset \
  --input-path path/to/source_docs \
  --out-dir datasets/my_manual_dataset
```

Validate the generated files:

```bash
python -m rag_core.validate_dataset \
  --dataset-id my_manual_dataset \
  --docs-path datasets/my_manual_dataset/docs.json \
  --questions-path datasets/my_manual_dataset/questions.json
```

For manual curation, run the Gradio dataset builder UI:

```bash
python -m ui.dataset_builder_app
```

In the UI you can:
- set `dataset_id` and the export path (`datasets/<dataset_id>`),
- upload multiple `.md`/`.txt` files or add a pasted document,
- preview docs and sections before export,
- add questions (`Q1`, `Q2`, ...) with optional expected evidence links,
- export and run built-in schema validation (PASS/FAIL).
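As a rough sketch, the exported files might look like the following; the field names mirror the IDs and metadata mentioned above, but the shapes here are assumptions, so rely on `rag_core.validate_dataset` for the authoritative schema:

```python
import json

# Hypothetical shapes for docs.json and questions.json, for orientation only.
docs = [
    {
        "doc_id": "doc_000001",
        "dataset_id": "my_manual_dataset",
        "version": "v1",
        "sections": [
            {"section_id": "s1", "title": "Reset procedure", "text": "Hold the power button ..."}
        ],
    }
]
questions = [
    {
        "question_id": "Q1",
        "question": "How do I reset the unit?",
        "expected_evidence": ["doc_000001"],  # optional evidence links
        "should_abstain": False,
    }
]
print(json.dumps(docs, indent=2))
print(json.dumps(questions, indent=2))
```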
Use the generated dataset with experiments:
```bash
python -m rag_core.run_experiments \
  --run-id 2026-03-01_my_dataset_ab \
  --dataset-id my_manual_dataset \
  --docs-path datasets/my_manual_dataset/docs.json \
  --questions-path datasets/my_manual_dataset/questions.json \
  --config-a-chunking-strategy fixed_size \
  --config-b-chunking-strategy by_headings
```

If you want a direct guide to running a full cycle (hypothesis -> A/B run -> metrics -> decision), see:
It includes copy/paste commands and a minimal template to document decisions per run_id.