
Project Design

Project Scope & Goals

Core Objective: Fine-tune a CodeLlama (or similar) model to translate plain-English descriptions of neutron experiments into valid MCViNE instrument Python files and XML files for sample assemblies.

First step: focus on sample assembly XMLs, since they are harder to write manually.

For each "sample assembly" directory (e.g. sampleassembly-examples/Al-E_Q-kernel), the generator should produce multiple related files in that same directory, not just one XML. At minimum this typically includes:

  • a top-level sampleassembly.xml
  • one or more material/atom coordinate files (e.g. Al.xyz)
  • one or more scatterer definition XML files referenced by sampleassembly.xml (e.g. Al-scatterer.xml)
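For orientation, a top-level sampleassembly.xml looks roughly like the skeleton below. This is an illustrative sketch only: element and attribute names are approximations and should be checked against a real example directory such as sampleassembly-examples/Al-E_Q-kernel.

```xml
<!-- Illustrative skeleton only: element/attribute names are approximate,
     not a verified MCViNE schema -->
<SampleAssembly name="Al">
  <PowderSample name="Al" type="sample">
    <Shape>
      <cylinder radius="5.*mm" height="50.*mm"/>
    </Shape>
    <Phase type="crystal">
      <ChemicalFormula>Al</ChemicalFormula>
      <xyzfile>Al.xyz</xyzfile>
    </Phase>
  </PowderSample>

  <LocalGeometer registry-coordinate-system="InstrumentScientist">
    <Register name="Al" position="(0,0,0)" orientation="(0,0,0)"/>
  </LocalGeometer>
</SampleAssembly>
```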

Success Criteria:

  • Generate syntactically valid XML (parses correctly)
  • Generate semantically correct XML (produces physically meaningful simulations)
  • Achieve >80% acceptance rate on a held-out test set (compared to human-written XML)

Technical Implementation Strategy

Phase 1: Data Generation

Step 1: Create a Parameterized XML Template Library

  • Catalog common sample configurations
  • Identify the key XML blocks
  • Create base templates with semantic placeholders
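The "base templates with semantic placeholders" idea can be sketched with Python's standard-library string.Template. The placeholder names and XML shape here are illustrative, not MCViNE's actual schema:

```python
from string import Template

# Base template with semantic placeholders (names are illustrative)
CYLINDER_SAMPLE_TMPL = Template("""\
<PowderSample name="$name">
  <Shape><cylinder radius="$radius" height="$height"/></Shape>
  <Phase type="crystal"><ChemicalFormula>$formula</ChemicalFormula></Phase>
</PowderSample>
""")

# Filling the template for a vanadium cylinder
xml = CYLINDER_SAMPLE_TMPL.substitute(
    name="V", formula="V", radius="5.*mm", height="50.*mm")
print(xml)
```

Keeping placeholders semantic ($formula, $radius) rather than positional makes it easy to catalog which slots each sample configuration actually needs.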

Step 2: Build a Synthetic Data Generator

# Pseudocode for your generator
from itertools import product

instruments = ["ARCS", "SEQUOIA", "SNAP"]   # example instrument names
samples = ["vanadium", "water", "phonon powder", "single crystal spinwave"]
physics_params = {
    "temperature": [10, 100, 300],    # K
    ...
}

dataset = []

# Note: iterate over the *values* of physics_params, not the dict itself
for combination in product(instruments, samples, *physics_params.values()):
    # Generate English description
    description = generate_natural_language(combination)

    # Fill XML template based on combination
    xml = fill_template(combination)

    # Save (instruction, output) pair
    dataset.append({"instruction": description, "output": xml})
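A minimal runnable version of that loop, with toy stand-ins for the description and templating steps (generate_natural_language and fill_template here are placeholders, not real implementations):

```python
from itertools import product

instruments = ["ARCS", "SNAP"]        # hypothetical instrument names
samples = ["vanadium", "water"]
temperatures = [10, 100, 300]         # K

def generate_natural_language(instr, sample, temp):
    # Toy stand-in for a real paraphrasing/description step
    return f"Simulate a {sample} sample on {instr} at {temp} K."

def fill_template(instr, sample, temp):
    # Toy stand-in for real sampleassembly.xml templating
    return f'<sample name="{sample}" temperature="{temp}*K"/>'

dataset = [
    {"instruction": generate_natural_language(i, s, t),
     "output": fill_template(i, s, t)}
    for i, s, t in product(instruments, samples, temperatures)
]
print(len(dataset))   # 2 instruments * 2 samples * 3 temperatures = 12 pairs
```

In practice each combination would also get several paraphrased descriptions, so the pair count grows well beyond the raw parameter grid.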

Step 3: Add Complexity Gradually

  • Start with single-component descriptions
  • Progress to multi-component ("I want a SNAP configuration with a vanadium sample at 100K and high resolution")
  • Include edge cases and common user mistakes
  • Add negative examples (descriptions that should fail)
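The curriculum above might translate into dataset entries like these (contents illustrative):

```python
curriculum = [
    # Stage 1: single-component description
    {"instruction": "A vanadium powder sample at 300 K.", "stage": 1},
    # Stage 2: multi-component description
    {"instruction": "I want a SNAP configuration with a vanadium sample "
                    "at 100K and high resolution", "stage": 2},
    # Negative example: the model should refuse, not guess
    {"instruction": "A sample made of element 140 at -50 K.",
     "stage": 3, "output": "ERROR: unphysical parameters"},
]
print([entry["stage"] for entry in curriculum])   # [1, 2, 3]
```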

Target Dataset Size: 10,000-50,000 examples (quality over quantity)


Phase 2: Model Selection & Fine-tuning Strategy

Base Model Choice:

  • CodeLlama-7B or 13B (strong code generation, permissive license)
  • DeepSeek-Coder (another strong option)
  • Mistral-7B (general purpose, good instruction following)

Fine-tuning Approach:

# Use QLoRA for efficiency (train on 1-2 GPUs)
# Key configuration:
- 4-bit quantization
- LoRA rank = 64
- Target modules: q_proj, v_proj, k_proj, o_proj
- Batch size: Start small and scale
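Assuming the HuggingFace peft + bitsandbytes stack, the configuration above corresponds roughly to the following sketch (hyperparameters are the ones listed above; lora_alpha and dropout are common defaults, not prescribed values):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter: rank 64, attention projection targets
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,                 # common default; tune as needed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```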

Training Framework:

  • Use Axolotl or HuggingFace TRL (SFTTrainer)
  • These handle the complexity of QLoRA, data formatting, and checkpointing

Phase 3: Validation & Evaluation (Critical)

This separates a toy project from a serious portfolio piece.

Automated Validation Pipeline:

def validate_xml(generated_xml, ground_truth_xml):
    checks = {
        "syntactic": can_parse(generated_xml),
        "semantic": compare_simulation_output(generated_xml, ground_truth_xml),
        "physics": check_energy_conservation(generated_xml),
        "completeness": all_required_components_present(generated_xml)
    }
    return checks
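The syntactic and completeness checks in the pipeline above can be implemented with the standard library alone; the required-tag set here is a hypothetical placeholder, and the semantic/physics checks would need actual MCViNE runs:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimum set of elements a sample assembly must contain
REQUIRED_TAGS = {"SampleAssembly", "LocalGeometer"}

def can_parse(xml_text):
    """Syntactic check: does the XML parse at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def all_required_components_present(xml_text):
    """Completeness check: are all required elements present?"""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    tags = {el.tag for el in root.iter()}   # iter() includes the root
    return REQUIRED_TAGS <= tags

good = "<SampleAssembly><LocalGeometer/></SampleAssembly>"
print(can_parse(good), all_required_components_present(good))   # True True
print(can_parse("<oops>"))                                      # False
```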

Test Set Design:

  • 20% hold-out from synthetic data
  • 50 hand-written examples from your expertise
  • 20 "challenge" cases that require reasoning

Phase 4: Demonstration (The Portfolio Piece)

Create a simple interface that showcases the work:

# Gradio app example
import gradio as gr

def generate_instrument(description):
    # `model` and `run_mcvine_check` are placeholders for the
    # fine-tuned model wrapper and a quick MCViNE validation pass
    xml = model.generate(description)
    validation = run_mcvine_check(xml)
    return xml, validation

gr.Interface(
    fn=generate_instrument,
    inputs="text",
    outputs=["code", "json"],
    title="MCViNE Natural Language Interface",
    description="Describe your neutron experiment in plain English"
).launch()

Timeline Estimate (Part-time)

| Phase | Duration | Key Deliverable |
| --- | --- | --- |
| Data Pipeline | 2-3 weeks | Generator script + 10k examples |
| Training Setup | 1 week | Working fine-tuning environment |
| Initial Training | 1 week | First model checkpoint |
| Evaluation & Iteration | 2 weeks | Validation metrics + improvements |
| Demo & Documentation | 1 week | Gradio app + GitHub repo |
| Total | 7-8 weeks | Complete project portfolio |