
Project Design

Project Scope & Goals

Core Objective: Fine-tune a CodeLlama (or similar) model to translate plain-English descriptions of neutron experiments into valid MCViNE instrument Python files and XML files for sample assemblies.

First step: focus on sample assembly XMLs, since they are harder to write manually.

For each "sample assembly" directory (e.g. sampleassembly-examples/Al-E_Q-kernel), the generator should produce multiple related files in that same directory, not just one XML. At minimum this typically includes:

  • a top-level sampleassembly.xml
  • one or more material/atom coordinate files (e.g. Al.xyz)
  • one or more scatterer definition XML files referenced by sampleassembly.xml (e.g. Al-scatterer.xml)
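For orientation, a top-level sampleassembly.xml looks roughly like the skeleton below. This is an illustrative sketch only: element and attribute names are approximations and should be checked against a real example directory such as sampleassembly-examples/Al-E_Q-kernel.

```xml
<!-- Illustrative skeleton only: element/attribute names are approximate,
     not a verified MCViNE schema -->
<SampleAssembly name="Al">
  <PowderSample name="Al" type="sample">
    <Shape>
      <cylinder radius="5.*mm" height="50.*mm"/>
    </Shape>
    <Phase type="crystal">
      <ChemicalFormula>Al</ChemicalFormula>
      <xyzfile>Al.xyz</xyzfile>
    </Phase>
  </PowderSample>

  <LocalGeometer registry-coordinate-system="InstrumentScientist">
    <Register name="Al" position="(0,0,0)" orientation="(0,0,0)"/>
  </LocalGeometer>
</SampleAssembly>
```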

Success Criteria:

  • Generate syntactically valid XML (parses correctly)
  • Generate semantically correct XML (produces physically meaningful simulations)
  • Achieve >80% acceptance rate on a held-out test set (compared to human-written XML)

Technical Implementation Strategy

Phase 1: Data Generation

Step 1: Create a Parameterized XML Template Library

  • Catalog common sample configurations
  • Identify the key XML blocks
  • Create base templates with semantic placeholders
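The "base templates with semantic placeholders" idea can be sketched with Python's standard-library string.Template. The placeholder names and XML shape here are illustrative, not MCViNE's actual schema:

```python
from string import Template

# Base template with semantic placeholders (names are illustrative)
CYLINDER_SAMPLE_TMPL = Template("""\
<PowderSample name="$name">
  <Shape><cylinder radius="$radius" height="$height"/></Shape>
  <Phase type="crystal"><ChemicalFormula>$formula</ChemicalFormula></Phase>
</PowderSample>
""")

# Filling the template for a vanadium cylinder
xml = CYLINDER_SAMPLE_TMPL.substitute(
    name="V", formula="V", radius="5.*mm", height="50.*mm")
print(xml)
```

Keeping placeholders semantic ($formula, $radius) rather than positional makes it easy to catalog which slots each sample configuration actually needs.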

Step 2: Build a Synthetic Data Generator

# Pseudocode for your generator
from itertools import product

instruments = ["ARCS", "SEQUOIA", "SNAP"]   # example instrument names
samples = ["vanadium", "water", "phonon powder", "single crystal spinwave"]
physics_params = {
    "temperature": [10, 100, 300],    # K
    ...
}

dataset = []

# Note: iterate over the *values* of physics_params, not the dict itself
for combination in product(instruments, samples, *physics_params.values()):
    # Generate English description
    description = generate_natural_language(combination)

    # Fill XML template based on combination
    xml = fill_template(combination)

    # Save (instruction, output) pair
    dataset.append({"instruction": description, "output": xml})
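A minimal runnable version of that loop, with toy stand-ins for the description and templating steps (generate_natural_language and fill_template here are placeholders, not real implementations):

```python
from itertools import product

instruments = ["ARCS", "SNAP"]        # hypothetical instrument names
samples = ["vanadium", "water"]
temperatures = [10, 100, 300]         # K

def generate_natural_language(instr, sample, temp):
    # Toy stand-in for a real paraphrasing/description step
    return f"Simulate a {sample} sample on {instr} at {temp} K."

def fill_template(instr, sample, temp):
    # Toy stand-in for real sampleassembly.xml templating
    return f'<sample name="{sample}" temperature="{temp}*K"/>'

dataset = [
    {"instruction": generate_natural_language(i, s, t),
     "output": fill_template(i, s, t)}
    for i, s, t in product(instruments, samples, temperatures)
]
print(len(dataset))   # 2 instruments * 2 samples * 3 temperatures = 12 pairs
```

In practice each combination would also get several paraphrased descriptions, so the pair count grows well beyond the raw parameter grid.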

Step 3: Add Complexity Gradually

  • Start with single-component descriptions
  • Progress to multi-component ("I want a SNAP configuration with a vanadium sample at 100K and high resolution")
  • Include edge cases and common user mistakes
  • Add negative examples (descriptions that should fail)
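The curriculum above might translate into dataset entries like these (contents illustrative):

```python
curriculum = [
    # Stage 1: single-component description
    {"instruction": "A vanadium powder sample at 300 K.", "stage": 1},
    # Stage 2: multi-component description
    {"instruction": "I want a SNAP configuration with a vanadium sample "
                    "at 100K and high resolution", "stage": 2},
    # Negative example: the model should refuse, not guess
    {"instruction": "A sample made of element 140 at -50 K.",
     "stage": 3, "output": "ERROR: unphysical parameters"},
]
print([entry["stage"] for entry in curriculum])   # [1, 2, 3]
```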

Target Dataset Size: 10,000-50,000 examples (quality over quantity)


Phase 2: Model Selection & Fine-tuning Strategy

Base Model Choice:

  • CodeLlama-7B or 13B (strong code generation, permissive license)
  • DeepSeek-Coder (another strong option)
  • Mistral-7B (general purpose, good instruction following)

Fine-tuning Approach:

# Use QLoRA for efficiency (train on 1-2 GPUs)
# Key configuration:
- 4-bit quantization
- LoRA rank = 64
- Target modules: q_proj, v_proj, k_proj, o_proj
- Batch size: Start small and scale
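Assuming the HuggingFace peft + bitsandbytes stack, the configuration above corresponds roughly to the following sketch (hyperparameters are the ones listed above; lora_alpha and dropout are common defaults, not prescribed values):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter: rank 64, attention projection targets
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,                 # common default; tune as needed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```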

Training Framework:

  • Use Axolotl or HuggingFace TRL (SFTTrainer)
  • These handle the complexity of QLoRA, data formatting, and checkpointing

Phase 3: Validation & Evaluation (Critical)

This separates a toy project from a serious portfolio piece.

Automated Validation Pipeline:

def validate_xml(generated_xml, ground_truth_xml):
    checks = {
        "syntactic": can_parse(generated_xml),
        "semantic": compare_simulation_output(generated_xml, ground_truth_xml),
        "physics": check_energy_conservation(generated_xml),
        "completeness": all_required_components_present(generated_xml)
    }
    return checks
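The syntactic and completeness checks in the pipeline above can be implemented with the standard library alone; the required-tag set here is a hypothetical placeholder, and the semantic/physics checks would need actual MCViNE runs:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimum set of elements a sample assembly must contain
REQUIRED_TAGS = {"SampleAssembly", "LocalGeometer"}

def can_parse(xml_text):
    """Syntactic check: does the XML parse at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def all_required_components_present(xml_text):
    """Completeness check: are all required elements present?"""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    tags = {el.tag for el in root.iter()}   # iter() includes the root
    return REQUIRED_TAGS <= tags

good = "<SampleAssembly><LocalGeometer/></SampleAssembly>"
print(can_parse(good), all_required_components_present(good))   # True True
print(can_parse("<oops>"))                                      # False
```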

Test Set Design:

  • 20% hold-out from synthetic data
  • 50 hand-written examples from your expertise
  • 20 "challenge" cases that require reasoning

Phase 4: Demonstration (The Portfolio Piece)

Create a simple interface that showcases the work:

# Gradio app example
import gradio as gr

def generate_instrument(description):
    # `model` and `run_mcvine_check` are placeholders for the
    # fine-tuned model wrapper and a quick MCViNE validation pass
    xml = model.generate(description)
    validation = run_mcvine_check(xml)
    return xml, validation

gr.Interface(
    fn=generate_instrument,
    inputs="text",
    outputs=["code", "json"],
    title="MCViNE Natural Language Interface",
    description="Describe your neutron experiment in plain English"
).launch()

Timeline Estimate (Part-time)

| Phase | Duration | Key Deliverable |
| --- | --- | --- |
| Data Pipeline | 2-3 weeks | Generator script + 10k examples |
| Training Setup | 1 week | Working fine-tuning environment |
| Initial Training | 1 week | First model checkpoint |
| Evaluation & Iteration | 2 weeks | Validation metrics + improvements |
| Demo & Documentation | 1 week | Gradio app + GitHub repo |
| Total | 7-8 weeks | Complete project portfolio |