Core Objective: Fine-tune a CodeLlama (or similar) model to translate plain-English descriptions of neutron experiments into valid MCViNE instrument Python files and XML files for sample assemblies.
First step: focus on sample XMLs, since they are harder to write manually.
For each "sample assembly" directory (e.g. sampleassembly-examples/Al-E_Q-kernel), the generator should produce multiple related files in that same directory, not just one XML. At minimum this typically includes:
- a top-level sampleassembly.xml
- one or more material/atom coordinate files (e.g. Al.xyz)
- one or more scatterer definition XML files referenced by sampleassembly.xml (e.g. Al-scatterer.xml)
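A quick completeness check for a generated directory can be sketched as follows; the file names come from the Al-E_Q-kernel example above, and `assembly_is_complete` is an illustrative helper, not part of MCViNE:

```python
from pathlib import Path

def assembly_is_complete(directory, material="Al"):
    """Return the list of expected files missing from a sample-assembly
    directory (empty list means the directory is complete)."""
    expected = [
        "sampleassembly.xml",         # top-level assembly description
        f"{material}.xyz",            # atom coordinate file
        f"{material}-scatterer.xml",  # scatterer definition XML
    ]
    d = Path(directory)
    return [name for name in expected if not (d / name).exists()]
```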
Success Criteria:
- Generate syntactically valid XML (parses correctly)
- Generate semantically correct XML (produces physically meaningful simulations)
- Achieve >80% acceptance rate on a held-out test set (compared to human-written XML)
Step 1: Create a Parameterized XML Template Library
- Catalog common sample configurations
- Identify the key XML blocks
- Create base templates with semantic placeholders
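One lightweight way to implement such a library is Python's `string.Template`. The XML element and attribute names below are placeholders, NOT the real MCViNE schema; real templates should be copied verbatim from known-good sample-assembly files:

```python
from string import Template

# Sketch of one entry in the template library, with semantic
# placeholders ($name, $temperature) to be filled per configuration.
VANADIUM_TEMPLATE = Template("""\
<sampleassembly name="$name">
  <sample material="vanadium" temperature="$temperature*K"/>
</sampleassembly>
""")

def render(name, temperature):
    """Fill a template's semantic placeholders with concrete values."""
    return VANADIUM_TEMPLATE.substitute(name=name, temperature=temperature)
```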
Step 2: Build a Synthetic Data Generator
```python
# Pseudocode for the generator
from itertools import product

samples = ["vanadium", "water", "phonon powder", "single crystal spinwave"]
physics_params = {
    "temperature": [10, 100, 300],  # K
    ...
}

for combination in product(instruments, samples, physics_params):
    # Generate an English description of this configuration
    description = generate_natural_language(combination)
    # Fill the XML template for this combination
    xml = fill_template(combination)
    # Save the (instruction, output) pair
    dataset.append({"instruction": description, "output": xml})
```

Step 3: Add Complexity Gradually
- Start with single-component descriptions
- Progress to multi-component ("I want a SNAP configuration with a vanadium sample at 100K and high resolution")
- Include edge cases and common user mistakes
- Add negative examples (descriptions that should fail)
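The `generate_natural_language` step can start as simple template-based paraphrasing. A minimal sketch, where the phrasing templates and field names are illustrative and many more variations would be added in practice:

```python
import random

# Phrasing templates with semantic slots; varying the surface form
# keeps the model from overfitting to one sentence pattern.
PHRASINGS = [
    "I want a {instrument} configuration with a {sample} sample at {temperature}K",
    "Simulate {sample} on {instrument} at a temperature of {temperature} K",
    "Set up {instrument} for a {sample} measurement, T = {temperature} K",
]

def generate_natural_language(instrument, sample, temperature, rng=random):
    """Render one configuration as a natural-language instruction."""
    template = rng.choice(PHRASINGS)
    return template.format(instrument=instrument, sample=sample,
                           temperature=temperature)
```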
Target Dataset Size: 10,000-50,000 examples (quality over quantity)
Base Model Choice:
- CodeLlama-7B or 13B (strong code generation, permissive license)
- DeepSeek-Coder (another strong option)
- Mistral-7B (general purpose, good instruction following)
Fine-tuning Approach:
Use QLoRA for efficiency (trainable on 1-2 GPUs). Key configuration:
- 4-bit quantization
- LoRA rank = 64
- Target modules: q_proj, v_proj, k_proj, o_proj
- Batch size: start small and scale up

Training Framework:
- Use Axolotl or HuggingFace TRL (SFTTrainer)
- These handle the complexity of QLoRA, data formatting, and checkpointing
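With HuggingFace PEFT and bitsandbytes, the configuration above maps roughly to the following sketch; hyperparameters not listed in the plan (quant type, compute dtype, lora_alpha) are common choices, not requirements:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization for QLoRA; nf4/bfloat16 are typical defaults
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapter matching the configuration listed above
lora_config = LoraConfig(
    r=64,                      # LoRA rank from the plan
    lora_alpha=16,             # assumption: common choice, tune as needed
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Both objects are then passed to the model loader and the TRL `SFTTrainer` respectively.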
Rigorous evaluation is what separates a toy project from a serious portfolio piece.
Automated Validation Pipeline:
```python
def validate_xml(generated_xml, ground_truth_xml):
    """Run all automated checks; each helper below must be implemented."""
    checks = {
        "syntactic": can_parse(generated_xml),
        "semantic": compare_simulation_output(generated_xml, ground_truth_xml),
        "physics": check_energy_conservation(generated_xml),
        "completeness": all_required_components_present(generated_xml),
    }
    return checks
```

Test Set Design:
- 20% hold-out from synthetic data
- 50 hand-written examples from your expertise
- 20 "challenge" cases that require reasoning
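The syntactic check in the validation pipeline is the easiest to make concrete; a minimal `can_parse` using only the standard library:

```python
import xml.etree.ElementTree as ET

def can_parse(xml_text):
    """Return True if the generated XML is well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```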
Create a simple interface that showcases the work:
```python
# Gradio app example
import gradio as gr

def generate_instrument(description):
    xml = model.generate(description)   # fine-tuned model from above
    validation = run_mcvine_check(xml)  # quick validation pass
    return xml, validation

gr.Interface(
    fn=generate_instrument,
    inputs="text",
    outputs=["code", "json"],
    title="MCViNE Natural Language Interface",
    description="Describe your neutron experiment in plain English",
).launch()
```

| Phase | Duration | Key Deliverable |
|---|---|---|
| Data Pipeline | 2-3 weeks | Generator script + 10k examples |
| Training Setup | 1 week | Working fine-tuning environment |
| Initial Training | 1 week | First model checkpoint |
| Evaluation & Iteration | 2 weeks | Validation metrics + improvements |
| Demo & Documentation | 1 week | Gradio app + GitHub repo |
| Total | 7-8 weeks | Complete project portfolio |