This project implements a fine-tuned CodeT5 model for if-statement condition prediction in Python code. The model learns to predict the missing condition in an if statement by training on Python code examples.
- Python 3.9+
- Required packages: torch, transformers, datasets, evaluate, pandas, numpy, sacrebleu, tqdm, scikit-learn (`ast` is also used, but it ships with the Python standard library and needs no installation)
# Clone repository
git clone https://github.com/BenNormann/if-predictor-CodeT5
cd if-predictor-CodeT5
# Install dependencies
pip install -r requirements.txt
# Create the directory for the raw dataset
mkdir -p data/raw

The dataset files are too large to be included in this repository. Please download them separately:
- Download the `datarawif.zip` or `datarawif.7z` file from https://drive.google.com/drive/folders/1_4ouIZYNFAlSmozBxOlgoqmPOW4sX5Qk?usp=sharing
- Place the file in the project's `data/raw` directory
- Extract the archive so that the dataset is saved as `data/raw/datarawif.jsonl`
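Before preprocessing, it can help to confirm that the extracted file is in place and readable. The following is a minimal sketch, not part of the repository: the helper name `count_jsonl_records` is hypothetical, and it only assumes that the dataset is a JSON Lines file with one JSON object per line.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Return (valid, invalid) line counts for a JSON Lines file."""
    valid, invalid = 0, 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                invalid += 1
    return valid, invalid


if __name__ == "__main__":
    dataset = Path("data/raw/datarawif.jsonl")
    if dataset.exists():
        ok, bad = count_jsonl_records(dataset)
        print(f"{ok} valid records, {bad} malformed lines")
    else:
        print(f"{dataset} not found; check the extraction step")
```

A non-zero count of malformed lines usually points to an incomplete download or extraction.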
Before training the model, you need to preprocess the raw data to extract if-statements and create the training examples:
# Create directory structure for processed data
mkdir -p data/processed
# Run preprocessing script
python src/data_processing/preprocess.py

This preprocessing will:
- Parse Python code from the raw dataset
- Extract if-statements and their conditions
- Create masked versions by replacing if conditions with `<mask>` tokens
- Generate input-output pairs for training
- Split the data into train, validation, and test sets (80%/10%/10%)
- Save the processed datasets as CSV files in the `data/processed` directory
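The extraction, masking, and splitting steps above can be sketched with the standard `ast` module. This is an illustration under stated assumptions, not the code in `preprocess.py`: the function names `extract_masked_examples` and `split_examples` are hypothetical, multi-line conditions are skipped to keep the example short, and the fixed shuffle seed is an arbitrary choice.

```python
import ast
import random

MASK = "<mask>"


def extract_masked_examples(source):
    """Return (masked_source, condition) pairs, one per if/elif statement."""
    tree = ast.parse(source)
    lines = source.split("\n")
    examples = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.If):
            continue
        test = node.test
        if test.lineno != test.end_lineno:
            continue  # simplification: skip conditions spanning several lines
        condition = ast.get_source_segment(source, test)
        if condition is None:
            continue
        row = test.lineno - 1
        # ast column offsets are UTF-8 byte offsets, so slice the encoded line
        raw = lines[row].encode("utf-8")
        masked_line = (
            raw[: test.col_offset].decode("utf-8")
            + MASK
            + raw[test.end_col_offset:].decode("utf-8")
        )
        masked = "\n".join(lines[:row] + [masked_line] + lines[row + 1:])
        examples.append((masked, condition))
    return examples


def split_examples(examples, seed=42):
    """Shuffle and split into 80% train, 10% validation, 10% test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    code = "def f(x):\n    if x > 0:\n        return 1\n    return 0\n"
    for masked, condition in extract_masked_examples(code):
        print(masked)
        print("target:", condition)
```

Each pair holds the model input (the code with one condition replaced by `<mask>`) and the target (the original condition text).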
To train the model with default settings:
python src/train.py

The default settings will:
- Use the processed data in the default location
- Save the model to the default output directory (saved_models)
- Use the Salesforce/codet5-small model
- Train for 5 epochs with the default hyperparameters (batch size 8, learning rate 5e-5, weight decay 0.01)
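One way to picture these defaults is as a small configuration object. The sketch below is hypothetical and is not the contents of `config.py`: the class name `TrainingDefaults` is invented for illustration, the values come from the defaults documented in this README, and `data/processed` as the default data location is an assumption based on the preprocessing output directory.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainingDefaults:
    """Illustrative defaults; the real config.py may be organized differently."""

    data_dir: str = "data/processed"  # assumption: preprocessing output directory
    output_dir: str = "saved_models"
    model_name: str = "Salesforce/codet5-small"
    num_train_epochs: int = 5
    batch_size: int = 8
    eval_batch_size: int = 8
    learning_rate: float = 5e-5
    weight_decay: float = 0.01


if __name__ == "__main__":
    defaults = TrainingDefaults()
    # Overriding a single value leaves the other defaults untouched
    longer_run = replace(defaults, num_train_epochs=10)
    for name, value in asdict(longer_run).items():
        print(f"--{name} {value}")
```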
If you need to customize training, the following parameters are available (see config.py for details):
--data_dir PATH # Path to the processed data directory
--output_dir PATH # Directory to save the trained model and results
--model_name STRING # HuggingFace model name (default: Salesforce/codet5-small)
--num_train_epochs INT # Number of training epochs (default: 5)
--batch_size INT # Training batch size (default: 8)
--eval_batch_size INT # Evaluation batch size (default: 8)
--learning_rate FLOAT # Learning rate (default: 5e-5)
--weight_decay FLOAT # Weight decay (default: 0.01)

After training, the following results will be saved to the specified output directory:
- The trained model checkpoint in the `best_model` directory
- Evaluation metrics in the `results.json` file, including:
  - BLEU scores
  - CodeBLEU scores
  - Exact match accuracy
  - Prediction samples
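Of the metrics listed above, exact match accuracy is the simplest to reason about: it is the fraction of predicted conditions that equal the reference condition. The sketch below shows one plausible way to compute it. The function names are hypothetical, and collapsing whitespace before comparing is an assumption; the project's evaluation script may compare strings differently.

```python
def normalize(condition):
    """Collapse runs of whitespace so formatting differences do not count as errors."""
    return " ".join(condition.split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


if __name__ == "__main__":
    preds = ["x > 0", "a  and b", "y is None"]
    refs = ["x > 0", "a and b", "y is not None"]
    print(f"exact match: {exact_match_accuracy(preds, refs):.3f}")
```

Exact match is strict: a prediction such as `0 < x` scores zero against `x > 0` even though the two are logically equivalent, which is one reason to report BLEU and CodeBLEU alongside it.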
You can evaluate the model on the test set using:
python src/evaluation/evaluate.py --model_dir saved_models/best_model

Higher BLEU scores indicate better model performance.