If-Statement Condition Predictor with CodeT5

1. Introduction
2. Getting Started
3. Running Options
4. Results

1. Introduction

This project implements a fine-tuned CodeT5 model for if-statement condition prediction in Python code. The model learns to predict the missing condition in an if statement by training on Python code examples.

2. Getting Started

Prerequisites

Python 3.9+
Required packages: torch, transformers, datasets, evaluate, pandas, numpy, sacrebleu, ast, tqdm, scikit-learn

Installation

# Clone repository
git clone https://github.com/BenNormann/if-predictor-CodeT5
cd if-predictor-CodeT5

# Install dependencies
pip install -r requirements.txt

# Extract dataset
mkdir -p data/raw
# get from _____

Data Download

The dataset files are too large to be included in this repository. Please download them separately:

Download the datarawif.zip or datarawif.7z file from https://drive.google.com/drive/folders/1_4ouIZYNFAlSmozBxOlgoqmPOW4sX5Qk?usp=sharing
Place the file in the project data/raw directory
Extract the dataset should and name as data/raw/datarawif.jsonl.

3. Running Options

Preprocessing

Before training the model, you need to preprocess the raw data to extract if-statements and create the training examples:

# Create directory structure for processed data
mkdir -p data/processed

# Run preprocessing script
python src/data_processing/preprocess.py

This preprocessing will:

Parse Python code from the raw dataset
Extract if-statements and their conditions
Create masked versions by replacing if conditions with <mask> tokens
Generate input-output pairs for training
Split the data into train, validation, and test sets (80%/10%/10%)
Save the processed datasets as CSV files in the data/processed directory

Training

To train the model with default settings:

python src/train.py

The default settings will:

Use the processed data in the default location
Save the model to the default output directory (saved_models)
Use the Salesforce/codet5-small model
Train for 5 epochs with standard hyperparameters

Advanced Options

If you need to customize training, the following parameters are available (look at config.py):

--data_dir PATH           # Path to the processed data directory
--output_dir PATH         # Directory to save the trained model and results
--model_name STRING       # HuggingFace model name (default: Salesforce/codet5-small)
--num_train_epochs INT    # Number of training epochs (default: 5)
--batch_size INT          # Training batch size (default: 8)
--eval_batch_size INT     # Evaluation batch size (default: 8)
--learning_rate FLOAT     # Learning rate (default: 5e-5)
--weight_decay FLOAT      # Weight decay (default: 0.01)

4. Results

After training, the following results will be saved to the specified output directory:

The trained model checkpoint in the best_model directory
Evaluation metrics in the results.json file including:
- BLEU scores
- CodeBLEU scores
- Exact match accuracy
- Prediction samples

You can evaluate the model on the test set using:

python src/evaluation/evaluate.py --model_dir saved_models/best_model

Higher BLEU scores indicate better model performance.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data/processed		data/processed
saved_models		saved_models
src		src
.gitignore		.gitignore
Project Description.pdf		Project Description.pdf
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

If-Statement Condition Predictor with CodeT5

1. Introduction

2. Getting Started

Prerequisites

Installation

Data Download

3. Running Options

Preprocessing

Training

Advanced Options

4. Results

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

If-Statement Condition Predictor with CodeT5

1. Introduction

2. Getting Started

Prerequisites

Installation

Data Download

3. Running Options

Preprocessing

Training

Advanced Options

4. Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages