This project implements an N-gram language model for Java code completion. The model predicts the next token in code by learning probability distributions from training data.
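
As a rough illustration of the idea, the sketch below counts n-grams over token sequences and predicts the most likely next token using add-k smoothing. The class name `NGramModel`, its methods, and the choice of add-k smoothing are assumptions made for illustration; the project's actual implementation under `src/` may differ.

```python
from collections import Counter, defaultdict


class NGramModel:
    """Toy n-gram model with add-k smoothing (illustrative, not the project's code)."""

    def __init__(self, n=3, k=0.01):
        if n < 2:
            raise ValueError("n must be at least 2")
        self.n = n
        self.k = k
        # Maps a context tuple of n-1 tokens to counts of the token that followed it.
        self.context_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sequences):
        for tokens in sequences:
            self.vocab.update(tokens)
            # Slide a window of n tokens: the first n-1 are the context.
            for i in range(len(tokens) - self.n + 1):
                context = tuple(tokens[i:i + self.n - 1])
                self.context_counts[context][tokens[i + self.n - 1]] += 1

    def prob(self, context, token):
        # Add-k smoothing: every vocabulary token gets k pseudo-counts.
        counts = self.context_counts.get(tuple(context), Counter())
        total = sum(counts.values())
        return (counts[token] + self.k) / (total + self.k * len(self.vocab))

    def predict(self, context):
        # Only the last n-1 tokens of the context matter.
        context = tuple(context[-(self.n - 1):])
        return max(sorted(self.vocab), key=lambda t: self.prob(context, t))


model = NGramModel(n=3, k=0.01)
model.train([
    ["int", "i", "=", "0", ";"],
    ["int", "j", "=", "0", ";"],
])
print(model.predict(["=", "0"]))  # prints ";"
```

With a small `k`, observed continuations dominate the prediction, while unseen tokens still receive a non-zero probability.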
- Python 3.9+
- Required packages: pandas, pygments, scikit-learn, tqdm
# Clone repository
git clone https://github.com/BenNormann/code-n-gram
cd code-n-gram
# Install dependencies
pip install -r requirements.txt

Option 1: Using a CSV file with Java methods
python src/main.py --processed_file data/processed_methods.csv

Option 2: Using a text file (one method per line)
python src/main.py --training_txt data/training.txt --output_dir results_teacher

For faster experimentation, use the --quick flag to train on a smaller sample:
python src/main.py --processed_file data/methods.csv --quick

This will:
- Use a random sample of 1000 methods for model selection
- Still train the final model on the full dataset
- Still evaluate on the full test set
- Significantly reduce the time spent on model selection, since only the final model is trained on the full dataset
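
The workflow in the list above can be sketched roughly as follows. The function `select_and_train` and its `train_fn` and `score_fn` callables are hypothetical placeholders, not the project's API; the only details taken from the text are the 1000-method sample, the range of n-gram sizes, and the final fit on the full dataset.

```python
import random


def select_and_train(methods, train_fn, score_fn, min_n=2, max_n=8,
                     quick=False, sample_size=1000, seed=0):
    """Pick the best n on a (possibly sampled) subset, then retrain on all data.

    train_fn(methods, n) -> model and score_fn(model) -> perplexity are
    hypothetical callables supplied by the caller.
    """
    selection_set = methods
    if quick and len(methods) > sample_size:
        # Quick mode: model selection only sees a random sample of the methods.
        selection_set = random.Random(seed).sample(methods, sample_size)

    scores = {}
    for n in range(min_n, max_n + 1):
        scores[n] = score_fn(train_fn(selection_set, n))

    best_n = min(scores, key=scores.get)  # lower perplexity is better
    # The final model is always trained on the full dataset.
    return train_fn(methods, best_n), best_n, scores
```

Sampling only affects which n-gram size is chosen; the returned model is always fit on every method.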
--output_dir ./results # Output directory (default: ./results)
--min_n 2 # Minimum n-gram size (default: 2)
--max_n 8 # Maximum n-gram size (default: 8)
--smoothing_k 0.01 # Smoothing parameter (default: 0.01)

python src/main.py --processed_file data/methods.csv --quick --min_n 3 --max_n 6 --smoothing_k 0.05

Results are saved to the specified output directory:
- metrics.json: Final model metrics (perplexity, accuracy, vocabulary size)
- metrics_compare.json: Metrics for all tested n-gram sizes
- best_model.pkl: Serialized best model
Lower perplexity values and higher accuracy indicate better model performance.
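
For reference, perplexity is the exponential of the average negative log-probability the model assigns to each true next token. Accuracy is assumed here to mean the fraction of positions where the top prediction matches the actual token; the project's exact definition may differ. The function names and input shapes below are illustrative assumptions.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    if not token_probs:
        raise ValueError("need at least one probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def accuracy(predicted, actual):
    """Fraction of positions where the predicted token equals the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

A model that assigns probability 0.25 to every true token has a perplexity of 4, meaning it is as uncertain as a uniform choice among four tokens.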