In this collaborative coding project, we aim to develop accurate stroke prediction models using machine learning techniques. Our dataset encompasses various essential features, including age, gender, BMI, average glucose level, work type, and smoking status. To achieve reliable predictions, we perform data preprocessing, outlier detection, feature selection, and model training. Through this project, we showcase the practical application of machine learning techniques in stroke prediction, providing valuable insights for early detection and prevention. Our ultimate goal is to accurately identify individuals at risk, contributing to improved healthcare outcomes.
First you need to clone the repository:
git clone https://github.com/ain2002-project/ain2002-project
cd ain2002-project
The code for this project was developed and tested on Python 3.7.12. We have added a .python-version file to the repository to ensure that the correct version of Python is used. We recommend using pyenv to manage your Python versions. If Python 3.7.12 is already installed on your system, you can skip the following step.
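
If you are unsure which interpreter is active, a quick check along these lines can help. This is only a minimal sketch and not part of the repository: the helper names `parse_version` and `version_matches` are hypothetical, and the only fact taken from this README is the version number 3.7.12.

```python
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '3.7.12' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def version_matches(current: Tuple[int, ...], required: Tuple[int, ...]) -> bool:
    """Compare only as many components as `required` specifies."""
    return tuple(current[: len(required)]) == tuple(required)


if __name__ == "__main__":
    required = parse_version("3.7.12")  # version stated in this README
    current = tuple(sys.version_info[:3])
    if version_matches(current, required):
        print("Python version OK")
    else:
        print("Expected Python {}, found {}".format(
            ".".join(map(str, required)), ".".join(map(str, current))))
```

Passing a shorter tuple such as `(3, 7)` as `required` relaxes the check to the minor version only.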
Install Python 3.7.12 using pyenv:
pyenv install 3.7.12
We recommend creating a virtual environment so that the packages installed for this project do not interfere with the packages installed on your system. To create and activate a virtual environment, run the following commands in the root directory of the repository:
python -m venv .venv
source .venv/bin/activate
To install the required packages, run the following command in the root directory of the repository:
pip install -r requirements.txt
You can also run the notebook on Kaggle. The notebook is available here.
To run the code on Kaggle, you need to add the competition dataset by Kaggle (playground-series-s3e2) and the dataset by fedesoriano (fedesoriano/stroke-prediction-dataset) in the data section of the right panel. We have already added these datasets to our Kaggle notebook.
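
Because the same code may run on Kaggle or locally, the dataset paths differ between the two environments. The sketch below shows one way to resolve them. It is not taken from the repository: the function name `resolve_data_dirs` is hypothetical, it assumes Kaggle mounts attached datasets under /kaggle/input/ using the dataset slug as the folder name, and it assumes the local layout produced by the unzip commands in this README (data and data/playground-series-s3e2).

```python
from pathlib import Path
from typing import Tuple

# Assumption: Kaggle notebooks expose attached datasets under this directory.
KAGGLE_INPUT = Path("/kaggle/input")


def resolve_data_dirs(
    kaggle_input: Path = KAGGLE_INPUT,
    local_root: Path = Path("data"),
) -> Tuple[Path, Path]:
    """Return (competition_dir, original_dataset_dir) for Kaggle or local runs."""
    if kaggle_input.is_dir():
        # Running on Kaggle: each dataset sits in a folder named after its slug.
        return (
            kaggle_input / "playground-series-s3e2",
            kaggle_input / "stroke-prediction-dataset",
        )
    # Running locally: follow the unzip targets used in this README.
    return (local_root / "playground-series-s3e2", local_root)


if __name__ == "__main__":
    competition_dir, original_dir = resolve_data_dirs()
    print("competition data:", competition_dir)
    print("original dataset:", original_dir)
```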
If you want to work with the data locally, you can download the datasets from Kaggle by hand or use these commands (this requires Kaggle authentication):
kaggle competitions download -c playground-series-s3e2
kaggle datasets download -d fedesoriano/stroke-prediction-dataset
Then unzip them:
unzip stroke-prediction-dataset.zip -d data
unzip playground-series-s3e2.zip -d data/playground-series-s3e2
You can run the notebook on any Jupyter server (VS Code, JupyterLab, the jupyter notebook command, etc.). If you are using a local environment, you can start the notebook server by running the following command in the root directory of the repository:
jupyter notebook
You can also run the code as Python scripts. They are essentially the same as the notebook, but with less output and no plots. To train the models, run the following command in the root directory of the repository:
python train.py
This will train and save 3 models that can be used for evaluation and inference.
The evaluation script computes evaluation metrics on the validation dataset and runs inference on the competition dataset. You can run it with:
python evaluate.py
If you want to see the submission score on Kaggle, you can run the following command to upload the submission file:
kaggle competitions submit -c playground-series-s3e2 -f submission.csv -m "Message"
If everything goes well, you should get a private score of 0.89624.
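
For intuition about what this score measures: to our knowledge, this competition ranks submissions by the area under the ROC curve (AUC), but treat that as an assumption and confirm it on the competition page. The sketch below is not the code in evaluate.py. It is a small, dependency-free illustration of the rank-based AUC formula, and the function name `roc_auc` is hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: the probability that a random positive example
    receives a higher score than a random negative one (ties count half)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the group of tied scores starting at index i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # The group occupies 1-based ranks i+1 .. j; give each the average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    # Mann-Whitney U statistic for the positives, normalised to [0, 1].
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy data, not project results.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.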
The trained models are saved in the models folder. We have also shared the models folder in a GitHub release.