Ischemic Heart Disease Risk Prediction
Logistic Regression on Synthetic Population Data
Project Overview
This project simulates a real-world cardiovascular risk modeling pipeline using synthetic population data. We generate a structured dataset representing demographic and clinical features, then train a Logistic Regression model to predict the probability of Ischemic Heart Disease (IHD).
The goal is to demonstrate:
- Synthetic data generation
- Probabilistic risk modeling
- Model evaluation using ROC-AUC
- Model serialization for deployment
Problem Statement
Predict the probability of Ischemic Heart Disease (IHD) based on:
- Age
- Sex
- BMI
- Systolic Blood Pressure
- Diabetes status
- Smoking status
- Weekly physical activity

This mimics traditional epidemiological risk scoring systems.
Project Architecture

    Synthetic Population Generator
                ↓
    Feature Engineering
                ↓
    Train/Test Split (75/25)
                ↓
    Logistic Regression Model
                ↓
    ROC-AUC Evaluation
                ↓
    Model Serialization (.pkl)
Dataset Generation
A synthetic dataset of 15,000 individuals is generated using probabilistic modeling.
Feature Distribution Design
| Feature | Distribution |
|---|---|
| Age | Uniform (25–80) |
| Sex | Bernoulli (p=0.52 male) |
| BMI | Normal (μ=25, σ=4) |
| SBP | Normal (μ=128, σ=15) |
| Diabetes | Bernoulli (p=0.10) |
| Smoking | Bernoulli (p=0.20) |
| Weekly Exercise | Exponential (scale=90) |
Risk probability is computed using a linear logit model:

    logit(p) = intercept + Σ_i (X_i · β_i)

The final outcome for each individual is then sampled from a Bernoulli distribution with probability p.
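The generation step can be sketched as follows. This is a minimal illustration, not the contents of risk_model.py: the column names, the intercept, and every coefficient are assumed values picked only so the example runs, and the units of weekly exercise are not stated in this README.

```python
import numpy as np

def generate_population(n=15_000, seed=42):
    """Sample features from the distributions listed above, then sample IHD outcomes."""
    rng = np.random.default_rng(seed)
    features = {
        "age": rng.uniform(25, 80, n),         # Uniform (25-80)
        "sex_male": rng.binomial(1, 0.52, n),  # Bernoulli (p=0.52 male)
        "bmi": rng.normal(25, 4, n),           # Normal (mu=25, sigma=4)
        "sbp": rng.normal(128, 15, n),         # Normal (mu=128, sigma=15)
        "diabetes": rng.binomial(1, 0.10, n),  # Bernoulli (p=0.10)
        "smoker": rng.binomial(1, 0.20, n),    # Bernoulli (p=0.20)
        "exercise": rng.exponential(90, n),    # Exponential (scale=90)
    }

    # ASSUMPTION: intercept and coefficients are hypothetical, for illustration only.
    intercept = -9.0
    betas = {"age": 0.06, "sex_male": 0.5, "bmi": 0.05, "sbp": 0.02,
             "diabetes": 0.8, "smoker": 0.7, "exercise": -0.003}

    # logit(p) = intercept + sum_i (X_i * beta_i)
    logit = intercept + sum(betas[name] * features[name] for name in betas)
    p = 1.0 / (1.0 + np.exp(-logit))  # inverse logit (sigmoid)

    # Final outcome: one Bernoulli draw per individual with probability p
    features["ihd"] = rng.binomial(1, p)
    return features

population = generate_population()
print(f"Simulated IHD prevalence: {population['ihd'].mean():.3f}")
```

Changing the assumed intercept shifts the overall prevalence, while the assumed coefficients control how strongly each feature separates cases from non-cases.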
Model
- Algorithm: Logistic Regression
- Max Iterations: 2000
- Train/Test Split: 75% / 25%
- Evaluation Metric: ROC-AUC

Logistic Regression was chosen because it:

- Has interpretable coefficients
- Is suitable for binary risk modeling
- Is common in clinical risk prediction
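A minimal training sketch with these settings might look like the following. The stand-in data (three features with made-up coefficients) and the `random_state` values are assumptions added so the example is self-contained and reproducible; only the 75/25 split, `max_iter=2000`, and the ROC-AUC metric come from the configuration above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data (ASSUMED): age, systolic blood pressure, smoking status
rng = np.random.default_rng(0)
n = 4000
X = np.column_stack([
    rng.uniform(25, 80, n),
    rng.normal(128, 15, n),
    rng.binomial(1, 0.20, n),
])
true_logit = -8.0 + 0.06 * X[:, 0] + 0.02 * X[:, 1] + 0.7 * X[:, 2]  # hypothetical
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# 75% / 25% train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Logistic regression with the iteration cap listed above
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# ROC-AUC is computed from predicted probabilities of the positive class
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"AUC: {auc:.2f}")

# Coefficients are on the log-odds scale, one per feature column
print(dict(zip(["age", "sbp", "smoker"], model.coef_[0].round(3))))
```

Exponentiating a coefficient gives the odds ratio for a one-unit increase in that feature, which is what makes the model easy to interpret.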
Evaluation
Model performance is measured using the ROC-AUC score.

Example Output:

    AUC: 0.86
    Model saved!
AUC provides a threshold-independent measure of classification quality.
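To make the threshold-independence concrete: ROC-AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as half). The sketch below computes it that way in plain Python on a toy example; `pairwise_auc` is a hypothetical helper written for illustration, not part of this project.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)

# Any order-preserving transform of the scores leaves the AUC unchanged,
# because only the ranking matters, not the values or a cut-off.
cubed = [s ** 3 for s in scores]
auc_cubed = pairwise_auc(labels, cubed)

print(auc, auc_cubed)  # 0.75 0.75
```

Three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75 regardless of where a decision threshold would be placed.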
Model Persistence
The trained model is serialized using pickle and saved as:

    model/logreg_model.pkl

This allows deployment in a future API or clinical decision support tool.
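A save-and-reload round trip can be sketched as below. The toy training data is an assumption, and the file is written under a temporary directory (rather than the real model/ folder) so the example leaves nothing behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (ASSUMED) so there is a fitted model to serialize
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
model = LogisticRegression(max_iter=2000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model" / "logreg_model.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)

    # Serialize the fitted estimator
    with path.open("wb") as f:
        pickle.dump(model, f)

    # Reload it, as a serving process would
    with path.open("rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print("Round trip OK:", same)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled, and the loading environment should use a compatible scikit-learn version.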
Project Structure

    Heart_Disease_Risk/
    │
    ├── data/
    │   └── synthetic_population.csv
    │
    ├── model/
    │   └── logreg_model.pkl
    │
    ├── risk_model.py
    └── README.md
Installation

    pip install numpy pandas scikit-learn

Run the Project

    python risk_model.py
This will:

- Generate the synthetic dataset
- Train the logistic regression model
- Print the ROC-AUC
- Save the trained model
Key Learnings
- Generating structured synthetic medical datasets
- Probabilistic modeling using logistic regression
- Model evaluation with ROC curves
- Binary classification in a healthcare context
- Saving trained models for deployment
Real-World Applications
- Cardiovascular risk screening
- Preventive healthcare analytics
- Insurance risk scoring
- Clinical decision support systems
Future Improvements
- Add feature scaling
- Implement cross-validation
- Add SHAP for interpretability
- Build a Streamlit dashboard
- Deploy as a REST API
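As a rough sketch of how feature scaling and cross-validation could be combined, the example below wraps a scaler and the classifier in a single scikit-learn pipeline and scores it with 5-fold ROC-AUC. The stand-in data, the choice of `StandardScaler`, and the fold count are assumptions, not decisions already made in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (ASSUMED): age, systolic blood pressure, smoking status
rng = np.random.default_rng(0)
n = 4000
X = np.column_stack([
    rng.uniform(25, 80, n),
    rng.normal(128, 15, n),
    rng.binomial(1, 0.20, n),
])
true_logit = -8.0 + 0.06 * X[:, 0] + 0.02 * X[:, 1] + 0.7 * X[:, 2]  # hypothetical
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# Putting the scaler inside the pipeline means it is re-fit on each
# training fold, so no statistics leak from the held-out fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# 5-fold cross-validated ROC-AUC
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"ROC-AUC per fold: {scores.round(3)}")
print(f"Mean ROC-AUC: {scores.mean():.3f}")
```

Reporting the mean and spread across folds gives a more stable estimate than a single 75/25 split.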
Cardiovascular Risk Prediction using Logistic Regression
Built a synthetic population dataset (15k samples) and trained a logistic regression model to predict ischemic heart disease risk, reaching a ROC-AUC of about 0.86 in the example run and saving the model for deployment.