--- language: en license: mit library_name: scikit-learn pipeline_tag: tabular-classification tags: - scikit-learn - tabular-classification - machine-learning - healthcare - heart-disease - cardiovascular - binary-classification - random-forest metrics: - accuracy - precision - recall - f1 model-index: - name: Heart Disease Prediction (RandomForestClassifier) results: - task: type: tabular-classification name: Tabular Classification dataset: name: heart-disease.csv (included in project repo) type: csv metrics: - name: Test Accuracy type: accuracy value: 0.87 - name: 5-fold CV Accuracy (mean) type: accuracy value: 0.8479781421 - name: 5-fold CV Precision (mean) type: precision value: 0.8215873016 - name: 5-fold CV Recall (mean) type: recall value: 0.9272727273 - name: 5-fold CV F1 (mean) type: f1 value: 0.8705403543 --- # ❤️ Heart Disease Prediction (scikit-learn Random Forest) A classic machine learning model that predicts the likelihood of **heart disease** from structured patient medical attributes (tabular data). This repository contains a **joblib-serialized scikit-learn model** trained and evaluated in an end-to-end Jupyter Notebook workflow. ## Model Details - **Developed by:** brej-29 - **Model type:** `RandomForestClassifier` (scikit-learn) - **Task:** Binary classification (tabular) - **Output labels:** - `0` → No heart disease - `1` → Heart disease present - **Saved artifact:** `heart_disease_model.joblib` - **Training notebook:** `HeartDiseasePredictionProject.ipynb` - **Source code / project repo:** https://github.com/brej-29/Logicmojo-AIML-Assignments-heart-disease-prediction-ml - **License:** MIT ## Intended Use ### Direct Use - Educational / portfolio demonstration of an end-to-end ML pipeline: - EDA → modeling → hyperparameter tuning → evaluation → persistence - Research prototyping and experimentation with classical ML on healthcare-like tabular data ### Out-of-Scope Use (Important) - **Not for clinical diagnosis** - **Not a medical device** - **Not validated for real-world patient care** - Do not use this model as the sole basis for medical decisions. ## Training Data The model was trained on a tabular dataset included in the project repository as `heart-disease.csv`. - **Rows:** 303 - **Columns:** 14 - **Features:** 13 - **Target:** 1 (`target`) - **Target distribution:** - `1`: 165 - `0`: 138 ### Features (Input Schema) The model expects **13 columns**: | Feature | Description | |---|---| | `age` | Age in years | | `sex` | Sex (commonly encoded as 1 = male, 0 = female) | | `cp` | Chest pain type (categorical encoded as integers) | | `trestbps` | Resting blood pressure | | `chol` | Serum cholesterol | | `fbs` | Fasting blood sugar (binary) | | `restecg` | Resting ECG results (categorical encoded as integers) | | `thalach` | Maximum heart rate achieved | | `exang` | Exercise-induced angina (binary) | | `oldpeak` | ST depression induced by exercise relative to rest | | `slope` | Slope of peak exercise ST segment (categorical encoded as integers) | | `ca` | Number of major vessels (categorical encoded as integers) | | `thal` | Thalassemia category (categorical encoded as integers) | ## Training Procedure ### Data Split - `train_test_split(test_size=0.2)` - Randomness controlled via `np.random.seed(42)` in the notebook ### Candidate Models Explored - Logistic Regression - K-Nearest Neighbors (KNN) - Random Forest ### Hyperparameter Tuning - `RandomizedSearchCV` used to tune Random Forest - `cv=5`, `n_iter=20` - Best Random Forest hyperparameters found in the notebook: - `n_estimators=210` - `max_depth=3` - `min_samples_split=4` - `min_samples_leaf=19` ### Final Model The saved model (`heart_disease_model.joblib`) corresponds to: - `RandomForestClassifier(n_estimators=210, max_depth=3, min_samples_split=4, min_samples_leaf=19)` ## Evaluation ### Baseline Test Accuracy (single 80/20 split) - KNN: ~0.689 - Logistic Regression: ~0.885 - Random Forest: ~0.836 ### Final Model Performance - **Loaded saved model test accuracy:** **0.87** ### Cross-Validated Metrics (5-fold mean) — Final Random Forest - **Accuracy:** 0.8479781421 - **Precision:** 0.8215873016 - **Recall:** 0.9272727273 - **F1:** 0.8705403543 Note: The notebook also visualizes confusion matrices and ROC curves for model comparison. ## How to Use ### 1) Install dependencies - `pip install scikit-learn joblib pandas numpy huggingface_hub` ### 2) Load the model from Hugging Face Hub ```python from huggingface_hub import hf_hub_download import joblib import pandas as pd # Replace with your HF repo id, e.g. "brej-29/heart-disease-prediction-rf" repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO" model_path = hf_hub_download( repo_id=repo_id, filename="heart_disease_model.joblib" ) model = joblib.load(model_path) # Example input (values are placeholders; use correctly-encoded values) sample = pd.DataFrame([{ "age": 57, "sex": 1, "cp": 0, "trestbps": 120, "chol": 354, "fbs": 0, "restecg": 1, "thalach": 163, "exang": 1, "oldpeak": 0.6, "slope": 2, "ca": 0, "thal": 2 }]) pred = model.predict(sample)[0] proba = model.predict_proba(sample)[0, 1] # probability of class "1" print("Prediction:", int(pred)) print("P(heart disease):", float(proba)) ``` ## Input Requirements - Provide all 13 feature columns - Ensure categorical features (`cp`, `restecg`, `slope`, `ca`, `thal`) follow the same integer encoding as used in training - Numeric types should be valid numbers (`int`/`float`); no missing values ## Bias, Risks, and Limitations - Small dataset (303 rows) → results may not generalize to broader populations - Encoding-dependent: categorical values must match training conventions - No clinical validation: metrics are from offline evaluation only - False negatives are possible (missed risk) — do not use for medical screening without rigorous validation ## Environmental Impact Training and evaluation were performed using classical ML methods on a small tabular dataset and are expected to have minimal compute and carbon impact (CPU-only, short runtime). ## Technical Specifications - Framework: scikit-learn - Model format: joblib (`heart_disease_model.joblib`) - Inference type: CPU-friendly tabular prediction ## Model Card Authors - BrejBala ## Contact For questions/feedback, please open an issue on the GitHub repository: https://github.com/brej-29/Logicmojo-AIML-Assignments-heart-disease-prediction-ml