Machine Learning Projects

Featured Projects

Insurance Premium Predictor

  • This project compares six regression models (Linear Regression, Lasso Regression, Ridge Regression, Random Forest, XGBoost, and LightGBM) for predicting insurance premium prices.
  • The models were tuned with hyperparameter search, benchmarked under cross-validation using the R² score, and examined with feature importance analysis to identify the key drivers of predicted premiums.
  • These insights not only help in choosing the most reliable model but also enable stakeholders to make data-driven pricing decisions and strategic business improvements.
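
As an illustration of benchmarking with R² under cross-validation, here is a minimal sketch in plain Python. It is not the notebook's code: the single-feature least-squares model, the function names (`r2_score`, `fit_line`, `cross_val_r2`), the interleaved fold scheme, and the toy age/premium data are all assumptions made for this example.

```python
from statistics import mean

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def cross_val_r2(xs, ys, k=3):
    """Mean R² over k folds; fold i holds out every k-th sample starting at i."""
    pairs = list(zip(xs, ys))
    scores = []
    for i in range(k):
        test_idx = set(range(i, len(pairs), k))
        train = [p for j, p in enumerate(pairs) if j not in test_idx]
        test = [p for j, p in enumerate(pairs) if j in test_idx]
        slope, intercept = fit_line(*zip(*train))
        preds = [slope * x + intercept for x, _ in test]
        scores.append(r2_score([y for _, y in test], preds))
    return mean(scores)

# Toy data (made up for illustration): age vs. premium, roughly linear.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60]
premiums = [210, 248, 305, 349, 398, 455, 502, 548, 601]
cv_r2 = cross_val_r2(ages, premiums, k=3)
print(f"mean cross-validated R2: {cv_r2:.3f}")
```

Averaging R² over held-out folds, rather than scoring on the training data, is what makes the comparison between models a fair estimate of generalization.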

View Notebook | Streamlit App

Delivery Time Prediction

  • Built an end-to-end machine learning regression model to predict delivery time using 175k+ real-world orders, covering data cleaning, feature engineering, modeling, and evaluation.
  • Performed deep EDA and statistical validation (Kruskal-Wallis, Dunn post-hoc, Chi-square tests) to uncover time-of-day, market, and protocol-based delivery patterns.
  • Engineered impactful features from timestamps (delivery sessions, weekday/weekend) and handled outliers using the IQR method to improve model stability.
  • Trained and compared Linear Regression, Random Forest, XGBoost, LightGBM, and Neural Networks, selecting LightGBM because it generalized best to held-out data.
  • Achieved production-ready performance with RMSE ≈ 1.27 minutes, MAE ≈ 1.14 minutes and R² ≈ 0.97.
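
The timestamp features and IQR-based outlier handling described above can be sketched roughly as follows. This is a hedged illustration only: the session boundaries, the function names (`time_features`, `iqr_bounds`, `drop_outliers`), the 1.5 multiplier, and the sample durations are assumptions for this example, not values taken from the project.

```python
from datetime import datetime
from statistics import quantiles

def time_features(ts):
    """Derive a coarse delivery session and a weekend flag from a timestamp."""
    hour = ts.hour
    if 6 <= hour < 12:
        session = "morning"
    elif 12 <= hour < 17:
        session = "afternoon"
    elif 17 <= hour < 22:
        session = "evening"
    else:
        session = "night"
    # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
    return {"session": session, "is_weekend": ts.weekday() >= 5}

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# Toy delivery durations in minutes (made up); 240 is an obvious outlier.
durations = [28, 31, 33, 35, 36, 38, 40, 42, 45, 240]
cleaned = drop_outliers(durations)
features = time_features(datetime(2024, 1, 6, 18, 30))  # a Saturday evening
print(cleaned, features)
```

Whether to drop or cap values outside the fences is a design choice; dropping is shown here only because it is the shortest to write.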

View Notebook

Scaler Clustering

  • Built an end-to-end unsupervised machine learning pipeline to segment employees based on compensation patterns using KMeans++ clustering, enabling data-driven workforce and salary insights.
  • Performed extensive data cleaning, outlier handling, and feature engineering on large-scale salary data (~200K records), ensuring model stability and realistic cluster formation.
  • Raised the Silhouette Score to 0.81 by encoding job information at the individual job-position level instead of using broad job categories.
  • Identified and labeled three actionable employee segments — Entry / Low CTC, Mid-level Professionals, and High Earners / Leaders — providing clear business interpretation of compensation structures.
  • Designed the solution with production readiness in mind, including scalable preprocessing, clustering inference logic, and a clear path for Flask-based deployment for real-world usage.
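
For readers unfamiliar with the Silhouette Score quoted above, the following is a minimal plain-Python sketch of how it is computed. It assumes one-dimensional points with absolute distance, and the function name and toy salary figures are made up for illustration; the project itself presumably used a library implementation on multi-feature data.

```python
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points using absolute distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean([abs(p - q) for q in members])
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Toy compensation values (made up) forming three well-separated groups.
ctc = [4, 5, 6, 14, 15, 16, 38, 40, 42]
good_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
bad_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2]
good = silhouette_score(ctc, good_labels)
bad = silhouette_score(ctc, bad_labels)
print(f"well separated: {good:.2f}, scrambled: {bad:.2f}")
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 indicate overlapping or misassigned points.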

View Notebook

OLA Drivers Churn Prediction

  • Developed an end-to-end ML churn prediction pipeline including data cleaning, EDA, feature engineering, SMOTE class balancing, and robust model evaluation.
  • Trained and compared multiple classification models, including Logistic Regression, Random Forest, SVM, GBDT, XGBoost, and LightGBM, using cross-validation and hyperparameter tuning (GridSearchCV and RandomizedSearchCV).
  • Selected Gradient Boosting (GBDT) as the final model, achieving Recall = 0.935 and F1-score = 0.886; recall was prioritized so that fewer churning drivers go undetected.
  • Applied SHAP explainability to interpret model predictions and identify key churn drivers such as low quarterly ratings, low business value, and driver grade.
  • Delivered actionable business recommendations, including performance-based incentives, rating system improvements, and city-level retention strategies to reduce driver attrition.
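
To make the recall and F1 figures above concrete, here is a small sketch of how these metrics follow from true and predicted labels. The function name and the toy labels are assumptions for illustration and are not the project's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (churn) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy labels (made up): 1 = driver churned, 0 = driver stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall measures the share of actual churners the model catches, which is why it is the metric to favor when a missed churner costs more than a false alarm. Because F1 is the harmonic mean of the two, the reported Recall of 0.935 and F1 of 0.886 would correspond to a precision of roughly 0.84.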

View Notebook