Customer Churn Prediction - Banking Industry

🚀 Executive Summary

Problem

The bank was grappling with high customer churn, particularly in high-value customer segments. Despite a growing customer base, the bank struggled to predict which customers were at risk of leaving.

Action

Developed machine learning models (Random Forest with SMOTE) and built a Streamlit web app for real-time, interactive predictions of at-risk customers.

Result

Achieved 84.67% accuracy, identified high-risk segments, and estimated that targeting high-balance customers could save €1.5 million in annual revenue.

Try the Prediction App

🎯 Problem Statement

Context

Customer churn in banking severely impacts profitability, especially with high-value customers. Without clear prediction methods, retention efforts were inefficient and misdirected.

Core Issue

The bank lacked an effective way to predict which customers were at risk, so retention resources went to customers unlikely to leave while valuable high-risk customers were missed.

Key Questions

Which customer segments have the highest churn rates?
What are the primary factors driving churn?
How can the bank prioritize retention for high-value customers?

📈 Objectives & Key Metrics

Objective	Metric Tracked	Result Achieved
Identify churn predictors	Accuracy, AUC-ROC	Achieved 84.67% accuracy with Random Forest
Prioritize retention for high-risk customers	Churn rate, false positives	Reduced false positives to 244, improving retention targeting
Improve model precision	F1-Score	Achieved an F1-Score of 0.8969 after model calibration

📂 Data Overview

Data Sources

The dataset consists of 10,000 rows representing bank customers, sourced from Kaggle.

Key Variables

1. Credit Score: The customer's credit score
2. Balance: The account balance of the customer
3. Number of Products: Number of banking products held
4. Exited: Target variable indicating churn (1) or not (0)

Data Challenges

Class Imbalance: Churned customers represented only 20% of the dataset. SMOTE was applied to generate synthetic samples of the minority class.
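To make the resampling step concrete, here is a minimal pure-Python sketch of the interpolation idea behind SMOTE: each synthetic point lies on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is a simplified illustration, not the implementation used in this project (packages such as imbalanced-learn provide a full version). The function name `smote_sketch`, the sample values, and `k=2` are assumptions chosen for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical, already-scaled (credit score, balance) pairs for churned customers
churned = [(0.2, 1.1), (0.4, 0.9), (0.3, 1.4), (0.1, 1.0)]
new_points = smote_sketch(churned, n_new=3)
print(new_points)
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class proportions.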

[Figure: Data Distribution]

[Figure: Correlation Matrix]

🔧 Methodology

Data Cleaning

Encoding: Categorical variables converted with one-hot encoding
Standardization: Continuous features scaled to zero mean and unit variance using StandardScaler
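The two preprocessing steps above (one-hot encoding of categorical variables and scaling with StandardScaler) can be sketched in plain Python as follows. This is only an illustration of what the transformations compute, not the project's code; the helper names and the four sample rows are hypothetical.

```python
from statistics import fmean, pstdev


def one_hot(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation,
    which is also what scikit-learn's StandardScaler uses)."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical rows: a geography column and a balance column for four customers
geography = ["France", "Germany", "Spain", "France"]
balance = [0.0, 125000.0, 80000.0, 195000.0]

columns, geo_encoded = one_hot(geography)
balance_scaled = standardize(balance)
print(columns)         # ['France', 'Germany', 'Spain']
print(geo_encoded[1])  # [0, 1, 0]
print(balance_scaled)
```

In a real pipeline the mean and standard deviation would be learned from the training data and reused unchanged on new customers.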

Analysis Techniques

Random Forest Classifier with GridSearchCV for hyperparameter tuning
SMOTE for class imbalance correction

Tools

Python with scikit-learn (GridSearchCV) and SMOTE (available in the imbalanced-learn package)
Streamlit for interactive web app
Power BI for visualization

Model Development Timeline

Data Collection & Cleaning

Gathered 10,000 customer records, then encoded categorical variables and scaled continuous features

Exploratory Analysis

Identified key patterns and correlations in the data

Model Selection

Evaluated multiple algorithms and selected Random Forest

Hyperparameter Tuning

Optimized parameters using GridSearchCV

Deployment

Built Streamlit app for real-time predictions

📊 Model Selection and Evaluation

Model Performance Comparison

Model	Accuracy	Precision	Recall	F1-Score
Logistic Regression	70.21%	68%	66.7%	-
Decision Tree	75.50%	74%	72.5%	-
Random Forest	84.30%	82.4%	87.5%	85.0%
XGBoost	84.79%	83.5%	88.2%	85.7%
ANN	79.76%	-	75.5%	-
KNN	75.55%	-	-	73.4%

Final Model Selection

Based on this evaluation, Random Forest was selected for its balanced performance: 84.30% accuracy, 87.5% recall, and 82.4% precision, close behind XGBoost on every reported metric.
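As a quick consistency check on the comparison table, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below uses the rounded percentages from the table, so small differences from the reported F1 values are expected. The `f1_score` helper is written here for illustration and is not taken from the project's notebook.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the comparison table above
reported = {
    "Logistic Regression": (0.680, 0.667),
    "Decision Tree": (0.740, 0.725),
    "Random Forest": (0.824, 0.875),
    "XGBoost": (0.835, 0.882),
}

for model, (precision, recall) in reported.items():
    print(f"{model}: F1 = {f1_score(precision, recall):.1%}")
```

With these inputs the result is roughly 84.9% for Random Forest and 85.8% for XGBoost, within rounding distance of the reported 85.0% and 85.7%. The two rows whose F1 is listed as "-" work out to about 67.3% and 73.2%.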

Hyperparameter Tuning

Using GridSearchCV, the following optimal parameters were selected:

n_estimators: 300
min_samples_split: 2
min_samples_leaf: 1
max_depth: None
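For context on what an exhaustive grid search evaluates, the sketch below expands a parameter grid into its individual candidates and counts the resulting model fits. Only the four selected values come from this write-up. The other grid values and the fold count are assumptions, since the original search space is not stated, and `expand_grid` is a hypothetical helper rather than a scikit-learn function.

```python
from itertools import product

# Hypothetical search grid: the source reports only the winning values,
# so every other candidate value here is an assumption for illustration.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}


def expand_grid(grid):
    """Yield one dict per parameter combination, as an exhaustive grid search would."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


candidates = list(expand_grid(param_grid))
cv_folds = 5  # assumed fold count; not stated in the source
total_fits = len(candidates) * cv_folds

# The combination reported as optimal above
reported_best = {"n_estimators": 300, "max_depth": None,
                 "min_samples_split": 2, "min_samples_leaf": 1}
print(len(candidates), total_fits)   # 36 180
print(reported_best in candidates)   # True
```

The count grows multiplicatively with each added parameter value, which is why grids are usually kept small when each fit trains hundreds of trees.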

Model Performance

After tuning, the final model achieved 84.67% accuracy and an F1-score of 0.8969, balancing the trade-off between precision and recall.
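The precision/recall trade-off mentioned here can be illustrated by sweeping the decision threshold applied to predicted churn probabilities. The labels and scores below are invented for the example and do not come from the project's data; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, scores, threshold):
    """Precision, recall and false-positive count when scores >= threshold
    are flagged as churn."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
    fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
    fn = sum((not p) and t == 1 for p, t in zip(predicted, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp


# Hypothetical labels (1 = churned) and predicted churn probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
scores = [0.9, 0.45, 0.8, 0.4, 0.3, 0.75, 0.65, 0.2, 0.55, 0.35]

for threshold in (0.3, 0.5, 0.7):
    p, r, fp = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} false_positives={fp}")
```

On this toy data, raising the threshold from 0.3 to 0.7 cuts false positives from 4 to 1 while recall falls from 1.00 to 0.40, which is the kind of trade-off a retention team has to weigh.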

💡 Key Insights

High Churn in Middle-Aged Customers

What: Middle-aged customers (45–64 years) showed the highest churn rates.

So What: The bank risks losing a significant portion of its customer base if proactive measures aren't taken.

Account Balance is a Strong Predictor

What: Customers with higher account balances (>€200K) exhibited the highest churn likelihood.

So What: Focusing on high-balance customers is crucial for retention, potentially saving €1.5 million annually.

Geography Matters

What: France shows a higher churn rate compared to Germany and Spain.

So What: Tailored retention strategies specific to France can help reduce churn in this important market.

✅ Recommendations & Business Impact

1. Launch Retention Campaign for High-Balance Customers

Prioritize retention efforts for customers with balances above €200K by offering loyalty rewards and personalized services.

Potential Value: 10% churn reduction could save €1.5 million annually

2. Focus on Middle-Aged Customers (45–64) with Tailored Incentives

Target this demographic with personalized incentives like lower fees and exclusive investment offers.

Potential Value: 5% reduction in churn could save €500,000 annually

3. Region-Specific Retention Campaigns for French Customers

Implement France-focused retention campaigns with region-specific offers.

Potential Value: 8% churn reduction in France could save substantial revenue

4. Refine Model Calibration to Reduce False Positives

Optimize the model to reduce the 244 false positives, improving resource allocation.

Potential Value: Save 15% in retention-related costs

5. Implement Real-Time Churn Prediction for Proactive Retention

Monitor customer behaviors and predict churn in real time to enable timely interventions.

Potential Value: 5% reduction in churn could save €500,000 annually
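The "Potential Value" figures above pair a relative churn reduction with an annual saving. Under the simplifying assumption that savings scale linearly with the reduction in churn, each pair implies a total annual revenue currently lost to churn in that segment. The sketch below makes that arithmetic explicit. The linear model and the `implied_revenue_at_risk` helper are assumptions for illustration, and the implied totals are derived here rather than stated in the analysis.

```python
def implied_revenue_at_risk(savings_eur, churn_reduction):
    """Annual revenue lost to churn that the quoted savings figure implies,
    assuming savings scale linearly with the reduction in churn."""
    return savings_eur / churn_reduction


# (quoted annual savings in EUR, quoted relative churn reduction)
quoted = {
    "High-balance campaign": (1_500_000, 0.10),
    "Middle-aged incentives": (500_000, 0.05),
    "Real-time prediction": (500_000, 0.05),
}

for name, (savings, reduction) in quoted.items():
    at_risk = implied_revenue_at_risk(savings, reduction)
    print(f"{name}: implied revenue at risk = EUR {at_risk:,.0f}")
```

Read this way, the high-balance figure corresponds to about €15 million of annual revenue at risk, and each of the two 5% scenarios to about €10 million.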

Business Impact Summary

Priority	Recommendation	Expected Impact	Owner
High	Retention campaign for high-balance customers	Reduce churn by 10% in high-value segments	Marketing Team
Medium	Focus on middle-aged customers	Reduce churn by 5% in this demographic	Customer Success
High	Region-specific campaigns for France	Reduce churn by 8% in France	Marketing/Regional Teams
Low	Refine model calibration	Save 15% in retention costs	Data Science Team
High	Real-time churn prediction	Reduce churn by 5% through timely interventions	Technology & Operations

📉 Caveats & Next Steps

Caveats

Limited Features

Current model uses demographic and account data. Adding behavioral features could enhance accuracy.

Class Imbalance

SMOTE was applied, but the model may still be slightly biased toward the majority class.

External Factors

Model trained on historical data may not account for current market shifts.

Regional Differences

Geographic differences were considered but not deeply modeled.

Interpretability

Random Forests are powerful but harder to interpret than single decision trees or linear models.

Next Steps

1. Incorporate Behavioral Features

Add transaction frequency and customer interaction history.

2. Refine with Cost-Sensitive Learning

Use class weighting or a Balanced Random Forest to reduce bias toward the majority class.

3. Retrain the Model Regularly

Retrain with updated data quarterly to adapt to new trends.

4. Deploy Region-Specific Strategies

Implement tailored strategies for high-risk regions like France.

5. Improve Model Interpretability

Use SHAP or LIME for explainable insights.

6. Implement Real-Time Prediction

Integrate predictions into real-time CRM systems.
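For the cost-sensitive learning step, one common starting point is to weight each class inversely to its frequency. Note that this is a related technique, not the Balanced Random Forest named above. The sketch below computes such weights for the roughly 80/20 class split described under Data Challenges, using the same formula as scikit-learn's `class_weight="balanced"` option. The helper name is hypothetical and the label list is synthetic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), the
    heuristic behind scikit-learn's class_weight="balanced" option."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Synthetic labels matching an 80/20 split of 10,000 customers (1 = churned)
labels = [0] * 8000 + [1] * 2000
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

With these weights a misclassified churner counts four times as much as a misclassified non-churner during training, which pushes the model toward higher recall on the minority class.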

📚 Analysis Files & Notebooks

GitHub Repository

Access all source code, notebooks, and data files:

View on GitHub

Jupyter Notebook

View the complete analysis and model development in a single notebook:

View Jupyter Notebook

📊 Dashboards

Explore interactive dashboards created for customer churn analysis and business insights.

Churn Overview Dashboard

This dashboard provides a summary of churn rates, key segments, and overall business impact.

Retention Strategy Dashboard

This dashboard visualizes retention strategies, segment targeting, and predicted savings.