Customer Churn Prediction

Banking Industry Insights & Retention Strategy

Python, Streamlit, Power BI, SMOTE, Random Forest, Banking Analytics

🚀 Executive Summary

Problem

The bank faced high customer churn, particularly among high-value customers. Despite a growing customer base, it could not reliably predict which customers were at risk of leaving.

Action

Developed machine learning models (Random Forest with SMOTE) and built a Streamlit web app for real-time, interactive predictions of at-risk customers.

Result

Achieved 84.67% accuracy, identified high-risk segments, and estimated that retention efforts targeted at high-balance customers could save €1.5 million in annual revenue.

🎯 Problem Statement

Context

Customer churn in banking severely impacts profitability, especially with high-value customers. Without clear prediction methods, retention efforts were inefficient and misdirected.

Core Issue

The bank lacked an effective way to predict which customers were at risk, so retention resources were spent on customers unlikely to leave while valuable high-risk customers were missed.

Key Questions

  • Which customer segments have the highest churn rates?
  • What are the primary factors driving churn?
  • How can the bank prioritize retention for high-value customers?

📈 Objectives & Key Metrics

Objective | Metric Tracked | Result Achieved
Identify churn predictors | Accuracy, AUC-ROC | Achieved 84.67% accuracy with Random Forest
Prioritize retention for high-risk customers | Churn rate, false positives | Reduced false positives to 244, improving retention targeting
Improve model precision | F1-Score | Achieved an F1-Score of 0.8969 after model calibration

📂 Data Overview

Data Sources

The dataset consists of 10,000 rows representing bank customers, sourced from Kaggle.

Key Variables

  • Credit Score: The customer's credit score
  • Balance: The customer's account balance
  • Number of Products: Number of banking products held
  • Exited: Target variable indicating churn (1) or not (0)

Data Challenges

Class Imbalance: Churned customers represented only 20% of the dataset. SMOTE was applied to generate synthetic samples of the minority class.
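To make the idea behind SMOTE concrete, here is a minimal pure-Python sketch of its core step: a synthetic minority sample is placed at a random point on the line between a real minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the implementation used in the project; the function name `smote_sketch`, the value of `k`, and the sample points are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical, already-scaled (credit score, balance) pairs for churned customers.
churned = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.70), (0.30, 0.95), (0.25, 0.85)]
new_points = smote_sketch(churned, n_new=10)
print(len(new_points), new_points[0])
```

Because every synthetic point lies between two real minority samples, oversampling of this kind should be applied to the training split only, so that no synthetic data leaks into evaluation.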

Class Imbalance Visualization

Data Distribution

Correlation Matrix

🔧 Methodology

Data Cleaning

  • Encoding: Categorical variables converted with one-hot encoding
  • Standardization: Continuous features scaled with StandardScaler (zero mean, unit variance)
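The two preprocessing steps above can be sketched with scikit-learn as follows. The column choices and values are hypothetical placeholders rather than rows from the dataset, and the project's actual preprocessing code may be organized differently.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows: one categorical column (geography) and two continuous
# columns (credit score, balance).
geography = np.array([["France"], ["Germany"], ["Spain"], ["France"]])
numeric = np.array([
    [600.0, 0.0],
    [650.0, 50_000.0],
    [700.0, 120_000.0],
    [750.0, 210_000.0],
])

encoder = OneHotEncoder(handle_unknown="ignore")
scaler = StandardScaler()

# One 0/1 column per category.
geo_encoded = encoder.fit_transform(geography).toarray()
# Each continuous column rescaled to zero mean and unit variance.
numeric_scaled = scaler.fit_transform(numeric)

features = np.hstack([geo_encoded, numeric_scaled])
print(encoder.categories_[0], features.shape)
```

In a real workflow the encoder and scaler would be fitted on the training split only and then reused to transform validation and test data.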

Analysis Techniques

  • Random Forest Classifier with GridSearchCV for hyperparameter tuning
  • SMOTE for class imbalance correction

Tools

  • Python with scikit-learn (including GridSearchCV) and SMOTE oversampling
  • Streamlit for interactive web app
  • Power BI for visualization

Model Development Timeline

Data Collection & Cleaning

Gathered 10,000 customer records, one-hot encoded categorical variables, and scaled continuous features

Exploratory Analysis

Identified key patterns and correlations in the data

Model Selection

Evaluated multiple algorithms and selected Random Forest

Hyperparameter Tuning

Optimized parameters using GridSearchCV

Deployment

Built Streamlit app for real-time predictions

📊 Model Selection and Evaluation

Model Performance Comparison

Model | Accuracy | Precision | Recall | F1-Score
Logistic Regression | 70.21% | 68% | 66.7% | -
Decision Tree | 75.50% | 74% | 72.5% | -
Random Forest | 84.30% | 82.4% | 87.5% | 85.0%
XGBoost | 84.79% | 83.5% | 88.2% | 85.7%
ANN | 79.76% | - | 75.5% | -
KNN | 75.55% | - | - | 73.4%

Final Model Selection

Based on evaluation, Random Forest was selected for its balanced performance with 84.30% accuracy, excellent recall (87.5%), and good precision (82.4%).
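As a quick sanity check on the comparison table, F1 can be recomputed from precision and recall, since it is their harmonic mean. The sketch below uses only the precision and recall values reported above; the helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the comparison table.
reported = {
    "Random Forest": (0.824, 0.875),
    "XGBoost": (0.835, 0.882),
}

for model, (precision, recall) in reported.items():
    print(f"{model}: F1 = {f1_from(precision, recall):.3f}")
```

The recomputed values come out at about 0.849 and 0.858, within roughly 0.1 percentage point of the F1 scores in the table, which is consistent with rounding in the reported precision and recall.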

Hyperparameter Tuning

Using GridSearchCV, the following optimal parameters were selected:

  • n_estimators: 300
  • min_samples_split: 2
  • min_samples_leaf: 1
  • max_depth: None

Model Performance

The final model achieved 84.67% accuracy and an F1-score of 0.8969, balancing the trade-off between precision and recall.

💡 Key Insights

High Churn in Middle-Aged Customers

What: Middle-aged customers (45–64 years) showed the highest churn rates.

So What: The bank risks losing a significant portion of its customer base if proactive measures aren't taken.

Churn by Age
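Segment-level churn rates like the one above are simple group averages of the `Exited` flag. The sketch below shows the calculation on a handful of hypothetical records; the band boundaries other than 45-64 and all the data values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (age, exited) records; exited is 1 for churned customers.
customers = [(23, 0), (31, 0), (38, 1), (47, 1), (52, 1), (58, 0), (61, 1), (70, 0)]


def age_band(age):
    """Assign an age to a band; 45-64 matches the band named in the text."""
    if age < 45:
        return "under 45"
    if age <= 64:
        return "45-64"
    return "65+"


totals = defaultdict(lambda: [0, 0])  # band -> [churned, total]
for age, exited in customers:
    band = age_band(age)
    totals[band][0] += exited
    totals[band][1] += 1

churn_rate = {band: churned / total for band, (churned, total) in totals.items()}
for band, rate in sorted(churn_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{band}: {rate:.0%}")
```

The same pattern applies to the balance and geography segments by swapping the banding function.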

Account Balance is a Strong Predictor

What: Customers with higher account balances (>€200K) exhibited the highest churn likelihood.

So What: Focusing on high-balance customers is crucial for retention, potentially saving €1.5 million annually.

Churn by Balance
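The source does not show how the €1.5 million figure was derived. One plausible way to structure such an estimate is sketched below: count the churners avoided in a segment and multiply by the annual revenue each one represents. Every input is a hypothetical placeholder, and treating the churn reduction as a relative (not absolute) change is an assumption.

```python
def annual_savings(customers, churn_rate, relative_reduction, revenue_per_customer):
    """Revenue retained per year if churn in a segment falls by relative_reduction."""
    churners_before = customers * churn_rate
    churners_avoided = churners_before * relative_reduction
    return churners_avoided * revenue_per_customer


# All inputs below are hypothetical placeholders, not figures from the analysis.
saved = annual_savings(
    customers=1_000,
    churn_rate=0.30,
    relative_reduction=0.10,
    revenue_per_customer=5_000,
)
print(f"Estimated annual savings: €{saved:,.0f}")
```

Making the inputs explicit in this way also makes the estimate easy to stress-test, for example by varying the revenue per customer or the achievable reduction.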

Geography Matters

What: France shows a higher churn rate compared to Germany and Spain.

So What: Tailored retention strategies specific to France can help reduce churn in this important market.

Churn by Geography

✅ Recommendations & Business Impact

1. Launch Retention Campaign for High-Balance Customers

Prioritize retention efforts for customers with balances above €200K by offering loyalty rewards and personalized services.

Potential Value: 10% churn reduction could save €1.5 million annually

2. Focus on Middle-Aged Customers (45–64) with Tailored Incentives

Target this demographic with personalized incentives like lower fees and exclusive investment offers.

Potential Value: 5% reduction in churn could save €500,000 annually

3. Region-Specific Retention Campaigns for French Customers

Implement France-focused retention campaigns with region-specific offers.

Potential Value: 8% churn reduction in France could save substantial revenue

4. Refine Model Calibration to Reduce False Positives

Optimize the model to reduce the 244 false positives, improving resource allocation.

Potential Value: Save 15% in retention-related costs

5. Implement Real-Time Churn Prediction for Proactive Retention

Monitor customer behaviors and predict churn in real-time to enable timely interventions.

Potential Value: 5% reduction in churn could save €500,000 annually
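For recommendation 4, one common lever for reducing false positives is raising the probability threshold at which a customer is flagged as likely to churn. This is named here as an illustrative option, not as the calibration method the project used. The sketch below uses hypothetical labels and predicted probabilities to show the trade-off.

```python
def confusion_counts(y_true, churn_prob, threshold):
    """Count false positives and true positives at a given decision threshold."""
    fp = sum(1 for y, p in zip(y_true, churn_prob) if p >= threshold and y == 0)
    tp = sum(1 for y, p in zip(y_true, churn_prob) if p >= threshold and y == 1)
    return fp, tp


# Hypothetical labels and predicted churn probabilities for ten customers.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
churn_prob = [0.05, 0.20, 0.35, 0.55, 0.62, 0.48, 0.58, 0.71, 0.90, 0.45]

for threshold in (0.5, 0.6, 0.7):
    fp, tp = confusion_counts(y_true, churn_prob, threshold)
    print(f"threshold={threshold}: false positives={fp}, true positives={tp}")
```

In this toy data, raising the threshold from 0.5 to 0.7 removes both false positives but also drops one true churner, which is the cost that has to be weighed against the savings in retention spend.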

Business Impact Summary

Priority | Recommendation | Expected Impact | Owner
High | Retention campaign for high-balance customers | Reduce churn by 10% in high-value segments | Marketing Team
Medium | Focus on middle-aged customers | Reduce churn by 5% in this demographic | Customer Success
High | Region-specific campaigns for France | Reduce churn by 8% in France | Marketing/Regional Teams
Low | Refine model calibration | Save 15% in retention costs | Data Science Team
High | Real-time churn prediction | Reduce churn by 5% through timely interventions | Technology & Operations

📉 Caveats & Next Steps

Caveats

Limited Features

The current model uses only demographic and account data. Adding behavioral features could improve accuracy.

Class Imbalance

SMOTE was applied, but the model may still be slightly biased toward the majority class.

External Factors

Model trained on historical data may not account for current market shifts.

Regional Differences

Geographic differences were considered but not modeled in depth.

Interpretability

Random Forests are powerful but hard to interpret: they expose aggregate feature importances, yet individual predictions are difficult to explain.

Next Steps

1. Incorporate Behavioral Features

Add transaction frequency and customer interaction history.

2. Refine with Cost-Sensitive Learning

Use Balanced Random Forest to reduce bias toward majority class.

3. Retrain the Model Regularly

Retrain quarterly on updated data to adapt to new trends.

4. Deploy Region-Specific Strategies

Implement tailored strategies for high-risk regions like France.

5. Improve Model Interpretability

Use SHAP or LIME for explainable insights.

6. Implement Real-Time Prediction

Integrate predictions into real-time CRM systems.
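For the cost-sensitive refinement in step 2, a related option that needs only scikit-learn is class weighting inside the forest. The sketch below swaps in `class_weight="balanced_subsample"` as a stand-in for a dedicated Balanced Random Forest implementation; the data is synthetic and the settings are assumptions, not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data, with roughly 20% positives.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Reweight classes inversely to their frequency in each bootstrap sample,
# so that errors on the minority (churn) class cost more during training.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced_subsample", random_state=0
)
model.fit(X_train, y_train)

minority_recall = recall_score(y_test, model.predict(X_test))
print(f"Recall on the minority class: {minority_recall:.2f}")
```

Comparing this recall against an unweighted forest on the same split would show whether weighting actually reduces the bias toward the majority class for a given dataset.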

📚 Analysis Files & Notebooks

GitHub Repository

Access all source code, notebooks, and data files:

View on GitHub

Jupyter Notebook

View the complete analysis and model development in a single notebook:

View Jupyter Notebook

📊 Dashboards

Explore interactive dashboards created for customer churn analysis and business insights.

Churn Overview Dashboard

This dashboard provides a summary of churn rates, key segments, and overall business impact.

Retention Strategy Dashboard

This dashboard visualizes retention strategies, segment targeting, and predicted savings.