Over the last few quarters, Bank XYZ has seen a significant number of clients close their accounts or migrate to other institutions. As a result, quarterly revenues have taken a significant hit, annual revenue for the current fiscal year may follow, and the stock has fallen, reducing market cap by X%.
Objective: Can we construct a model that predicts consumers who will churn in the near future with reasonable accuracy?
Definition of churn: A customer who has closed all their active accounts with the bank is said to have churned. Churn can be defined in other ways as well, based on the context of the problem. For example, a customer who has not transacted for 6 months or 1 year can also be considered to have churned, depending on the business requirements.
This is a binary classification problem: given a customer's data, we need to predict whether that customer will churn or not.
Metric(s):
Data-related metrics: F1-score, Recall, Precision
Recall = TP/ (TP + FN)
Precision = TP/ (TP + FP)
F1-score = 2 × (Precision × Recall) / (Precision + Recall), i.e., the harmonic mean of Recall and Precision
where, TP = True Positive, FP = False Positive and FN = False Negative
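The three metrics above can be sketched with scikit-learn on a toy set of labels (the labels here are illustrative, not from the churn dataset):

```python
# Hedged sketch: Precision, Recall and F1 on a small toy example.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Confusion counts for this toy example: TP = 3, FN = 1, FP = 1, TN = 3
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```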
Dataset download link - https://s3.amazonaws.com/hackerday.datascience/360/Churn_Modelling.csv
Target column – Exited (0/1)
Steps performed
Univariate analysis
– Box plots and PDFs of the numerical features are analyzed
– Count plots for the categorical features
Bivariate analysis
– Correlation of each numerical feature with the target variable.
– Association of each categorical feature with the target variable.
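The bivariate checks above can be sketched in pandas. The frame below is tiny synthetic data; only the column names (`Age`, `Geography`, `Exited`) mirror the dataset:

```python
# Hedged sketch: numerical correlation and categorical churn rates
# against the target, on made-up rows.
import pandas as pd

df = pd.DataFrame({
    "Age":       [25, 40, 55, 30, 60, 45],
    "Geography": ["France", "Germany", "Germany", "Spain", "Germany", "France"],
    "Exited":    [0, 1, 1, 0, 1, 0],
})

# Numerical feature vs target: linear (Pearson) correlation
corr = df["Age"].corr(df["Exited"])

# Categorical feature vs target: churn rate per category
churn_rate = df.groupby("Geography")["Exited"].mean()

print(corr)
print(churn_rate)
```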
Missing Value and Outlier Treatment
As a rule of thumb, we can consider using:
Created some new features based on simple interactions between the existing features.
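The interaction features named in the shortlist below (bal per product, tenure age ratio) might be built like this; the exact formulas are assumptions, as the source only names the columns:

```python
# Hedged sketch: simple interaction features from existing columns.
import pandas as pd

df = pd.DataFrame({
    "Balance":       [100000.0, 0.0, 50000.0],
    "NumOfProducts": [2, 1, 1],
    "Tenure":        [5, 2, 8],
    "Age":           [40, 25, 50],
})

# Assumed definitions for the two engineered features
df["bal_per_product"] = df["Balance"] / df["NumOfProducts"]
df["tenure_age_ratio"] = df["Tenure"] / df["Age"]

print(df[["bal_per_product", "tenure_age_ratio"]])
```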
The linear association of the new columns with the target variable, used to judge their importance, can be seen in the heatmap below:
Features shortlisted through EDA/manual inspection and bivariate analysis:
Age, Gender, Balance, NumOfProducts, IsActiveMember, the 3 country/Geography variables, bal per product, tenure age ratio
Using RFE (Recursive Feature Elimination) to check whether it yields the same list of features, additional features, or a smaller set.
RFE using Logistic Regression Model – Important Features
RFE using Decision Tree Model – Important Features
Tree-based model performed best
Decision tree rule engine visualization
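One way to produce such a rule-engine view is scikit-learn's `export_text`, sketched here on synthetic data with placeholder feature names:

```python
# Hedged sketch: printing a fitted tree's decision rules as text.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# A shallow depth keeps the rule listing readable
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```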
Steps
Model Zoo: List of all models to compare/spot-check
Result of spot checking
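A spot-check loop over a model zoo might look like the sketch below, with synthetic imbalanced data standing in for the churn features; LightGBM's `LGBMClassifier` can be appended to `models` in the same way:

```python
# Hedged sketch: spot-checking several classifiers on cross-validated recall.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Imbalanced classes, roughly mirroring a churn setting
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=1)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "dtree":  DecisionTreeClassifier(random_state=1),
    "rf":     RandomForestClassifier(n_estimators=50, random_state=1),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="recall")
    print(f"{name}: recall = {scores.mean():.3f}")
```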
LightGBM is chosen for further hyperparameter tuning because it has the best recall and a close-second F1-score.
After obtaining the final LGBM model, we can analyze the errors in its predictions on the training dataset. This error analysis helps us check whether we made any incorrect assumptions, and supports data correction where required.
# Ideally, the two classes should separate at the 0.5 threshold
# All class-1 probabilities should be greater than 0.5
# Less overlap between the two distributions implies fewer errors
Revisiting bivariate plots of important features
The differences in the distribution of these features across the two classes help us test a few hypotheses.
Extracting the subset of incorrect predictions
All incorrect predictions are extracted and categorized into false positives (which hurt precision) and false negatives (which hurt recall).
We can inspect the predicted probabilities of the errors and tune the threshold to avoid them. For example, in the low-precision case, many error probabilities sit near 0.53 or 0.502, so raising the threshold to 0.55 would turn those into correct predictions and improve precision.
# Most predictions lie between 0.3 and 0.4 for the low-recall errors
# Here most probabilities are near 0.6, so shifting the threshold to 0.6 may improve precision
# Looking at both plots, we make most of our errors in the 0.4-0.6 region
# Simply tuning the threshold can probably yield better performance
# A threshold of 0.45 gives better results than 0.5
# Precision barely decreases, while recall and F1-score both increase relative to 0.5
# We should change the threshold only when we are confident the data distribution won't shift much; 0.5 is the safe default otherwise
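The threshold sweep described above can be sketched as follows; the probabilities are toy values chosen to illustrate the 0.45-vs-0.5 trade-off, not the model's actual outputs:

```python
# Hedged sketch: sweeping candidate thresholds over predicted
# probabilities and comparing precision/recall/F1 at each.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.42, 0.47, 0.53, 0.80, 0.55, 0.61, 0.30])

results = {}
for thr in (0.45, 0.50, 0.55):
    y_pred = (y_prob >= thr).astype(int)
    results[thr] = (precision_score(y_true, y_pred),
                    recall_score(y_true, y_pred),
                    f1_score(y_true, y_pred))
    print(thr, results[thr])
```

On this toy data the 0.45 cut-off recovers all positives at a small precision cost, mirroring the observation that 0.45 beat 0.5 in the notebook.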
Final model validation score
Here, we’ll use df_test as the unseen, future data
Test Score
Listing customers who have a churn probability higher than 70%. These are the ones who can be targeted immediately.
We found 124 customers with a churn probability above 70% who may leave the bank.
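The selection itself is a simple probability filter; the sketch below uses made-up rows, with only the `CustomerId` column name taken from the dataset (`churn_prob` is a hypothetical prediction column):

```python
# Hedged sketch: filtering customers whose predicted churn
# probability exceeds the 70% targeting threshold.
import pandas as pd

preds = pd.DataFrame({
    "CustomerId": [101, 102, 103, 104],
    "churn_prob": [0.91, 0.35, 0.72, 0.55],  # hypothetical model outputs
})

high_risk = preds[preds["churn_prob"] > 0.70]
print(len(high_risk), "customers above the 70% threshold")
print(high_risk["CustomerId"].tolist())
```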
Based on business requirements, a prioritization matrix can be defined, wherein certain segments of customers are targeted first. These segments can be defined based on insights from the data or the business teams' requirements. E.g., males who are active members, hold a credit card, and are from Germany can be prioritized first because the business potentially sees the maximum ROI from them.