Bank Churn

Detecting in advance that a customer is planning to leave, and reacting in time to convince them to stay, can result in a more satisfied customer base. It can also help you understand your customers and why they like or dislike your business. This dataset can help a banking institution reduce churn and offer more tailored products to its customers.

Dataset

This dataset contains 10,000 records, each of which corresponds to a different bank customer. The target is ExitedTask, a binary variable that describes whether the user decided to leave the bank. There are row and customer identifiers, four columns describing personal information about the user (surname, location, gender and age), and further columns describing the user's relationship with the bank (such as credit score, current account balance and whether they are an active member, among others).

Use Case

The objective is to train an ML model that returns the probability that a customer will churn. This is a binary classification task, so the F1-score is a good metric for evaluating a model's performance on this dataset: it is the harmonic mean of precision and recall and weights them equally, so a good classifier must achieve high precision and high recall simultaneously.
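As an illustration of how the F1-score combines precision and recall, the following minimal sketch thresholds churn probabilities and scores them against the true labels. The probabilities, labels, the 0.5 threshold and the function name are made up for this example and are not taken from the dataset.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 = churned."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical model output: one churn probability per customer.
churn_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.2, 0.1, 0.05]
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

# Assumed decision threshold; in practice it would be tuned on validation data.
y_pred = [int(p >= 0.5) for p in churn_prob]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a model that predicts "no churn" for every customer gets an F1-score of 0 even though its accuracy on this dataset would be close to 80%.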

Data problems and Synthesized solutions

Although this dataset can make a huge difference to the banking institution's performance, it has some problems that complicate its use. Luckily, Synthesized's data generation tools can solve these problems quickly and intuitively.

  • Privacy. This dataset contains personal information about users, making it difficult to work with and share. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
  • Imbalanced dataset. Only 2,037 of the 10,000 users (about 20%) decided to leave the bank in this dataset. This imbalance may heavily reduce the performance of the model if not treated carefully. With Synthesized's Data Manipulation tool we can adjust the output distribution of this column and generate a balanced dataset, which can improve final model performance. Read more about the benefits of data rebalancing.
  • Fairness and Biases. AI models can unintentionally (and potentially illegally) discriminate against certain sensitive groups of people if the underlying training data is biased. Synthesized can help assess how biased a dataset is, find where the biases are and flag them to the user. Read more about discrimination by AI.
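To make the class imbalance described above concrete, the sketch below rebalances a toy dataset with the same 2,037 / 7,963 split using naive random oversampling. This is a simple stand-in technique, not Synthesized's Data Manipulation tool, which generates new synthetic records instead of duplicating existing ones. The record layout, the "label" key and the function name are assumptions made for this example.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class is as large as the largest one (naive random oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy stand-in for the real data: 2,037 churned customers out of 10,000.
rows = [{"customer": i, "label": 1 if i < 2037 else 0} for i in range(10_000)]

balanced = oversample_minority(rows, "label")
counts = Counter(row["label"] for row in balanced)
print(counts[0], counts[1])
```

Any rebalancing of this kind should be applied to the training split only, so that the model is still evaluated on data with the original class distribution.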

References

This dataset is publicly available on Kaggle as "Predicting Churn for Bank Customers".

Download the dataset here