Default on Loan Payment

Accurately assessing the risk of a credit card product can drastically affect the financial outcome of any banking business. The decision-making process for each user and each credit line relies on historical data about user behaviour. The credit card issuer aims to maximise the number of open credit lines while keeping the number of defaulters as low as possible and placing higher-risk users on lower credit lines.

Dataset

This dataset consists of historical data for 30,000 users, covering April to September 2005. Its columns contain user details and payment history, among other information, and the target variable "default payment next month" flags users who did not pay the following month's statement. The explanatory variables describe the user and their current credit (gender, education, family status, and current credit limit) and the credit status over the previous six months (repayment status, bill amount, and payment amount).

Use Case

The objective is to train an ML model that assigns each user a probability of defaulting in the subsequent month. This is a binary classification task with an imbalanced target, so the F1-score is a good metric for evaluating model performance on this dataset: it is the harmonic mean of precision and recall, weights the two equally, and is high only when both are high.
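As a rough illustration of why F1 is preferred over plain accuracy here, the sketch below computes precision, recall and F1 by hand and applies them to a degenerate classifier that never predicts a default. The helper names (`precision_recall_f1`, `accuracy`) are illustrative and not part of any Synthesized API; the labels are generated only to match the class balance reported for this dataset, not read from the real data.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic labels with the reported balance: 6,636 defaulters out of 30,000.
y_true = [1] * 6636 + [0] * 23364

# A degenerate "model" that never predicts default.
always_negative = [0] * len(y_true)

print(accuracy(y_true, always_negative))             # 0.7788
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)
```

The always-negative classifier looks respectable on accuracy yet scores zero on F1, because it never identifies a single defaulter. In practice the model's predicted probabilities would first be turned into 0/1 labels with a decision threshold before F1 is computed.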

Data Problems and Synthesized Solutions

Although this dataset can make a huge difference to a credit business's performance, it has some problems that complicate its use. Luckily, Synthesized can solve these problems quickly and intuitively.

Privacy. This dataset contains personal information about users, which makes it difficult to work with and share. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
Imbalanced Dataset. Only 6,636 of the 30,000 users (22%) are defaulters. This imbalance may heavily reduce model performance if it is not treated carefully. With Synthesized's Data Manipulation tool we can adjust the output distribution of the target column and generate a balanced dataset, which can improve final model performance. Read more about the benefits of data rebalancing in our blog post.
Fairness and Biases. AI models can unintentionally (and potentially illegally) discriminate against certain sensitive groups of people if the underlying training data is biased. Synthesized can help assess how biased a dataset is, find where the biases are, and flag them to the user. Read more about discrimination by AI in our blog post.
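To make the class imbalance described above concrete, here is a minimal sketch of rebalancing by random oversampling of the minority class. This is a generic stand-in technique and not Synthesized's Data Manipulation tool, which generates new synthetic rows rather than duplicating existing ones. The function name `oversample_minority` and the `user_id` field are hypothetical; the toy rows only reproduce the reported class counts.

```python
import random
from collections import Counter

TARGET = "default payment next month"


def oversample_minority(rows, label_key, seed=0):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target_size = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the largest class size.
        balanced.extend(rng.choices(group, k=target_size - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy rows matching the reported balance: 6,636 defaulters out of 30,000 users.
rows = [{"user_id": i, TARGET: 1 if i < 6636 else 0} for i in range(30000)]
balanced = oversample_minority(rows, TARGET)

print(Counter(row[TARGET] for row in rows))      # original class counts
print(Counter(row[TARGET] for row in balanced))  # equal class counts
```

If a technique like this is used, it should be applied to the training split only, so that duplicated rows do not leak into the evaluation data and inflate the measured score.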

References

This dataset is publicly available in the UCI Machine Learning Repository as "Default of Credit Card Clients".