The decision-making process for credit assignment can drastically affect the financial outcome of any banking business. To properly assess the risk of opening a credit line for a given user, one must rely on historical data on user behaviour. The objective of the credit provider is to maximise the number of open credit lines while keeping the number of defaulters as low as possible.
This dataset contains historical data on 150,000 users and their credit performance. The target variable (SeriousDlqin2yrs) flags users who experienced a delinquency of 90 days past due or worse, and there are 10 explanatory variables available to predict this column. These columns contain behavioural information (such as the total balance on credit cards, monthly debt payments divided by gross income, and the number of open loans, among others), personal information (age, monthly income, family status...) and the number of times the user has found themselves in financial difficulties in the past.
The objective is to train an ML model that assigns a default probability to each user in a subsample of the dataset. The predictive algorithm has to strike a balance between reducing defaults as much as possible and not being too strict, as we want to maximise the number of opened credit lines. ROC AUC is therefore a good metric: it summarises the trade-off between catching defaulters (true positive rate) and wrongly rejecting good customers (false positive rate) across all decision thresholds.
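To make the metric concrete, here is a minimal sketch of ROC AUC computed as the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher predicted probability. The function name `roc_auc`, the toy labels and the toy scores are illustrative assumptions, not part of the dataset or of any Synthesized API; in practice a library routine would normally be used instead of this quadratic loop.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise ranking: the fraction of (positive, negative)
    pairs where the positive gets the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = defaulted, 0 = did not default.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.05, 0.20, 0.70, 0.10, 0.15, 0.40]
auc = roc_auc(labels, scores)

# A model that gives every user the same score ranks no pair correctly
# beyond chance, so its AUC is 0.5 regardless of how rare defaults are.
constant_auc = roc_auc(labels, [0.1] * len(labels))
```

Because the score depends only on how users are ranked, not on a single cut-off, it does not reward a model for simply predicting "no default" for everyone.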
Data Problems and Synthesized Solutions
Although this dataset can make a huge difference to a credit business's performance, it has some problems that complicate its use. Luckily, Synthesized can solve these problems in a fast and intuitive way.
- Privacy. This dataset contains personal information about users, making it difficult to work with and share. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
- Imbalanced Dataset. Only 10,026 of the 150,000 users (6.68%) in this dataset defaulted. This imbalance may heavily reduce the performance of the model if not treated carefully. With Synthesized's Data Manipulation tool we can manipulate the output distribution of this column and generate a balanced dataset, which can improve final model performance. Read more about the benefits of data rebalancing in our blog post.
- Nulls. Two columns in this dataset (MonthlyIncome and NumberOfDependents) contain nulls. This can be problematic when training ML models, as NaNs are usually not supported. The Synthesized tool learns the structure of the dataset with and without nulls, and it can generate data with or without nulls, or even impute the missing values.
- Fairness and Biases. AI models can unintentionally (and potentially illegally) discriminate against certain sensitive groups of people if the underlying training data is biased. In this case, features such as age and NumberOfDependents should not be used as discriminating features under current law in the US, UK and EU. Synthesized can help assess how biased a dataset is, find where the biases are and flag them to the user. Read more about discrimination by AI in our blog post.
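To give a feel for the size of the imbalance and null problems listed above, the sketch below uses only the Python standard library. It is not the Synthesized API: inverse-frequency class weights and median imputation are generic stand-in techniques, the helper `impute_median` is hypothetical, and the `monthly_income` values are made up for illustration. Only the two counts come from the dataset description.

```python
from statistics import median
from typing import List, Optional, Tuple

# Counts taken from the dataset description above.
N_USERS = 150_000
N_DEFAULTS = 10_026
default_rate = N_DEFAULTS / N_USERS  # share of users flagged as defaulters

# Inverse-frequency class weights: n_samples / (n_classes * n_in_class).
# With these weights both classes carry the same total weight in training.
w_default = N_USERS / (2 * N_DEFAULTS)
w_non_default = N_USERS / (2 * (N_USERS - N_DEFAULTS))


def impute_median(values: List[Optional[float]]) -> Tuple[List[float], List[bool]]:
    """Fill missing entries (None) with the median of the known entries and
    return a flag per row recording which values were originally missing."""
    known = [v for v in values if v is not None]
    fill = median(known)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


# Made-up MonthlyIncome values with two nulls, for illustration only.
monthly_income = [5400.0, None, 3200.0, 8100.0, None]
imputed_income, income_was_missing = impute_median(monthly_income)
```

Keeping the `income_was_missing` flag alongside the imputed values lets a model still use the fact that a value was absent, which can itself be informative.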
This dataset is publicly available from the GiveMeSomeCredit Kaggle competition.