Vehicle Insurance Claim Prediction

Each record in this vehicle insurance dataset holds one year's worth of information for an insured vehicle. The response variable is the amount of claims experienced for that vehicle in that year.


The dataset contains 209,240 insurance records. The target variable is the dollar amount of claims, and the explanatory variables describe the policy and the vehicle (make, model, year and other miscellaneous vehicle characteristics), alongside a row identifier and a household identifier.

Use Case

The target is a continuous variable, so this is a regression task. The organizers of the original competition evaluated submissions with the normalized Gini coefficient, computed on 2008 data, with models given only the data from 2005 to 2007.
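
The normalized Gini coefficient measures how well predictions rank records by their actual claim amount. The sketch below follows the formulation commonly used for this kind of competition metric. The function names and the tie-breaking rule (ties keep their original order) are assumptions for illustration, not the organizers' reference implementation.

```python
def gini(actual, pred):
    """Unnormalized Gini: how well `pred` orders `actual` (largest first)."""
    n = len(actual)
    total = float(sum(actual))
    if n == 0 or total == 0:
        raise ValueError("need at least one record and a nonzero total claim amount")
    # Sort record indices by prediction, descending. Python's sort is stable,
    # so tied predictions keep their original order (an assumed convention).
    order = sorted(range(n), key=lambda i: -pred[i])
    running = 0.0      # cumulative actual claims captured so far
    cum_share = 0.0    # sum of cumulative shares over all positions
    for i in order:
        running += actual[i]
        cum_share += running / total
    # Subtract the area expected from a random ordering, then average.
    return (cum_share - (n + 1) / 2.0) / n


def normalized_gini(actual, pred):
    """Gini of the predictions divided by the Gini of a perfect ordering."""
    return gini(actual, pred) / gini(actual, actual)


if __name__ == "__main__":
    claims = [0.0, 0.0, 120.0, 0.0, 950.0]      # toy claim amounts
    scores = [0.1, 0.3, 0.7, 0.2, 0.9]          # toy model predictions
    print(normalized_gini(claims, scores))
```

A score of 1 means the predictions rank records exactly as the actual claims do, and a score near 0 means the ranking is no better than random. Only the ordering of the predictions matters, not their scale.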

Data Problems and Synthesized Solutions

Although this dataset can make a real difference to an insurance business's performance, it has some problems that complicate its use. Synthesized can address these problems quickly and intuitively.

  • Privacy. This dataset contains personal information about users, which makes it difficult to work with and to share. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
  • Imbalanced Dataset. Only 1,604 of the 209,240 records (0.77%) belong to the minority class, i.e. vehicles with a claim. This imbalance can heavily reduce model performance if it is not treated carefully. With Synthesized's Data Manipulation tool we can adjust the output distribution of this column and generate a balanced dataset, which can improve final model performance. Read more about the benefits of data rebalancing in our blog post.
  • Fairness and Biases. AI models can unintentionally (and potentially illegally) discriminate against certain sensitive groups of people if the underlying training data is biased. Synthesized can help assess how biased a dataset is, find where the biases are and flag them to the user. Read more about discrimination by AI in our blog post.
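
The degree of imbalance quoted in the list above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two counts given in the text. The variable names are illustrative, and the 50/50 target is an assumed example of a "balanced" dataset, not a description of what Synthesized's tool produces.

```python
minority_count = 1_604      # minority-class records, from the text
total_count = 209_240       # all records, from the text

majority_count = total_count - minority_count
minority_share = minority_count / total_count

# Factor by which the minority class would have to grow to match the
# majority class one-to-one (assumed 50/50 target, for illustration only).
growth_factor = majority_count / minority_count

if __name__ == "__main__":
    print(f"majority records: {majority_count}")
    print(f"minority share:   {minority_share:.2%}")
    print(f"growth factor:    {growth_factor:.1f}x")
```

The minority share comes out at about 0.77% when rounded to two decimals, so a model that always predicts the majority class would already be right for more than 99% of records, which is why accuracy alone is a poor guide here.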


The data is available in the Kaggle competition "Allstate Claim Prediction Challenge".

Download the dataset here