Insurance Quote Conversion Project

Targeting customers who are interested in buying policies from an insurance company, and who are more likely to buy the product, can increase the conversion rate. This quote conversion dataset represents the activity of a large number of customers who are interested in buying policies from an insurance company.


This dataset (from the "Homesite Quote Conversion" Kaggle competition) contains historical data for 67,504 users together with the outcome of each quote. Each QuoteNumber corresponds to a potential customer, and the target variable QuoteConversion_Flag indicates whether the customer purchased a policy. For this competition, the organizers have anonymized the information in the columns. There are 56 explanatory variables, covering specific coverage information, sales information, personal information, property information, and geographic information.

Use Case

The objective is to train an ML model that returns the probability that a customer accepts the offered product. This is a binary classification task, so the F1-score is a good metric for evaluating the model's performance on this dataset: it weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously.
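As a rough illustration of the metric, the sketch below turns predicted probabilities into class labels and computes precision, recall and F1 for the positive class. The function name `precision_recall_f1`, the 0.5 threshold and the toy labels and scores are assumptions made up for this example; they are not taken from the dataset or from any Synthesized tooling.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Threshold probabilities, then score the positive class (label 1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.05]

precision, recall, f1 = precision_recall_f1(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the model outputs probabilities, the F1-score depends on the chosen threshold; on an imbalanced target it is usually worth trying values other than 0.5.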

Data Problems and Synthesized Solutions

Although this dataset can make a huge difference to an insurance business's performance, it has some problems that complicate its use. Luckily, Synthesized can solve these problems in a fast and intuitive way.

  • Privacy. This dataset contains personal information about users, which makes it difficult to work with and to share. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
  • Imbalanced Dataset. Only 12,742 of the 67,504 users (18.9%) in this dataset converted. This imbalance may heavily reduce the performance of the model if it is not treated carefully. With Synthesized's Data Manipulation tool we can manipulate the output distribution of this column and generate a balanced dataset, which can improve final model performance. Read more about the benefits of data rebalancing in our blogpost.
  • Fairness and Biases. AI models can unintentionally (and potentially illegally) discriminate against certain sensitive groups of people if the underlying training data is biased. Synthesized can help assess how biased a dataset is, find where the biases are and flag them to the user. Read more about discrimination by AI in our blogpost.
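To make the imbalance point concrete, the sketch below reproduces the class ratio from the counts quoted above and balances the target with plain random oversampling. This is a generic stand-in technique and not Synthesized's Data Manipulation tool, which generates synthetic rows instead of duplicating existing ones. The helper `oversample_minority` and the one-column rows are hypothetical, built only from the two counts given in the text.

```python
import random


def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) < len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    if not minority:
        return list(rows)  # nothing to resample from

    # Draw with replacement to fill the gap between the two classes.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Counts taken from the text; the rows themselves are placeholders.
n_total, n_converted = 67504, 12742
rows = (
    [{"QuoteConversion_Flag": 1}] * n_converted
    + [{"QuoteConversion_Flag": 0}] * (n_total - n_converted)
)

balanced = oversample_minority(rows, "QuoteConversion_Flag")
positive_share_before = n_converted / n_total
positive_share_after = sum(r["QuoteConversion_Flag"] for r in balanced) / len(balanced)
print(f"before: {positive_share_before:.1%}, after: {positive_share_after:.1%}")
```

Any resampling of this kind should be applied to the training split only, so that evaluation still reflects the original class ratio.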


This dataset is available in the "Homesite Quote Conversion" Kaggle competition.

Download the dataset here
