Business Property Insurance Risk

Fire accounts for a significant portion of total property losses. Because fire losses are high in severity and low in frequency, they are inherently volatile, which makes them difficult to model. This dataset makes it possible to identify each policyholder’s risk exposure more accurately and to tailor insurance coverage to their specific operation.


This dataset contains 105,450 insurance records. The target variable is a transformed ratio of loss to total insured value, and the explanatory variables contain policy characteristics and information on crime rate, geodemographics, and weather.

Use Case

The target is a continuous variable, so this is a regression task. The competition organizers evaluated results with the weighted Gini coefficient, where the weights are given by the var11 variable.
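
To make the metric concrete, here is a minimal sketch of a normalized weighted Gini coefficient in plain Python. It uses one common formulation (the weighted area between the Lorenz curve of the actual losses, ordered by prediction, and the diagonal); the competition's exact scoring code may differ in details such as tie handling. The function names and the toy values are illustrative assumptions, not part of the dataset or of any Synthesized API.

```python
def weighted_gini(actual, predicted, weights):
    """Weighted area between the Lorenz curve and the diagonal.

    Records are ranked by predicted value, highest first. Assumes the
    weighted sum of `actual` is non-zero (otherwise the Lorenz curve
    is undefined and this raises ZeroDivisionError).
    """
    # sorted() is stable, so tied predictions keep their input order.
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    total_weight = sum(weights)
    total_loss = sum(a * w for a, w in zip(actual, weights))

    cum_weight = 0.0
    cum_loss = 0.0
    gini = 0.0
    for i in order:
        cum_weight += weights[i]
        cum_loss += actual[i] * weights[i]
        lorenz = cum_loss / total_loss      # share of weighted loss captured so far
        random = cum_weight / total_weight  # share of weight seen so far
        gini += (lorenz - random) * weights[i] / total_weight
    return gini


def normalized_weighted_gini(actual, predicted, weights):
    """Scale by the score of a perfect ranking, so a perfect model scores 1."""
    return weighted_gini(actual, predicted, weights) / weighted_gini(
        actual, actual, weights
    )


# Toy values for illustration only (not taken from the dataset).
actual = [0.0, 0.0, 1.0, 0.0, 2.0]       # target: transformed loss ratio
weights = [1.0, 2.0, 1.0, 1.0, 1.0]      # would come from var11
predicted = [0.1, 0.3, 0.2, 0.0, 0.9]    # hypothetical model output

print(normalized_weighted_gini(actual, predicted, weights))
```

Because the score depends only on how the predictions rank the records, any monotonic rescaling of the predictions leaves it unchanged.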

Data Problems and Synthesized Solutions

Although this dataset can make a huge difference to an insurance business's performance, it has some problems that complicate its use. Luckily, Synthesized can solve these problems quickly and intuitively.

  • Nulls. Many columns in this dataset contain nulls. This can be problematic when training ML models, as NaNs are usually not supported. The Synthesized tool learns the structure of the dataset with and without nulls, and it can generate data with or without nulls, or even impute missing values.
  • Imbalanced Dataset. Many columns in this dataset are highly imbalanced (for example, 99.8% of insurance records had zero loss). If not treated carefully, this imbalance may heavily reduce model performance on the under-represented records. With Synthesized's Data Manipulation tool we can manipulate the output distribution of such a column and generate more samples to balance the population. Read more about the benefits of data rebalancing in our blog post.
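
Before treating either problem, it helps to measure it. The following is a small dependency-free sketch, assuming the records have already been loaded as a list of dictionaries with missing values represented as None. The column names "target" and "weather" and the toy rows are hypothetical; only var11 is named in the text above. It does not use or represent the Synthesized API.

```python
def null_fraction_by_column(rows):
    """Fraction of missing (None) values in each column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return {col: n / len(rows) for col, n in counts.items()}


def zero_share(rows, target_col):
    """Share of records whose (non-missing) target is exactly zero."""
    values = [row[target_col] for row in rows if row[target_col] is not None]
    return sum(1 for v in values if v == 0) / len(values)


# Toy records for illustration only (not taken from the dataset).
rows = [
    {"target": 0.0, "var11": 1.0, "weather": None},
    {"target": 0.0, "var11": 2.0, "weather": 3.1},
    {"target": 0.0, "var11": None, "weather": None},
    {"target": 0.4, "var11": 1.5, "weather": 2.2},
]

print(null_fraction_by_column(rows))
print(zero_share(rows, "target"))
```

On the real data, a zero share close to the 99.8% quoted above would confirm how few non-zero losses a model has to learn from.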


The data is available in the Kaggle competition "Liberty Mutual Group - Fire Peril Loss Cost".

Download the dataset here