Medical Cost Personal Dataset

To develop the best medical insurance products, insurers need access to historical data to approximate the medical costs of each user. With this data, a medical insurer can develop more accurate pricing models, plan for particular insurance outcomes, or manage large portfolios. In all these cases, the objective is to accurately predict insurance costs.

Dataset

This dataset contains 1,339 medical insurance records. The target variable, charges, is the individual medical cost billed by health insurance. The remaining columns contain personal information such as age, gender, family status, and whether the patient smokes, among other features.

Use case

The objective is to train an ML regression model that predicts the target column charges as accurately as possible. Because this is a regression problem, metrics such as the coefficient of determination (R²) and the mean squared error (MSE) are used to evaluate the model.
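As a minimal sketch of how these two metrics are computed, the example below fits an ordinary least-squares model with NumPy and scores it on held-out rows. The data is randomly generated toy data, not the real dataset: the column names age and smoker, the coefficients, and the noise level are all assumptions made up for illustration, and the helper names are hypothetical.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy stand-in for the dataset (assumption: NOT the real records).
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n)
smoker = rng.integers(0, 2, size=n)
charges = 250.0 * age + 20000.0 * smoker + rng.normal(0.0, 2000.0, size=n)

# Design matrix with an intercept column, split into train and test rows.
X = np.column_stack([np.ones(n), age, smoker])
train, test = slice(0, 150), slice(150, None)

# Ordinary least squares fit on the training rows.
coef, *_ = np.linalg.lstsq(X[train], charges[train], rcond=None)
pred = X[test] @ coef

mse = mean_squared_error(charges[test], pred)
r2 = r2_score(charges[test], pred)
print(f"MSE: {mse:.1f}  R^2: {r2:.3f}")
```

A lower MSE and an R² closer to 1 indicate a better fit. On the real data, a more flexible model than plain linear regression may well be needed.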

Data problems and Synthesized solutions

This dataset can boost the financial performance of a medical insurer, but it has some issues that complicate its use. Fortunately, Synthesized data generation tools can solve these problems quickly and intuitively.

Privacy. This dataset contains personal information about users, which makes it difficult to work with and share. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
Fairness and biases. AI models can unintentionally (and potentially illegally) discriminate against certain sensitive groups of people if the underlying training data is biased. This dataset is especially sensitive, as it contains users' medical records. Synthesized can help assess how biased a dataset is, find where the biases are, and flag them to the user. Read more about discrimination by AI.
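To make the idea of a dataset-level bias check concrete, here is a rough sketch that compares the mean of a target across groups of a sensitive attribute. This is not Synthesized's method or API. The function names, the group labels, and the toy values are hypothetical, and a gap in group means is only a first signal, not proof of unfair treatment.

```python
from collections import defaultdict


def group_means(values, groups):
    """Mean of `values` for each distinct label in `groups`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


def max_mean_gap(values, groups):
    """Largest difference between any two group means."""
    means = group_means(values, groups)
    return max(means.values()) - min(means.values())


# Made-up toy values (assumption: NOT taken from the real dataset).
toy_charges = [1000.0, 1200.0, 3000.0, 3400.0]
toy_groups = ["a", "a", "b", "b"]

means = group_means(toy_charges, toy_groups)
gap = max_mean_gap(toy_charges, toy_groups)
print(means, gap)
```

A large gap suggests the attribute deserves a closer look, for example by checking whether the difference persists after controlling for other columns.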

References

This dataset is publicly available on Kaggle as the Medical Cost Personal dataset.