Housing Prices

Accurately estimating the maximum price the market is prepared to pay for a property is crucial for judging whether a given house is overpriced or underpriced. Using this dataset, one can train a regression model to estimate a fair price for a house. This can help an investment fund discover new opportunities or discard expensive deals, and help an agent understand whether a managed property is underpriced or overpriced.

Dataset

This dataset contains historical data on 134,000 real estate transactions in the UK from 1st January 1995 to 29th June 2017. The target variable (Price) indicates the price paid in a given transaction, and 9 explanatory variables are available to predict it. Together, the columns describe the transaction (date and price) and the sold property (location, property type, whether it is a new construction, tenure type, etc.).

Use Case

The objective is to train an ML regression model that predicts the price of a property as accurately as possible from the remaining features. Because this is a regression problem, metrics such as the coefficient of determination (R2) and the mean absolute error (MAE) are used to evaluate the model.
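As a minimal sketch of how these two metrics are computed, the snippet below implements R2 and MAE in plain Python, taking MAE to mean the mean absolute error. The price values are hypothetical and made up for illustration; they are not taken from the dataset, and the function names are our own.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted prices."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices in GBP (assumption: illustrative values only).
actual = [150000, 200000, 250000, 300000]
predicted = [160000, 190000, 260000, 290000]

mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(f"MAE: {mae:.0f}  R2: {r2:.3f}")
```

MAE is expressed in the same unit as the price, which makes it easy to interpret, while R2 measures the share of price variance the model explains relative to always predicting the mean.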

Data Problems and Synthesized Solutions

Although this dataset can make a huge difference to a real estate business's performance, it has some problems that complicate its use. Luckily, Synthesized can solve these problems in a fast and intuitive way.

  • Privacy. For privacy reasons, the source has dropped some important columns, such as the personal details of the buyer and seller and the exact address of the property. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
  • Imbalanced Dataset. This dataset contains some columns that are highly imbalanced (only 5.6% of rows have Old/New == 'Y' and 22.3% have Duration == 'L'). This class imbalance may heavily reduce the model's performance on these subsamples if not treated carefully. With Synthesized's Data Manipulation tool we can manipulate the output distributions of these columns and generate more samples to balance these populations. Read more about the benefits of data rebalancing in our blog post.
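To illustrate the idea of rebalancing, here is a rough sketch that uses naive random oversampling in plain Python. This is a simple stand-in and not Synthesized's Data Manipulation tool, which generates new synthetic samples rather than duplicating existing rows. The `oversample` helper, the toy rows and the 'N' value for the majority class are assumptions made for this example; only the Old/New column name and its 'Y' value come from the text above.

```python
import math
import random
from collections import Counter


def oversample(rows, column, minority_value, target_share, seed=0):
    """Append randomly duplicated minority rows until they make up target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    minority = [r for r in rows if r[column] == minority_value]
    if not minority:
        raise ValueError("no rows with the minority value to duplicate")
    n_other = len(rows) - len(minority)
    # Smallest minority count m such that m / (m + n_other) >= target_share.
    needed = math.ceil(target_share * n_other / (1 - target_share))
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(max(0, needed - len(minority)))]
    return rows + extra


# Toy rows made up for illustration: 6 of 100 are new builds ('Y').
# The 'N' label for the remaining rows is an assumption.
rows = [{"Old/New": "Y"}] * 6 + [{"Old/New": "N"}] * 94

balanced = oversample(rows, "Old/New", "Y", target_share=0.5)
counts = Counter(r["Old/New"] for r in balanced)
share_new = counts["Y"] / len(balanced)
print(counts, round(share_new, 3))
```

Duplicating rows adds no new information and can encourage overfitting on the minority class, which is one reason to prefer generating fresh synthetic samples when rebalancing.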

References

This dataset is publicly available as the "UK Housing Prices Paid" Kaggle dataset.

Download the dataset here