Limitations of Current Data Quality Solutions and (Statistical) Data Quality Assessment

Whilst data is rapidly becoming the world’s most valuable resource, there are very limited ways of assessing and quantifying the value of any set of data.

Assessing the Value of Data and Data Quality

In order to assess the value of data, there is a crucial concept to bear in mind:

Data is a medium for communicating information.

With this in mind, it becomes apparent that the value of data is determined by:

1)    The information it represents and that information’s impact on key business metrics.

2)    The degree to which the data is able to convey that information.

The second point is what we refer to as “Data Quality”. The first is highly context-dependent and a separate issue altogether. Once we separate the quality of the data from the quality of the information, the picture becomes much clearer: we can focus on data quality in the sense of how well the data embodies the information it contains.

Various frameworks, such as ISO 8000 and DAMA UK, attempt to provide a standard for data quality.

ISO 8000

ISO 8000 was proposed in 2002 and is still a work in progress. It is intended to be the global standard for Data Quality and Enterprise Master Data, and it describes quality data in terms of the following characteristics.

Portability: Portable data protects the intellectual property in the data and allows it to be used across applications and computer systems.

Note that a great many tools already exist to manage this aspect, and sound data design within the enterprise also addresses it.

Meets requirements: Quality is the degree to which something meets stated requirements. Quality data is data that meets stated data requirements.

Provenance: Knowing the source of data is a key characteristic to establishing trust in data.

In essence, this is a binary element to quality data: the source is either known or not.

Accuracy: Accuracy is a claim of the conformance to facts. Provenance is a prerequisite to any claims that data is accurate.

Completeness: Data is complete if all the parts of a data set are present. Whilst some elements of completeness are easy to check (e.g. missing values), others are not so simple (e.g. gaps in the distribution).
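The difference between the two kinds of completeness check can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the `ages` sample and the equal-width binning are assumptions made for the example and are not part of ISO 8000. It also assumes every observed value lies within the stated range.

```python
from typing import List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Share of entries that are missing (the easy check)."""
    return sum(v is None for v in values) / len(values)


def empty_bins(values: Sequence[Optional[float]],
               low: float, high: float, n_bins: int) -> List[int]:
    """Indices of equal-width bins over [low, high] that hold no observed value.

    Assumes all observed values lie within [low, high].
    """
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v is None:
            continue
        # A value equal to `high` is placed in the last bin.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return [i for i, c in enumerate(counts) if c == 0]


# Hypothetical sample: customer ages with one missing entry.
ages = [18, 22, 25, 29, None, 61, 64, 70, 75, 79]

print(missing_fraction(ages))        # one value in ten is missing
print(empty_bins(ages, 10, 80, 7))   # bins covering ages 30 to 59 are empty
```

Only one value is missing, so the dataset looks almost complete by the first measure, yet three of the seven age bands contain no observations at all. The second check is the harder one in practice, because it depends on choosing a sensible range and bin width for each column.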

DAMA UK

DAMA UK has six primary dimensions for data quality assessment: Timeliness, Uniqueness, Validity, Consistency, Accuracy, Completeness.
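As a rough sketch of how some of these dimensions are typically turned into numbers, the Python below scores uniqueness, validity and timeliness as simple ratios. The record layout, the `AC` plus six digits identifier format and the 30-day freshness window are hypothetical choices made for this example; DAMA UK does not prescribe these formulas.

```python
import re
from datetime import date, timedelta


def uniqueness(records, key):
    """Distinct values of `key` divided by the number of records."""
    values = [r[key] for r in records]
    return len(set(values)) / len(values)


def validity(records, key, pattern):
    """Share of records whose `key` fully matches the expected format."""
    rx = re.compile(pattern)
    return sum(bool(rx.fullmatch(r[key])) for r in records) / len(records)


def timeliness(records, key, as_of, max_age_days):
    """Share of records updated within `max_age_days` of `as_of`."""
    cutoff = as_of - timedelta(days=max_age_days)
    return sum(r[key] >= cutoff for r in records) / len(records)


# Hypothetical records for illustration only.
records = [
    {"account_id": "AC000001", "updated": date(2024, 1, 10)},
    {"account_id": "AC000002", "updated": date(2023, 6, 1)},
    {"account_id": "AC000002", "updated": date(2024, 1, 5)},   # duplicate id
    {"account_id": "ac-3", "updated": date(2024, 1, 12)},      # malformed id
]

print(uniqueness(records, "account_id"))
print(validity(records, "account_id", r"AC\d{6}"))
print(timeliness(records, "updated", date(2024, 1, 15), 30))
```

Each score is easy to compute, but none of them says anything about what the data communicates or how it affects a business metric.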

Limitations of DAMA and ISO 8000

Key elements of ISO 8000 and DAMA are vaguely defined and detached from business performance metrics, so they are rarely applicable to modern data strategies in the banking and insurance sectors.

Data completeness is used quite frequently by data engineers and test engineers. However, counting the number of missing values in a dataset is a trivial task. The real question is: how complete is the data in conveying some piece of information?

To illustrate this point, let us consider a dataset that conveys information about the edge of a circle. If we only have a small number of data points, it is unlikely that one would be able to tell that the data describes a circle. As the number of points describing the circle increases, the data provides more information, up to the point where it is quite clear what the data represents. Any additional points added to the dataset beyond that would be redundant.

Whilst assessing the marginal utility of additional data points describing a linear correlation (or the edge of a circle) may appear to be a straightforward task, it is not so simple when the dataset in question contains continuous data, categorical data, names and postal codes, all related in some complicated manner.

Illustration of the concept that the value of data is determined by the information it is able to convey and its impact on key business metrics.
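The circle example can be made concrete with a toy calculation. The sketch below assumes noise-free, evenly spaced points and uses the length of the polygon drawn through them, relative to the true circumference, as a stand-in for "how much of the shape the data conveys". That proxy and the function names are choices made for this illustration, not the metric used by the Synthesized platform.

```python
import math


def circle_points(n, radius=1.0):
    """Return n evenly spaced, noise-free points on a circle centred at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def outline_coverage(points, radius=1.0):
    """Length of the closed polygon through the points, as a fraction of the circumference."""
    perimeter = sum(math.dist(points[i], points[(i + 1) % len(points)])
                    for i in range(len(points)))
    return perimeter / (2 * math.pi * radius)


sizes = [4, 8, 16, 32, 64]
coverage = [outline_coverage(circle_points(n)) for n in sizes]
# Extra coverage gained by each doubling of the number of points.
gains = [b - a for a, b in zip(coverage, coverage[1:])]

for n, c in zip(sizes, coverage):
    print(f"{n:>3} points: {c:.4f} of the outline recovered")
```

Each doubling of the dataset recovers less additional outline than the previous one, which is the sense in which the later points are redundant. Real datasets are noisy and mix many column types, so no closed-form proxy like this one is available for them.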


Statistical Data Quality Assessment

At Synthesized, we’ve taken the initiative to introduce the first metric of data quality that measures how well the data embodies the information it was intended to hold, and its impact on crucial business metrics in banking and insurance.


The Synthesized ML Core layer is able to form a “deep understanding” of the information that the data represents. As a consequence of this process, the platform is able to provide rapid insights (in under 10 minutes) into how well the data conveys information and we can calculate metrics that describe how complete the data is in terms of the information.

This functionality enables the platform to rapidly:

  • Assess Data Completeness/Saturation/Redundancy
  • Profile the information within a dataset

These metrics can be applied in an automated way to a wide variety of datasets in banking and insurance, including fraud detection/AML, credit scoring and segmentation analysis.

With the full toolset provided by Synthesized, it is now possible to automatically quantify the quality and utility of data for modern data projects.

Learn More

To explore how the data quality assessment performs in common banking and insurance scenarios, and to review the data insights produced by the Synthesized platform in detail, contact our data strategy experts at simon@synthesized.io.

