Originally published 21 July 2010
Measuring data quality is still in its infancy. This is one of the reasons why making solid business cases is so hard. Because our mastery of fundamental concepts and principles is relatively meager, for the moment we are limited to local and idiosyncratic measures. Herzog et al (2007): “… most current quantification approaches are created in an ad hoc fashion that is specific to a given database and its use.” And: “The construction of such metrics is a fertile area for future research.”
At present, we lack an overarching theory from which to derive universal quality measures. As Goethe said, “The first sign we don’t know what we are doing is an obsession with numbers.” Given these limitations, what is attainable in terms of measurement? Most professionals would agree that it makes sense to measure data quality, preferably on a continuous basis. It’s hackneyed, but you can’t manage what you don’t measure.
When you define “quality” as value to some user, it follows that quality needs to be measured relative to the benefits that user is getting from the data. To make numbers that represent “data quality” meaningful, we need a framework that relates the quality of decision making to the quality of the data on which it is based. Then value to the data user (decision maker) is tied to information, rather than data, and its output (decisions) drives the value.
What Does the Literature Have to Say?
In his landmark work,1 English (1999) puts data quality in the total quality management (TQM) framework. English’s 2009 tome, Information Quality Applied,2 continues in this tradition. He goes to great lengths to tie breakdowns in primary business processes to data faults, and to monetize these. Examples might be sending duplicate (and hence wasted) mail packs, overstocking product that perishes or goes out of fashion, or missed sales as a result of stock-outs.
Given the historical roots of TQM in manufacturing, this bias towards primary process breakdown is quite natural. However, we live in a knowledge economy, and many areas that use lots of data have a difficult time linking the costs of primary processes to data errors. Sectors such as healthcare, government, and non-profit also have a hard time attaching financial numbers to errors. Yet for them, data quality is just as important as anywhere else.
Herzog et al (2007)3 provide three kinds of data quality metrics: completeness, proportion of duplicates and the proportion of each data element that is missing. Completeness refers to the proportion of entities from the population that is represented on the database/list. There is always talk about the census coverage, for instance, which is supposed to be 100%, but isn’t really. Proportion of duplicates refers to over-representation of some entities that might appear more than once in a list. In particular when you acquire multiple lists for a marketing campaign, there is always a chance of drawing the same person more than once, which leads to waste if you mail the same person twice. This also doesn’t look very professional. Proportion of data elements missing is a measure that represents how many rows within each column are missing when they really should be there.
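As a rough illustration of these three measures, here is a minimal Python sketch. It is not Herzog et al's implementation: the function names, the sample records, and the reference population are all hypothetical, and duplicates are detected by exact match on an "id" key, which sidesteps the much harder record-linkage problem of spotting the same person under slightly different spellings.

```python
from collections import Counter

def completeness(records, population_ids, key="id"):
    """Share of the (non-empty) reference population that appears in the list."""
    present = {r[key] for r in records}
    return len(present & set(population_ids)) / len(population_ids)

def duplicate_proportion(records, key="id"):
    """Share of records that are surplus copies of an entity already listed."""
    counts = Counter(r[key] for r in records)
    surplus = sum(n - 1 for n in counts.values())
    return surplus / len(records)

def missing_proportion(records, fields):
    """Per field, the share of records whose value is None or an empty string."""
    return {
        f: sum(1 for r in records if r.get(f) in (None, "")) / len(records)
        for f in fields
    }

# Hypothetical mailing list; ids 1 to 5 are the assumed reference population.
records = [
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "email": None},
    {"id": 2, "name": "Bob", "email": None},
    {"id": 4, "name": "", "email": "dee@example.com"},
]
population_ids = [1, 2, 3, 4, 5]

print(completeness(records, population_ids))          # 3 of 5 entities covered
print(duplicate_proportion(records))                  # 1 surplus record out of 4
print(missing_proportion(records, ["name", "email"]))
```

Note that completeness needs an external reference population, which in practice is often the hardest part to obtain.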
Arkady Maydanchik, in his book Data Quality Assessment,4 recommends monitoring data quality using scorecards. His proposed approach boils down to performing a post hoc data clean up that you subsequently use to derive business rules to which new, incoming data ought to conform.
Data quality rules can be derived at the field level, like counting the number of unexpected missings, or values that fall outside an acceptable range (like ages > 150). Rules can also be derived between fields like when gender is male, pregnancy can’t possibly be set to “yes,” etc. The number and relation between data elements can grow arbitrarily. The complexity of these business rules tends to grow as increasingly elaborate ways of validating results become available.
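A field-level and a cross-field rule of the kind described above could be sketched as follows. This is an illustrative assumption rather than Maydanchik's own notation: the rule names, field names, and sample rows are hypothetical, and a real scorecard would hold many more rules.

```python
def age_present(row):
    """Field-level rule: age must not be missing."""
    return row.get("age") is not None

def age_in_range(row):
    """Field-level rule: a recorded age must lie between 0 and 150.
    Missing ages are left to the age_present rule."""
    age = row.get("age")
    return age is None or 0 <= age <= 150

def male_not_pregnant(row):
    """Cross-field rule: gender 'male' cannot go with pregnant 'yes'."""
    return not (row.get("gender") == "male" and row.get("pregnant") == "yes")

RULES = {
    "age_present": age_present,
    "age_in_range": age_in_range,
    "male_not_pregnant": male_not_pregnant,
}

def count_violations(rows, rules):
    """Return, per rule, the number of rows that fail it."""
    return {
        name: sum(1 for row in rows if not check(row))
        for name, check in rules.items()
    }

# Hypothetical incoming records.
rows = [
    {"age": 34, "gender": "female", "pregnant": "yes"},
    {"age": 172, "gender": "male", "pregnant": "no"},
    {"age": None, "gender": "male", "pregnant": "yes"},
    {"age": 51, "gender": "male", "pregnant": "no"},
]

print(count_violations(rows, RULES))
```

Keeping each rule as a small named function makes it easy to add rules as new ways of validating the data become available, and the per-rule counts can feed a scorecard directly.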
Continuous measurement of data quality goes a long way toward raising awareness. As the Hawthorne studies have shown, and as you can see for instance in call centers, merely recording quality levels can in itself drive quality up. But how should you represent “quality”? What are valid and equitable measures?
The measures that Herzog et al and Maydanchik have proposed all lend themselves to monitoring. Although they lack a firm theoretical framework of the kind that underpins English’s approach, they are a pragmatic first step. But don’t be fooled: any number you report in a scorecard is in itself arbitrary, unless you can directly relate it to your bottom line or to an observable improvement in decision making.
It is important to include a range of metrics in a scorecard, not only to address data quality at a wider scale, but also to avoid optimizing a few narrow metrics at the expense of other quality drivers. The selection of your scorecard metrics should be driven by holistic insight into the drivers of business value. Don’t confuse determinants of success with their outcomes! There is simply no substitute for causal analysis here (e.g., structural equation modeling).
For want of an overarching data quality theory, we are limited for the moment to some idiosyncratic measures for data quality. There is nothing wrong with that; it reflects the current state of affairs. There is no point in suggesting scientific precision when you don’t really know what you should measure with scientific precision.
This isn’t necessarily a bad thing; it matches the current maturity of our profession. Measurement without the appropriate model to govern observations merely allows extrapolation. So although one could measure consecutive months and reasonably infer the next (assuming linearity), it doesn’t help you imbue those observations with meaning.
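To make the point about extrapolation concrete, the following sketch fits a straight line through consecutive monthly measurements and projects the next month. The function name and the error rates are made up for illustration, and ordinary least squares is simply one assumed way of applying the linearity assumption.

```python
def extrapolate_next(values):
    """Fit a least-squares line through values observed at months 0..n-1
    and return the fitted value for month n. Needs at least two values."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n

# Hypothetical monthly error rates, in percent of records failing a rule.
monthly_error_rates = [4.0, 3.5, 3.0, 2.5]
print(extrapolate_next(monthly_error_rates))
```

The projection says nothing about why the rate is falling or what a given rate is worth to the business, which is exactly the limitation described above.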
When possible, tying data quality errors to primary process breakdown is an elegant way to measure and monetize data quality. However, in many settings this is far from straightforward. In our knowledge economy so many people make data-based decisions on a repetitive basis, yet a rock-solid business case can still be hard to make.
Some decision outcomes can only be assigned arbitrary values. What is it worth to improve decision making in support of choices about administering life saving drugs? What is the economic value of human lives? Steven Levitt wrote about this controversial topic in Freakonomics,5 but there are many other decisions where no undisputed relation to money seems possible.
Merely recording data quality on an ongoing basis has proven remarkably effective in raising awareness, and hence in improving data quality without any further action being taken.
If you decide to build a data quality scorecard, ensure that improving scores coincides with integral improvement in performance. To draw a parallel with consumer research: customer satisfaction is not a KPI. It’s the result of doing things right for customers, and more particularly things that are of importance to them.
Isolated data quality metrics in and of themselves don’t “automatically” lead to better performance. Smart professionals understand what gets measured, and what gets measured gets rewarded. Ensure these rewards are primary drivers for your business goals – achieving that kind of business alignment is easier said than done.
References:
SOURCE: How to Measure Data Quality: Metrics and Scorecards
Recent articles by Tom Breur