How to Measure Data Quality: Metrics and Scorecards

Originally published 21 July 2010

Measuring data quality is still in its infancy. This is one of the reasons why making solid business cases is so hard. Because our mastery of fundamental concepts and principles is relatively meager, for the moment we are limited to local and idiosyncratic measures. As Herzog et al (2007) put it: “… most current quantification approaches are created in an ad hoc fashion that is specific to a given database and its use.” And: “The construction of such metrics is a fertile area for future research.”

At present, we lack an overarching theory from which to derive universal quality measures. As Goethe said, “The first sign we don’t know what we are doing is an obsession with numbers.” Given these limitations, what is attainable in terms of measurement? Most professionals would agree that it makes a lot of sense to measure data quality, preferably on a continuous basis. It’s hackneyed, but you can’t manage what you don’t measure.

When you define “quality” as value to some user, it inherently follows that quality needs to be measured relative to the benefits that user is getting from the data. To make numbers that represent “data quality” meaningful, we need a framework that relates the quality of decision making to the quality of the data on which it is based. Then value to the data user (decision maker) is tied to information, rather than data, and its output (decisions) drives the value.

What Does the Literature Have to Say?

In his landmark work,1 English (1999) puts data quality in the total quality management (TQM) framework. English’s 2009 tome, Information Quality Applied,2 continues in this tradition. He goes to great lengths to tie breakdowns in primary business processes to data faults, and to monetize these. Examples might be sending duplicate (and hence wasted) mail packs, overstocking product that perishes or goes out of fashion, or missed sales as a result of stock-outs.

Given the historical roots of TQM in manufacturing, this bias towards primary process breakdown is quite natural. However, we live in a knowledge economy, and many areas that use lots of data have a difficult time linking the costs of primary processes to data errors. And many areas, such as medicine, government, or non-profits, have a hard time attaching financial numbers to errors. Yet for them, data quality is just as important as anywhere else.

Herzog et al (2007)3 provide three kinds of data quality metrics: completeness, the proportion of duplicates, and the proportion of each data element that is missing. Completeness refers to the proportion of entities from the population that is represented in the database or list. Census coverage, for instance, is supposed to be 100%, but in practice never quite is. The proportion of duplicates refers to over-representation of entities that appear more than once in a list. Particularly when you acquire multiple lists for a marketing campaign, there is a chance of drawing the same person more than once, which leads to waste if you mail the same person twice; it also doesn’t look very professional. The proportion of data elements missing represents, for each column, how many values are absent when they really should be present.
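
To make these three kinds of metrics concrete, here is a minimal sketch in Python. It is not taken from Herzog et al; the record structure, the matching key, and the population size are purely illustrative assumptions.

```python
# A minimal sketch (not from Herzog et al) of the three metric types,
# assuming a list of dict records and a known target population size.

def completeness(records, population_size):
    """Proportion of the target population represented in the list."""
    return len(records) / population_size

def duplicate_proportion(records, key_fields=("name", "birth_date")):
    """Proportion of records that repeat an already-seen matching key."""
    seen, duplicates = set(), 0
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records) if records else 0.0

def missing_proportions(records, columns):
    """Per-column proportion of values that are absent (None or empty)."""
    missing = {col: 0 for col in columns}
    for rec in records:
        for col in columns:
            if rec.get(col) in (None, ""):
                missing[col] += 1
    return {col: count / len(records) for col, count in missing.items()}

# Toy customer list against a (hypothetical) target population of 10:
customers = [
    {"name": "Ann", "birth_date": "1970-01-01", "email": "ann@example.com"},
    {"name": "Ann", "birth_date": "1970-01-01", "email": None},   # duplicate
    {"name": "Bob", "birth_date": "1985-06-15", "email": ""},     # missing email
]
print(completeness(customers, population_size=10))        # 0.3
print(duplicate_proportion(customers))                     # 0.33...
print(missing_proportions(customers, ["name", "email"]))   # email: 0.66...
```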

Arkady Maydanchik, in his book Data Quality Assessment,4 recommends monitoring data quality using scorecards. His proposed approach boils down to performing a post hoc data cleanup that you subsequently use to derive business rules to which new, incoming data ought to conform.

Data quality rules can be derived at the field level, like counting the number of unexpected missing values, or values that fall outside an acceptable range (like ages > 150). Rules can also be derived between fields: when gender is male, for instance, pregnancy can’t possibly be set to “yes.” The number of data elements involved, and the relations among them, can grow arbitrarily, and the complexity of these business rules tends to grow as increasingly elaborate ways of validating results become available.
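
As a rough illustration of how such rules might be encoded, here is a short sketch. The column names (age, gender, pregnant), the thresholds, and the way violations are tallied are all hypothetical, chosen only to mirror the examples above.

```python
# A sketch of field-level and cross-field validation rules. The column
# names (age, gender, pregnant) are hypothetical; violations are counted
# per rule so the results could feed a scorecard.

def check_record(rec):
    """Return the list of rule names this record violates."""
    violations = []
    # Field-level rule: unexpected missing value.
    if rec.get("age") is None:
        violations.append("age_missing")
    # Field-level rule: value outside an acceptable range.
    elif not (0 <= rec["age"] <= 150):
        violations.append("age_out_of_range")
    # Cross-field rule: a male record cannot have pregnancy set to "yes".
    if rec.get("gender") == "male" and rec.get("pregnant") == "yes":
        violations.append("male_and_pregnant")
    return violations

def violation_counts(records):
    """Aggregate violations per rule across a batch of incoming records."""
    counts = {}
    for rec in records:
        for rule in check_record(rec):
            counts[rule] = counts.get(rule, 0) + 1
    return counts

batch = [
    {"age": 34, "gender": "female", "pregnant": "yes"},
    {"age": 212, "gender": "male", "pregnant": "no"},    # out of range
    {"age": None, "gender": "male", "pregnant": "yes"},  # missing + cross-field
]
print(violation_counts(batch))
# {'age_out_of_range': 1, 'age_missing': 1, 'male_and_pregnant': 1}
```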

Data Quality Scorecards

Continuous measurement of data quality goes a long way toward raising awareness. As the Hawthorne studies have shown, and as you can see in call centers, for instance, merely recording quality levels can in itself drive quality up. But how should you represent “quality”? What are valid and equitable measures?

The measures that Herzog et al and Maydanchik have proposed all lend themselves to monitoring. Although they lack a firm theoretical framework (of the kind English’s approach aspires to), they are a pragmatic first step. But don’t be fooled: any number you report in a scorecard is in itself arbitrary, unless you can directly relate it to your bottom line or to an observable improvement in decision making.

It is important to include a range of metrics in a scorecard, not only to address data quality on a wider scale, but also to avoid optimizing some narrow metrics at the expense of other quality drivers. The selection of your scorecard metrics should be driven by holistic insight into the drivers of business value. Don’t confuse determinants of success with their outcomes! There is simply no substitute for causal analysis here (e.g., structural equation modeling).
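
By way of illustration only, the sketch below shows one way a scorecard row could report several metrics side by side along with a weighted composite. The metric names and weights are invented here; in practice they should come out of the kind of causal analysis mentioned above rather than guesswork.

```python
# A sketch of one scorecard row: several metrics reported side by side,
# plus a weighted composite. The metric names and weights are invented
# for illustration; real weights should reflect drivers of business value.

METRIC_WEIGHTS = {
    "completeness": 0.4,
    "duplicate_rate": 0.3,   # lower is better; inverted below
    "missing_email": 0.3,    # lower is better; inverted below
}

def scorecard_row(period, metrics):
    """Build one scorecard row from raw metric values for a given period."""
    composite = (
        METRIC_WEIGHTS["completeness"] * metrics["completeness"]
        + METRIC_WEIGHTS["duplicate_rate"] * (1 - metrics["duplicate_rate"])
        + METRIC_WEIGHTS["missing_email"] * (1 - metrics["missing_email"])
    )
    return {"period": period, **metrics, "composite": round(composite, 3)}

print(scorecard_row("2010-06", {
    "completeness": 0.92, "duplicate_rate": 0.04, "missing_email": 0.11,
}))
# {'period': '2010-06', 'completeness': 0.92, 'duplicate_rate': 0.04,
#  'missing_email': 0.11, 'composite': 0.923}
```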

For want of an overarching data quality theory, we are limited for the moment to somewhat idiosyncratic measures of data quality. There is nothing wrong with that; it reflects the current state of affairs. There is no point in suggesting scientific precision when you don’t really know what you should be measuring with that precision.

This isn’t necessarily a bad thing; it matches the current maturity of our profession. Measurement without the appropriate model to govern observations merely allows extrapolation. So although one could measure consecutive months and reasonably infer the next (assuming linearity), it doesn’t help you imbue those observations with meaning.
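
To make the extrapolation point concrete, here is a toy sketch that fits a straight line through a few made-up monthly scores and infers the next month. The numbers mean nothing in themselves; the point is that, without a model, prediction is all the measurement buys you.

```python
# A toy illustration of extrapolation without a model: fit a straight line
# through a few made-up monthly scores and infer the next month, assuming
# (as the text notes) that the trend is linear.

def extrapolate_next(monthly_scores):
    """Least-squares fit of score = slope * month + intercept; predict next month."""
    n = len(monthly_scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_scores) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_scores))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope * n + intercept

print(extrapolate_next([0.80, 0.82, 0.85, 0.86]))  # roughly 0.885
```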

When possible, tying data quality errors to primary process breakdown is an elegant way to measure and monetize data quality. However, in many settings this is far from straightforward. In our knowledge economy, many people make data-based decisions on a repetitive basis, yet a rock-solid business case can still be hard to make.

Some decision outcomes can only be assigned arbitrary values. What is it worth to improve decision making in support of choices about administering life-saving drugs? What is the economic value of a human life? Steven Levitt wrote about this controversial topic in Freakonomics,5 but there are many other decisions where no undisputed relation to money seems possible.

Merely recording data quality on an ongoing basis has proven remarkably effective in raising awareness, and hence in improving data quality, without any further action being taken.

If you decide to build a data quality scorecard, ensure that improving scores coincides with an overall improvement in performance. To draw a parallel with consumer research: customer satisfaction is not a KPI. It is the result of doing things right for customers, and more particularly the things that are important to them.

Isolated data quality metrics in and of themselves don’t “automatically” lead to better performance. Smart professionals understand what gets measured, and what gets measured gets rewarded. Ensure these rewards are primary drivers for your business goals – achieving that kind of business alignment is easier said than done.


References:

  1. Larry English (1999), Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. ISBN# 0471253839
  2. Larry English (2009), Information Quality Applied: Best Practices for Improving Business Information, Processes and Systems. ISBN# 047013477X
  3. Thomas Herzog, Fritz Scheuren & William Winkler (2007), Data Quality and Record Linkage Techniques. ISBN# 0387695028
  4. Arkady Maydanchik (2007), Data Quality Assessment. ISBN# 9780977140022
  5. Steven Levitt & Stephen Dubner (2006), Freakonomics. ISBN# 0061234001
  6. Gerald Weinberg (1993), Software Quality Management, Volume 2 – First Order Measurement. ISBN# 0932633242


  • Tom Breur
    Tom Breur, Principal with XLNT Consulting, has a background in database management and market research. For the past 10 years, he has specialized in how companies can make better use of their data. He is an accomplished teacher at universities, MBA programs and for the Certified Business Intelligence Professional (CBIP) program. He is a regular keynoter at international conferences. Currently, he is a member of the editorial board of the Journal of Targeting, the Journal of Financial Services Management and Banking Review. He acts as an advisor for The Council of Financial Competition and the Business Banking Board and was cited among others in Harvard Management Update about state-of-the-art data analytics. His company, XLNT Consulting, helps companies align their IT resources with corporate strategy, or in plain English, he helps companies make more money with their data. For more information you can email him at tombreur@xlntconsulting.com or call +31646346875.

     
