It is currently the holiday break, which means two things: first, almost everybody is taking time off, and second, there is a little bit of breathing room to sit and ponder issues pushed into the background during the rest of the year. One of those issues has to do with data quality scorecards, data issue severity, and setting levels of acceptability for data quality scores.
Essentially, if you can formulate an assertion that describes your expectation for quality within one of the commonly used dimensions, then you are also likely to be able to define a rule that validates data against that assertion. A simple example: the last name field of a customer record must not be null. This assertion can be tested on a record-by-record basis, or I can even extract the entire set of violations from a database using a SQL query.
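To make this concrete, here is a minimal sketch in Python with SQLite showing both approaches; the customer table, its last_name column, and the crm.db file are hypothetical stand-ins rather than any real schema:

```python
import sqlite3

def violates_rule(record: dict) -> bool:
    """Record-by-record test of the assertion: last name must not be null or blank."""
    last_name = record.get("last_name")
    return last_name is None or last_name.strip() == ""

# Set-oriented alternative: pull every violating record with one SQL query.
conn = sqlite3.connect("crm.db")  # hypothetical database file
violations = conn.execute(
    "SELECT * FROM customer WHERE last_name IS NULL OR TRIM(last_name) = ''"
).fetchall()
total_records = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```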
Either way, I can get a score: perhaps a raw count of violations, or a ratio of violating records to the total number of records. There are certainly other approaches to formulating a "score," but this simple example is good enough for our question: how do you define a level of acceptability for this score?
The approach I have been considering compares the relative financial impact of the errors' occurrence against the various alternatives for addressing them. At one end of the spectrum, the data stewards can completely ignore the issue, allowing the organization to absorb the financial impacts. At the other end, the data stewards can invest in elaborate machinery that not only fixes the current problem but also ensures that it never happens again. Other alternatives fall somewhere between these two ends, but where?
To answer this question, let's consider the economics. Ignoring the problem means that some financial impact will be incurred, but there is no cost of remediation. The other end of the spectrum may involve a significant investment, yet may address issues that occur sporadically, if at all, so the remediation cost is high but the value may be low.
So let's consider one question and see if that helps. At some point, the cumulative cost of ignoring a recurring issue equals the cost of preventing the impact in the first place (either by monitoring for the error or preventing it altogether). We can define that as the tolerance point - any more instances of that issue suddenly make prevention worthwhile. And this establishes one level of acceptability - the maximum number of errors that can be ignored.
Calculating this point requires two data points: the business impact cost per error, and the cost of prevention. The rest is elementary arithmetic - multiply the per-error impact by the number of occurrences, subtract the prevention cost, and if you end up with a positive number, then it would have been worth preventing the errors. Put another way, the tolerance point is simply the prevention cost divided by the business impact cost per error.
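Here is the same arithmetic as a small Python sketch, assuming for simplicity that prevention is a one-time cost; the dollar figures are invented purely for illustration:

```python
# Illustrative break-even arithmetic; the dollar amounts are made up.
impact_per_error = 25.00     # business impact cost of one occurrence ($)
prevention_cost = 10_000.00  # one-time cost of monitoring/prevention ($)

# Tolerance point: the error count at which cumulative impact
# equals the cost of prevention.
tolerance_point = prevention_cost / impact_per_error  # 400 errors

def worth_preventing(observed_errors: int) -> bool:
    """Positive net means the errors cost more than prevention would have."""
    net = observed_errors * impact_per_error - prevention_cost
    return net > 0

print(f"Tolerance point: {tolerance_point:.0f} errors")
print(worth_preventing(350))  # False: cheaper to absorb the impact
print(worth_preventing(500))  # True: prevention would have paid for itself
```

A recurring monitoring cost, rather than a one-time investment, would shift the break-even point over time, which is part of what makes the modeling question interesting.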
My next pondering: how can you model this? More to follow...
Posted December 30, 2008 11:45 AM