Blog: David Loshin

David Loshin

Welcome to my BeyeNETWORK Blog. This is going to be the place for us to exchange thoughts, ideas and opinions on all aspects of the information quality and data integration world. I intend this to be a forum for discussing changes in the industry, as well as how external forces influence the way we treat our information asset. The value of the blog will be greatly enhanced by your participation! I intend to introduce controversial topics here, and I fully expect that reader input will "spice it up." Here we will share ideas, vendor and client updates, problems, questions and, most importantly, your reactions. So keep coming back each week to see what is new on our Blog!

About the author

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

Recently in Data Quality Category

It is currently the holiday break, which means two things. First, almost everybody is taking time off, which means that (second) there is a little bit of breathing room for us to sit and ponder issues pushed into the background during the rest of the year. One of those items has to do with data quality scorecards, data issue severity, and setting levels of acceptability for data quality scores.

Essentially, if you can determine some assertion that describes your expectation for quality within one of the commonly used dimensions, then you are also likely to be able to define a rule that can validate data against that assertion. Simple example: the last name field of a customer record may not be null. This assertion can be tested on a record-by-record basis, or I can even extract the entire set of violations from a database using a SQL query.

Either way, I can get a score, perhaps either a raw count of violations, or a ratio of violating records to the total number of records; there are certainly other approaches to formulating a "score," but this simple example is good enough for our question: how do you define a level of acceptability for this score?
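As a quick illustration - a minimal sketch only, with a hypothetical database, table, and column - the violation set, the raw count, and the ratio can all come out of one query:

```python
import sqlite3

# Hypothetical customer database; in practice this would be the production data store.
conn = sqlite3.connect("customers.db")
cur = conn.cursor()

# Count records violating the assertion "the last name field may not be null,"
# along with the total number of records.
cur.execute("""
    SELECT
        SUM(CASE WHEN last_name IS NULL OR TRIM(last_name) = '' THEN 1 ELSE 0 END),
        COUNT(*)
    FROM customer
""")
violations, total = cur.fetchone()
violations = violations or 0  # SUM returns NULL when the table is empty

raw_count_score = violations                        # score as a raw count of violations
ratio_score = violations / total if total else 0.0  # score as a ratio of violating records

print(f"{violations} of {total} records violate the rule ({ratio_score:.2%})")
```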

The approach I have been considering compares the financial impact associated with the occurrence of the error(s) against the cost of the various alternatives for addressing them. At one end of the spectrum, the data stewards can completely ignore the issue, allowing the organization to absorb the financial impacts. At the other end of the spectrum, the data stewards can invest in elaborate machinery to not only fix the current problem, but also ensure that it will never happen again. Other alternatives fall somewhere between these two ends, but where?

To answer this question, let's consider the economics. Ignoring the problem means that some financial impact will be incurred, but there is no cost of remediation. The other end of the spectrum may involve a significant investment, but may address issues that occur sporadically, if at all, so the remediation cost is high but the value may be low.

So let's consider one question and see if that helps. At some point, the costs associated with ignoring a recurring issue equal the cost of preventing the impact in the first place (either by monitoring for the error or preventing it altogether). We can define that as the tolerance point - any more instances of that issue suddenly make prevention worthwhile. And this establishes one level of acceptability - the maximum number of errors that can be ignored.

Calculating this point requires two data points: the business impact cost per error and the cost of prevention. The rest is elementary arithmetic - multiply the impact per error by the number of errors observed, subtract the prevention cost, and if you end up with a positive number, then it would have been worth preventing the errors.
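To make that arithmetic concrete, here is a minimal sketch; the impact and prevention figures are invented purely for illustration:

```python
# A minimal sketch of the break-even ("tolerance point") arithmetic.
# The figures below are hypothetical, chosen only to illustrate the calculation.
impact_per_error = 25.0    # estimated business impact cost of each error occurrence
prevention_cost = 10000.0  # cost of monitoring for the error or preventing it altogether

# Tolerance point: the number of errors at which ignoring the issue
# costs exactly as much as preventing it.
tolerance_point = prevention_cost / impact_per_error  # 400 errors

def worth_preventing(observed_errors: int) -> bool:
    """True if the cumulative impact of the observed errors exceeds the prevention cost."""
    return observed_errors * impact_per_error - prevention_cost > 0

print(tolerance_point)        # 400.0
print(worth_preventing(300))  # False - cheaper to absorb the impact
print(worth_preventing(500))  # True - prevention would have paid for itself
```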

My next pondering: how can you model this? More to follow...


Posted December 30, 2008 11:45 AM
Permalink | No Comments |

One good thing about being busy is that you get opportunities to streamline ideas through iteration. My interest in data profiling goes pretty far back, and the profiling process is one that is useful in a number of different usage scenarios. One of these is data quality assessment, especially in situations where not much is known about the data; profiling provides some insight into basic issues with the data.

But in situations where there is some business context regarding the data under consideration, undirected data profiling may not provide the level of focus that is needed. Providing reports on numerous nulls, outliers, duplicates, etc. may be overkill when the analyst already knows which data elements are relevant and which ones are not. In these kinds of situations, the analyst can instead concentrate on the statistical details associated with the critical data elements as a way to evaluate the extent to which data anomalies might impact the business.

So in some recent client interactions, instead of just throwing the data into the profiler and hoping that something good comes out, we narrowed the focus to just a handful of data elements and increased the scrutiny on the profiler results - sometimes refining the data sets, pulling different samples, segmenting the data to be profiled, or joining different data sets prior to profiling - all as a way to get more insight into the data, instead of the typical reports telling me that yet another irrelevant data element is 99% null. The upshot is that a carefully planned, directed profiling process gave much more interesting results, both for us and for the client.
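For what it's worth, here is a minimal sketch of what that kind of directed profiling can look like (using pandas, with a hypothetical file and hypothetical critical data elements); the point is simply that the statistics are computed only for the elements the business actually cares about:

```python
import pandas as pd

# Hypothetical extract of the data set under review.
df = pd.read_csv("claims_extract.csv")

# Directed profiling: restrict attention to the critical data elements the
# analyst already knows are relevant, rather than profiling every column.
critical_columns = ["member_id", "procedure_code", "service_date"]

for col in critical_columns:
    series = df[col]
    null_rate = series.isna().mean()
    distinct_count = series.nunique(dropna=True)
    top_values = series.value_counts(dropna=True).head(3).to_dict()
    print(f"{col}: {null_rate:.1%} null, {distinct_count} distinct, top values {top_values}")
```

The same loop can be rerun over refined samples or segments of the data to see whether the anomalies persist.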


Posted December 23, 2008 1:51 PM
Permalink | No Comments |

I just got back from a few days at the DataFlux IDEAS 2008 Users conference, and it looks like there are some interesting things going on in Cary.

First off, I was invited to provide a tutorial on Data Governance on Monday afternoon, and there seems to be (as expected) a growing interest in operationalizing the data stewardship roles and in monitoring not just the quality of the data but also the performance of the data stewardship activities themselves. The ability to define and execute against data quality service level agreements is an aspect of oversight that is gaining momentum.

These ideas were validated at yesterday's keynote talk by Ted Friedman from Gartner. His data management "hype curve" suggests the ascendancy of two data management activities that I had covered: data quality dashboards and scorecards, and active metadata driving business processes.

Another interesting announcement was that SAS has opted to transition its data integration technology (and a significantly sized support team) to DataFlux under the project name "Project Unity," unifying the data integration and data quality/governance offerings. This probably enhances DataFlux's ability to compete against those data integration vendors that have acquired data quality technologies.

There were lots of good customer case studies as well, which, in comparison to last year's set, seem to show maturation in the customer community's approaches to data quality management. Good show!


Posted October 8, 2008 10:21 AM
Permalink | No Comments |

Informatica's IAP presentation focused on evolving its data quality technology, along with the capabilities obtained from the Itemfield acquisition, into an "extra-enterprise" data governance and process orchestration offering.

A welcome and interesting trend is the introduction of business process modeling into the data management operations silo.


Posted July 2, 2008 9:41 AM
Permalink | No Comments |

I got a postcard from Verizon today. It said:

"We recently sent you a letter in which we advertised a Verizon bundle package of Verizon FiOS Internet and Verizon FiOS TV service. This letter was mailed by mistake and the services described in the letter have never been offered by Verizon under those terms.

We apologize for this error.

Verizon Consumer Marketing"

OK, seriously, I am finding it hard to get my head around this. The offer came in one of those pseudo-overnight envelopes that marketers often use to make their letter seem more credible - you know, cardboard weight with a zip-pull - not cheap. So this company:

- Drafts a marketing letter,
- Prints tens of thousands of copies,
- Custom prints tens of thousands of fancy cardboard envelopes,
- Puts them into fancy cardboard envelopes, and
- Mails them.

Actually, I am guessing about the number - it could be orders of magnitude greater, for all I know.

I find it hard to believe that the internal governance and control over marketing would not have stopped the process after the marketing letter had been drafted if it contained erroneous information, so I am curious about what really happened. After all, having sent out the previous letter, the company actually did offer the services under those terms, but perhaps, due to some error, was not prepared to honor that offer.

In any event, my guess would be that there were some significant negative business impacts related to this bundle blunder - actual hard costs for materials and postage, as well as softer costs relating to organizational trust.


Posted January 24, 2008 1:35 PM
Permalink | No Comments |

This past week, 60 Minutes had a story on the (lack of) quality of the data on the national "No-Fly" list. Apparently (just as all of you thoughtful readers should have already expected), the list is rampant with names of dead people (including 14 of the 19 9/11 hijackers) and people unlikely to be traveling (e.g., Saddam Hussein and convicted and jailed terrorist Zacarias Moussaoui). In addition, the limited identifying information on the list causes increased scrutiny of travelers whose identities are falsely matched to names on the list.

Actually, the deficiencies in the quality of the data managed by the Terrorist Screening Center had already been discussed in a report issued by the Department of Justice over a year ago, which noted that a "major Quality Assurance effort" was under way to "ensure that records of highest priority for correction are addressed by a record-by-record search."


Posted October 10, 2006 9:18 AM
Permalink | 1 Comment |

Yet another article on the impacts of poor data quality in healthcare. Apparently, medication errors (e.g., incorrectly transcribed prescriptions) kill 7,000 people and conservatively incur costs of $3.5 billion per year.


Posted July 21, 2006 2:12 PM
Permalink | 1 Comment |

I was just emailed a press release telling me that Sunopsis, a data integration tools vendor, is now partnering with Trillium to provide data quality tools integrated with its data integration suite. This is a nice development considering that Sunopsis had entered into an agreement with Similarity Systems not too long before Similarity was acquired by Informatica, effectively quashing the Sunopsis deal.


Posted June 13, 2006 7:53 AM
Permalink | 3 Comments |

Last week Gartner released its "Magic Quadrant" for Data Quality Tools vendors, as reported in this news item. I wonder, though, with all the consolidation going on, and the focus on value-added applications that need to embed data quality technology (e.g., MDM, CDI, CRM, SCM, and other three-letter acronyms), whether the concept of "data quality" tools may soon be outdated. If data quality is imperative to the success of any data-oriented business application, then quality concepts must be architected into the fabric of the application development environment.

My prediction: infrastructure companies (RDBMS, Enterprise Architecture, Data Modeling tools, Application Development, Metadata Management Repositories, etc.) will soon be incorporating parsing, standardization, and linkage as part of their offerings.


Posted May 4, 2006 7:05 AM
Permalink | 3 Comments |

A recent study suggests that the costs of poor data quality to Dutch business exceed €400 million yearly. According to the article, a survey of 20,000 Dutch organizations found that "the total amount of €400 million consists of costs that are calculated based on directly quantifiable aspects, such as wrongly addressed invoices and product deliveries which do not arrive at the right addresses."


Posted April 18, 2006 6:53 AM
Permalink | 4 Comments |