Organizations doing data profiling for the first time are often shocked when they discover the true quality of the data in their day-to-day operational systems. They discover the ugly truth: their databases are full of bad data. So what exactly do we mean by bad data? Some data is missing. Some data is out of its domain range: a living person is found to be 561 years old. Numeric fields occasionally hold character values. And so forth.
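The three kinds of bad data just named can be counted with a very small profiling pass. The following is a minimal sketch, not a real profiling tool: the sample rows, the field names (name, birth_year, balance), and the 120-year plausibility limit are all assumptions made up for illustration.

```python
# Hypothetical sample rows; the field names are assumptions for illustration.
ROWS = [
    {"name": "Ann", "birth_year": 1980, "balance": "120.50"},
    {"name": "Bob", "birth_year": 1464, "balance": "75.00"},   # implausible age
    {"name": "", "birth_year": 1975, "balance": "12x.00"},     # missing name, non-numeric balance
    {"name": "Dee", "birth_year": None, "balance": "40"},      # missing birth year
]

MAX_AGE = 120  # assumed upper limit for the age of a living person


def is_number(text):
    """Return True if text can be parsed as a number."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def profile(rows, current_year):
    """Count rows with missing values, out-of-range ages, and non-numeric balances."""
    findings = {"missing": 0, "out_of_range": 0, "non_numeric": 0}
    for row in rows:
        # Missing: any field that is None or an empty string.
        if any(value is None or value == "" for value in row.values()):
            findings["missing"] += 1
        # Out of domain range: an age below 0 or above the assumed limit.
        year = row["birth_year"]
        if year is not None and not (0 <= current_year - year <= MAX_AGE):
            findings["out_of_range"] += 1
        # Character values in a field that should be numeric.
        if not is_number(row["balance"]):
            findings["non_numeric"] += 1
    return findings


print(profile(ROWS, current_year=2025))
# {'missing': 2, 'out_of_range': 1, 'non_numeric': 1}
```

With current_year set to 2025, the row with birth year 1464 is the 561-year-old from the example above. A pass like this only reports that bad data exists; it says nothing about how it got there.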
There is an interesting question related to all this bad data. That question is: How did the data get to be bad in the first place?
In fact, there are most likely many reasons why bad data gets into systems. Some of these reasons are:
Data is incorrectly edited by the program capturing the data. For example, suppose genders were being entered. The three allowable values for gender are M for male, F for female, and U for unknown. One day, the system is shipped to Mexico and the users enter “C” for caballero and “M” for mujer. This makes perfect sense to the Spanish-speaking end user. But if “C” shows up in a database used by English-speaking users, they aren’t going to recognize “C” as a valid gender code. And worse, English-speaking users read “M” as male while Spanish-speaking users read “M” as female. Oops!
Over time, the meaning of data values changes. In 1995, a system in France collected monetary data in francs. Then the change to the euro came, and now the system collects the same data in euros. Everything is fine until data from 1995 is added to (or compared with) data from 2005. The results of such comparisons are very misleading.
Inattention to data collected by transaction processing users can also be an issue. Suppose the department of motor vehicles has a system about who has a valid driver's license. While motor vehicle data is being collected, blood type data is also collected. One day, a person enters “T” for blood type. The problem is that “T” is not a valid blood type; but the clerk at the motor vehicles department does not catch the error because blood type is not relevant to the validity of a driver's license.
Systems are often merged together producing invalid data. Suppose two human resource systems are merged together. System A has a 9-digit key for a person. System B has a 7-digit key for a person. The data is easy enough to merge. The 7-digit identifier is merely read into a 9-digit format where filler is used to pad the seven digits. The problem is that there is overlap between these files. Some people end up having two identifiers – a 7-digit identifier and a 9-digit identifier. Merely merging the files has produced nothing but a mess.
There is no organizational standard to begin with. Suppose that management requires each department to submit a budget. The engineering department submits a budget based on the salaries of its engineers. The accounting department submits a budget based on the fully loaded salaries of its employees. The fully loaded salary includes medical benefits, 401(k) contributions, stock plan benefits, overtime, and a variety of other add-ons. When these numbers arrive at management’s desk, it appears that people in accounting make a lot more money than people in engineering.
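Of the causes above, the merged human resource systems are the easiest to show in code. The sketch below pads System B's 7-digit keys to 9 digits and then lists every person who ends up with more than one identifier. It rests on assumptions that are not in the scenario itself: the filler is a leading zero, a person can be matched across systems on name and birth date, and the sample records are invented.

```python
from collections import defaultdict


def pad_key(key, width=9):
    """Left-pad a key with zeros to the target width (zero filler is an assumption)."""
    return key.zfill(width)


def find_double_identified(system_a, system_b):
    """Return people who hold more than one key after the merge.

    Matching people on (name, birth_date) is an assumption for illustration;
    a real merge would need a more reliable matching rule.
    """
    keys_by_person = defaultdict(set)
    for key, name, birth_date in system_a:
        keys_by_person[(name, birth_date)].add(key)
    for key, name, birth_date in system_b:
        keys_by_person[(name, birth_date)].add(pad_key(key))
    return {
        person: sorted(keys)
        for person, keys in keys_by_person.items()
        if len(keys) > 1
    }


# Hypothetical records: (key, name, birth_date).
SYSTEM_A = [
    ("123456789", "Ana Ruiz", "1970-03-14"),
    ("987654321", "Li Wei", "1985-11-02"),
]
SYSTEM_B = [
    ("7654321", "Ana Ruiz", "1970-03-14"),  # same person as in System A
    ("1112223", "Sam Poe", "1990-06-30"),
]

print(find_double_identified(SYSTEM_A, SYSTEM_B))
# {('Ana Ruiz', '1970-03-14'): ['007654321', '123456789']}
```

Padding the key makes the two files fit one format, but the overlap only shows up once records are compared on something other than the key.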
And these reasons for the poor quality of data are only the tip of the iceberg.
It seems to me that in the push for higher data quality, the emphasis is on finding and correcting the bad data after the fact. Now there is nothing wrong with trying to make data better and more usable. But trying to correct data without finding out why the data was bad in the first place seems to be like plugging the North Sea dike with your fingers. God help you when the 11th leak in the dike occurs.
SOURCE: How Did Bad Data Get Into the System?
Recent articles by Bill Inmon