Originally published 20 January 2010
Missing data is a fact of life. And sometimes it causes serious disruption. Fortunately, there are often quite acceptable ways of dealing with missing data. The key lies in accurately discerning different flavors of “missing” that may exist and dealing with each kind accordingly.
In this article, I will present a simple framework to deal with missing data - a simple 2 x 2 matrix. One axis has voluntary versus involuntary registration of data. This typically equates to, for instance, survey data versus electronic recording of activity as in a database (or data warehouse). On the other, axis we have rightfully or wrongful missing data. Is a field “allowed” to be missing? Based on these two criteria you decide how to deal with missing data.
Because missing data can be so problematic to deal with, we’d really like to avoid this problem altogether. How can we do that? And even if some missings remain inevitable, how can we mitigate disruptions to analysis and reporting?
-----
Related content from the BeyeNETWORK
Could you increase customer retention if you had clean data? Answer that question with the free Dirty Data Challenge offer.
Learn more about this no cost, no obligation offer.
-----
A “missing” or “null” field can have several meanings. For the purpose of this paper we’ll define “null” as no value in a text field (not even an empty “space”) and either no value or a designated value1 (like the smallest possible integer) in a numeric field.
For now, I make a gross distinction between missings in surveys and missings in a database context. At a more philosophical level, this represents whether some voluntary action was required by subjects. Sometimes people need to actively “do” something, as in surveys, and sometimes they don’t, as in data warehouse extraction, transform and load (ETL).
In a survey context, missing could mean that a respondent has indicated a certain question was not applicable for him (or her). But of course, it could also mean they failed to answer the question. And it could mean they might or might not have answered the question, but the data entry typist missed their response.
In a database context, missing can likewise have multiple meanings. In his excellent treatment, Pyle (1999)2 distinguishes between “missing” and “empty” values. Missing values are nulls for which no value has been entered in the database but for which an actual value exists. For empty values, no real-world value can be supposed.
The distinction between “rightful” and “wrongful” missings is crucial. In a database context it can be tempting to ignore this distinction. When looking at a marketing database, you could say that a customer’s balance for a savings product she doesn’t own is “zero.” And in a way this is true. But it is a poor representation of reality. It is much more appropriate to represent this value as “empty,” using Pyle’s terminology.
We have found, time and time again, that ignoring these distinctions can wreak havoc for your predictive models. Also, reporting can become awkward or downright erroneous (see Thomsen, 2002)3. The way OLAP tools deal with these distinctions differs greatly among software providers and is far from trivial. The short message is: do not assume your tool will calculate derived measures accurately without verifying!
Another example might be derived variables that require some “minimal” amount of data to be calculated. Let’s suppose you want to calculate an average savings balance over the past 12 months. This value requires at least 12 months’ worth of history. If a customer has been with the bank less than 12 months, no value can be calculated. It could be imputed, though, if the analytic purpose warranted this. Of course this is not the same as the average of the months for which a balance was available.
Some purists might consider these distinctions flaws in data modeling, and I don’t object. What is important is that data is and remains a truthful representation of reality and that derived calculations which are based on these numbers (containing sometimes inevitable missings) remain accurate and auditable.
The way you deal with a missing value depends on the “flavor” of missing you are facing as well as on the analytic requirements you are negotiating. There are preventive and contingent actions. Sometimes you can avoid or minimize missings (preventive action). And after you are “stuck” with missings, the options at your disposal to deal with them (contingent action), depend on the context in which the data was acquired. And on how you plan to use (or analyze) the data.
A value that isn’t there, simply isn’t there. Don’t ever forget that. Although we may come up with elegant replacement algorithms, that’s still what they are: an artificial way to “guess” what the true value most likely was. There is no substitute for capturing data right in the first place. No algorithm, no matter how smart, will ever be as good as capturing data correctly at the outset. As is so often the case, an ounce of prevention may well be worth a pound of cure.
For surveys or data, you prevent missings by good questionnaire or form design. Much has been written about questionnaire design, in particular in the scientific community. Graham Rhind has written an excellent eBook on form design, more specifically internet forms. His treatment of, in particular, name and address data is simply outstanding. It is available for free on his site (www.grcdi.nl/book4.htm).
After you have done everything you can to enable your survey respondents or customers to provide the information you are seeking, apply a bespoke data capturing mechanism. For surveys this should include a dedicated code (unique!) to indicate the respondent didn’t provide an answer. For handwritten materials you may even want to use a separate code for entries that were too difficult to decipher (if at all).
When I was conducting a large manual reentry data quality project, I found that many entries that were changed, and during a subsequent pass4 changed again, were caused by sloppy handwriting (and/or ambiguous attempts to correct an entry). “Can’t read” is something different from “did not provide an answer.”
One of the reasons to break these different sources of “missing” into subcategories is also to distinguish between “respondent did not provide an answer” and data entry staff that missed a particular entry which then might still show up as a null value.
For databases (or data warehouses) one can avoid missings by carefully and judiciously designing your ETL. Much has been written about that already (e.g., Kimball and Caserta, 2004).5 You mitigate nasty effects of unavoidable missings by choosing an appropriate data modeling paradigm. It’s exactly in this area where a Data Vault (Linstedt et al 2008)6 really shines as an enterprise data warehouse (EDW) model. The reason for this is threefold:
Using a data vault, very simple queries can generate detailed reports about (wrongful) missings, and these reports are easily specified per data source. Being able to identify per source where issues arise in the business has an extremely powerful signaling effect. You can’t manage what you don’t measure.
After you migrate data from your EDW to a data mart, and you choose to replace missings, the result using either a data vault architecture or a Kimball data warehouse will be more or less the same for end users. The difference lies in the fact that a Data Vault (intrinsically) preserves a history of missings. Even the history of business rules that identify missings remains available for analysis. This can provide further insight to drive process improvement.
What Kimball qualifies as an “audit dimension” has the same functional purpose as the identification mechanism for missings in a data vault. The only difference is that a Data Vault preserves the history of missings (and replacement rules), and you can always specify these per data source.
The case for replacing (or imputing) missing values applies solely to wrongful missings. If a value is “allowed” to be missing, you usually want to make this distinction. Unfortunately, very few tools can deal with a numeric column where some of the records have an alpha entry like “n/a.” Until the time where this becomes standard practice, we are stuck with appending an additional column (typically Boolean) to indicate whether values in the target column are rightfully (“empty”) or wrongfully missing.
Sometimes it makes sense for estimation purposes, to impute values. This will allow reports to be run across the entire database without any problem, even when some parts of the data may be spotty.
Another reason to impute missing fields is because some tools can only include records that are “complete.” Regression or neural network procedures, for instance, require that all fields in a record are non-null. In particular, if you have many columns, it would be a pity if you had to drop all records, even if only one single field were absent. Decision trees, on the other hand, can deal elegantly with missing fields.
The problem you face when imputing missings has to do with bias: in real life, it is exceedingly rare that missing data occurs truly at random. When you get lucky, there is no significant relation with a target variable you are estimating. But even if there is significant association between the pattern in missings and the target value of your (predictive) model, you may still need to find an appropriate procedure for replacing missing values. Needless to say, your imputation procedure should not introduce bias (easier said than done).
Life isn’t perfect. Sometimes data is missing. There are many ways to prevent this from happening, which I have alluded to. Even so, install processes that mitigate the effects of missings, because your prevention may not be perfect.
Depending on the context, missing data may or may not be acceptable. When working with the data, it is crucial to understand which of these two situations applies.
Often you need to lump several different sources of missing field values together during analysis (or reporting). But make that transformation as late as possible. After you have mapped multiple values onto one, the distinction is lost forever. Analyzing different sources of missings can provide valuable information for process improvement (future enhancements). This holds equally for survey data and database missings.
When replacing missings, which is often necessary, it makes a lot of sense to indicate (with Boolean variables) which values are and are not allowed to be missing. Also, after replacement, you want to add an indicator that flags the records that actually were imputed. Since the objective of replacement is to improve the value of your data set as a whole, it is imperative the imputation algorithm does not add additional bias!
References:
Recent articles by Tom Breur
Comments
Want to post a comment? Login or become a member today!
Be the first to comment!