Many Faces of Missing Data

Originally published 20 January 2010

Missing data is a fact of life. And sometimes it causes serious disruption. Fortunately, there are often quite acceptable ways of dealing with missing data. The key lies in accurately discerning different flavors of “missing” that may exist and dealing with each kind accordingly.

In this article, I will present a simple framework for dealing with missing data: a 2 x 2 matrix. One axis distinguishes voluntary versus involuntary registration of data. This typically equates to, for instance, survey data versus electronic recording of activity in a database (or data warehouse). The other axis distinguishes rightfully versus wrongfully missing data: is a field “allowed” to be missing? Based on these two criteria, you decide how to deal with missing data.

Because missing data can be so problematic to deal with, we’d really like to avoid this problem altogether. How can we do that? And even if some missings remain inevitable, how can we mitigate disruptions to analysis and reporting?

What is “Missing,” Really?

A “missing” or “null” field can have several meanings. For the purpose of this article we’ll define “null” as no value in a text field (not even an empty “space”) and either no value or a designated value [1] (like the smallest possible integer) in a numeric field.
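To make this concrete, here is a minimal sketch (in Python with pandas; the sentinel value and column names are illustrative assumptions, not part of the original article) of how a designated “no value” number in a numeric field might be converted to a true null before analysis:

    import numpy as np
    import pandas as pd

    SENTINEL = -2147483648  # hypothetical designated value: the smallest 32-bit integer

    df = pd.DataFrame({
        "balance": [1200.50, SENTINEL, 87.10],  # numeric field using a sentinel for "no value"
        "comment": ["ok", None, "follow up"],   # text field: a true null, no value at all
    })

    # Convert the designated number into a true null (NaN) so that downstream
    # tools treat it as missing rather than as an extreme numeric value.
    df["balance"] = df["balance"].replace(SENTINEL, np.nan)

    print(df.isna().sum())  # nulls per column: balance 1, comment 1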

For now, I make a gross distinction between missings in surveys and missings in a database context. At a more philosophical level, this represents whether some voluntary action was required of subjects. Sometimes people need to actively “do” something, as in surveys, and sometimes they don’t, as in data warehouse extract, transform and load (ETL).

In a survey context, missing could mean that a respondent indicated a certain question was not applicable to him (or her). But of course, it could also mean the respondent failed to answer the question. And it could mean the respondent may or may not have answered the question, but the data entry typist missed the response.

In a database context, missing can likewise have multiple meanings. In his excellent treatment, Pyle (1999) [2] distinguishes between “missing” and “empty” values. Missing values are nulls for which no value has been entered in the database but for which an actual value exists. For empty values, no real-world value can be supposed.

The distinction between “rightful” and “wrongful” missings is crucial. In a database context it can be tempting to ignore this distinction. When looking at a marketing database, you could say that a customer’s balance for a savings product she doesn’t own is “zero.” And in a way this is true. But it is a poor representation of reality. It is much more appropriate to represent this value as “empty,” using Pyle’s terminology.

We have found, time and time again, that ignoring these distinctions can wreak havoc on your predictive models. Also, reporting can become awkward or downright erroneous (see Thomsen, 2002) [3]. The way OLAP tools deal with these distinctions differs greatly among software providers and is far from trivial. The short message is: do not assume your tool will calculate derived measures accurately without verifying!

Another example might be derived variables that require some “minimal” amount of data to be calculated. Let’s suppose you want to calculate an average savings balance over the past 12 months. This value requires at least 12 months’ worth of history. If a customer has been with the bank less than 12 months, no value can be calculated. It could be imputed, though, if the analytic purpose warranted this. Of course this is not the same as the average of the months for which a balance was available.
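As an illustration, here is a sketch (Python with pandas; the customer IDs and column names are made up for the example) of such a derived variable: the 12-month average is only computed when a full 12 months of history exists, and stays null otherwise:

    import pandas as pd

    monthly = pd.DataFrame({
        "customer_id": ["A"] * 12 + ["B"] * 5,  # customer A has 12 months of history, B only 5
        "balance": list(range(100, 1300, 100)) + [50, 60, 55, 70, 65],
    })

    def avg_12m(balances: pd.Series) -> float:
        # Require a full 12 months of history; otherwise the derived value stays missing.
        return balances.tail(12).mean() if len(balances) >= 12 else float("nan")

    avg_balance = monthly.groupby("customer_id")["balance"].apply(avg_12m)
    print(avg_balance)  # customer B gets NaN, not the average of the 5 months available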

Some purists might consider these distinctions flaws in data modeling, and I don’t object. What is important is that data is and remains a truthful representation of reality and that derived calculations which are based on these numbers (containing sometimes inevitable missings) remain accurate and auditable.

How to Deal with Missing Values

The way you deal with a missing value depends on the “flavor” of missing you are facing, as well as on the analytic requirements you are negotiating. There are preventive and contingent actions. Sometimes you can avoid or minimize missings (preventive action). And once you are “stuck” with missings, the options at your disposal to deal with them (contingent action) depend on the context in which the data was acquired, and on how you plan to use (or analyze) the data.

A value that isn’t there simply isn’t there. Don’t ever forget that. Although we may come up with elegant replacement algorithms, that’s still what they are: an artificial way to “guess” what the true value most likely was. No algorithm, no matter how smart, will ever be as good as capturing data correctly at the outset. As is so often the case, an ounce of prevention may well be worth a pound of cure.

Preventing Missings in Survey Data

For surveys and forms, you prevent missings through good questionnaire or form design. Much has been written about questionnaire design, in particular in the scientific community. Graham Rhind has written an excellent eBook on form design, more specifically internet forms. His treatment of name and address data in particular is simply outstanding. It is available for free on his site (www.grcdi.nl/book4.htm).

After you have done everything you can to enable your survey respondents or customers to provide the information you are seeking, apply a bespoke data capturing mechanism. For surveys this should include a dedicated (unique!) code to indicate the respondent didn’t provide an answer. For handwritten materials you may even want to use a separate code for entries that were too difficult to decipher (if decipherable at all).

When I was conducting a large manual reentry data quality project, I found that many entries that were changed, and then changed again during a subsequent pass [4], were caused by sloppy handwriting (and/or ambiguous attempts to correct an entry). “Can’t read” is something different from “did not provide an answer.”

Another reason to break these different sources of “missing” into subcategories is to distinguish between “respondent did not provide an answer” and an entry that data entry staff missed, which might then still show up as a null value.
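A minimal sketch of such a coding scheme follows; the specific codes and labels are illustrative assumptions, not a standard:

    # Hypothetical, unique codes for the different flavors of survey "missing."
    MISSING_CODES = {
        -97: "not applicable (respondent indicated N/A)",
        -98: "respondent did not provide an answer",
        -99: "answer present but too difficult to decipher",
        None: "entry missed during data entry (shows up as a null)",
    }

    def describe_missing(value):
        # Return the flavor of missing, or None if the value is a genuine response.
        return MISSING_CODES.get(value)

    print(describe_missing(-98))  # respondent did not provide an answer
    print(describe_missing(7))    # None -> a genuine response, not missing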

Preventing Missings in Databases

For databases (or data warehouses), you can avoid missings by carefully and judiciously designing your ETL. Much has been written about that already (e.g., Kimball and Caserta, 2004) [5]. You mitigate the nasty effects of unavoidable missings by choosing an appropriate data modeling paradigm. It is exactly in this area that a Data Vault (Linstedt et al., 2008) [6] really shines as an enterprise data warehouse (EDW) model. The reason for this is threefold:

  1. Many-to-many relations are universally applied throughout the model and, therefore, point to every (unexpected) missing.
  2. Registration of bad/missing data is preserved for eternity and, hence, remains available for (further) analysis.
  3. Reporting about missing data is extremely easy as this is an almost automatic “byproduct” of modeling itself.

Using a data vault, very simple queries can generate detailed reports about (wrongful) missings, and these reports are easily specified per data source. Being able to identify per source where issues arise in the business has an extremely powerful signaling effect. You can’t manage what you don’t measure.
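The Data Vault queries themselves would be written in SQL; purely as an illustration of the kind of per-source report meant here, the following sketch (Python with pandas; the source names and columns are made up) counts wrongful missings per data source:

    import pandas as pd

    loaded = pd.DataFrame({
        "source_system": ["CRM", "CRM", "Billing", "Billing", "Billing"],
        "birth_date": ["1970-02-01", None, "1985-07-12", None, None],
    })

    # In this example, birth_date is never "allowed" to be missing, so every null is wrongful.
    report = (
        loaded.assign(wrongful_missing=loaded["birth_date"].isna())
              .groupby("source_system")["wrongful_missing"]
              .agg(rows="count", missing="sum")
    )
    report["missing_pct"] = 100 * report["missing"] / report["rows"]
    print(report)  # per-source counts and percentage of wrongful missings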

After you migrate data from your EDW to a data mart, and you choose to replace missings, the result using either a data vault architecture or a Kimball data warehouse will be more or less the same for end users. The difference lies in the fact that a Data Vault (intrinsically) preserves a history of missings. Even the history of business rules that identify missings remains available for analysis. This can provide further insight to drive process improvement.

What Kimball qualifies as an “audit dimension” has the same functional purpose as the identification mechanism for missings in a data vault. The only difference is that a Data Vault preserves the history of missings (and replacement rules), and you can always specify these per data source.

Why Replace Missing Values?

The case for replacing (or imputing) missing values applies solely to wrongful missings. If a value is “allowed” to be missing, you usually want to preserve that distinction. Unfortunately, very few tools can deal with a numeric column where some of the records have an alpha entry like “n/a.” Until this becomes standard practice, we are stuck with appending an additional column (typically Boolean) to indicate whether values in the target column are rightfully (“empty”) or wrongfully missing.
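Here is a sketch of such an appended Boolean column, using the earlier savings balance example (Python with pandas; the column names are assumptions made for the illustration):

    import numpy as np
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "owns_savings": [True, False, True],
        "savings_balance": [2500.0, np.nan, np.nan],  # customer 3's null is wrongful
    })

    # True when the null is rightfully "empty" (no savings product owned);
    # a null that is not rightfully empty is, by elimination, wrongfully missing.
    customers["balance_rightfully_empty"] = (
        customers["savings_balance"].isna() & ~customers["owns_savings"]
    )
    print(customers)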

Sometimes it makes sense, for estimation purposes, to impute values. This will allow reports to be run across the entire database without any problem, even when some parts of the data may be spotty.

Another reason to impute missing fields is that some tools can only include records that are “complete.” Regression or neural network procedures, for instance, require that all fields in a record are non-null. Particularly when you have many columns, it would be a pity to drop entire records because a single field is absent. Decision trees, on the other hand, can deal elegantly with missing fields.

The problem you face when imputing missings has to do with bias: in real life, it is exceedingly rare that missing data occurs truly at random. If you are lucky, there is no significant relation between the pattern of missings and the target variable you are estimating. But even when there is a significant association between the pattern of missings and the target of your (predictive) model, you still need to find an appropriate procedure for replacing missing values. Needless to say, your imputation procedure should not introduce bias (easier said than done).
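The sketch below (Python with pandas; the income and churn columns are invented for the example) shows a naive median imputation that keeps a flag for the imputed records and crudely checks whether missingness is associated with the target:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "income": [30000, np.nan, 55000, np.nan, 42000, 61000],
        "churned": [1, 1, 0, 1, 0, 0],  # target variable
    })

    df["income_was_missing"] = df["income"].isna()             # flag the imputed records
    df["income"] = df["income"].fillna(df["income"].median())  # naive median imputation

    # Crude bias check: does the target rate differ between records with and
    # without a missing income? A large difference suggests missingness is not random.
    print(df.groupby("income_was_missing")["churned"].mean())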

Life isn’t perfect. Sometimes data is missing. There are many ways to prevent this from happening, which I have alluded to. Even so, install processes that mitigate the effects of missings, because your prevention may not be perfect.

Depending on the context, missing data may or may not be acceptable. When working with the data, it is crucial to understand which of these two situations applies.

Often you need to lump several different sources of missing field values together during analysis (or reporting). But make that transformation as late as possible. After you have mapped multiple values onto one, the distinction is lost forever. Analyzing different sources of missings can provide valuable information for process improvement (future enhancements). This holds equally for survey data and database missings.

When replacing missings, which is often necessary, it makes a lot of sense to indicate (with Boolean variables) which values are and are not allowed to be missing. Also, after replacement, you want to add an indicator that flags the records that actually were imputed. Since the objective of replacement is to improve the value of your data set as a whole, it is imperative that the imputation algorithm does not introduce additional bias!


References:

  1. The reason why some databases use such a designated number is purely technical: in the old days they just couldn’t “digest” true null values, so a proprietary number was used to represent “no value.”
  2. Dorian Pyle. Data Preparation for Data Mining. 1999. ISBN 1558605290.
  3. Eric Thomsen. OLAP Solutions, 2nd Edition. 2002. ISBN 0471400300.
  4. The reason we customarily perform multiple passes of manual data reentry is because we like to have an empirical estimate of the resulting data quality. Ironically, even after multiple passes through the same data, you still don’t arrive at 100% quality. You can get arbitrarily close, but not quite at 100%. It’s almost always valuable to know exactly how close to 100% you are getting.
  5. Ralph Kimball and Joe Caserta. The Data Warehouse ETL Toolkit. 2004. ISBN 0764567578.
  6. Dan Linstedt, Kent Graziano and Hans Hultgren. The New Business Supermodel – the Business of Data Vault Modeling. 2008. ISBN 9781435719149.
Tom Breur

Tom Breur, Principal with XLNT Consulting, has a background in database management and market research. For the past 10 years, he has specialized in how companies can make better use of their data. He is an accomplished teacher at universities, in MBA programs and for the Certified Business Intelligence Professional (CBIP) program, and a regular keynoter at international conferences. Currently, he is a member of the editorial board of the Journal of Targeting, the Journal of Financial Services Management and Banking Review. He acts as an advisor for The Council of Financial Competition and the Business Banking Board and was cited, among others, in Harvard Management Update about state-of-the-art data analytics. His company, XLNT Consulting, helps companies align their IT resources with corporate strategy, or in plain English, he helps companies make more money with their data. For more information you can email him at tombreur@xlntconsulting.com or call +31646346875.

     
