Data Quality: How to Detect Duplicate Records

Originally published 16 June 2010

Duplicate database entries are a pernicious problem, and they come in at least two flavors. In the first, the merging of customer records in a data warehouse is compromised: some assets end up in one record and the rest in another. Both records (pertaining to the same customer) then underrepresent that customer’s real value to the business, because each provides only a partial view. The other flavor, and the kind of duplicate this article deals with, occurs when two records represent essentially the same information, but due to some error that information has populated two (or more) distinct rows.

Now if duplicate records were exact duplicates, we wouldn’t have much of a problem. In reality, what we call “duplicate records” usually refers to the same entity in the real world that shows up in the database as two slightly different cases. Such records are suspiciously similar; determining which pairs are a “true” match and which are actually different records is the challenge addressed here.

Deduplication always involves multiple fields, because two people can have identical names (John Smith, say) and people in the same household share an address. There may be minor typographical errors in some fields, use of “special” characters, middle initials present or absent, or information missing altogether. All of this leads to minor differences between records that should nonetheless be identified as duplicates when they really are.

Large lists are often the result of merging multiple files, each with its own peculiarities. Some, for instance, may contain a telephone number and others may not. Some attributes are prone to change, like age or address, and the data might have been captured at different points in time. So the same customer could show up with two different addresses.

The way fields are captured may also differ across source lists. If first and last name are separate fields in one list but recorded together in another, you will often concatenate the two fields in the source that keeps them separate before merging. Some attributes may not have been recorded at all in certain sources, and those values will simply have to remain missing.

Typical problem fields are name and address, for two reasons. First, these fields play a crucial role in positively identifying matches. Second, they often contain multiple data elements: first and last name, initials, titles, and so on may all have been recorded together in a single “name” field. The same holds for address fields, which are notoriously difficult to standardize. These problems are exacerbated considerably for international lists.

Detecting Duplicates: The Classification Challenge

For the sake of argument, we can ignore cases where two records are completely identical: that scenario is easy, but unfortunately rare. A slightly less trivial problem arises when the database is indexed by a (supposedly) unique identifier, but a count distinct reveals that some IDs are not unique after all. Such duplicate IDs pose the same question this article addresses: are these records indeed matches, or are they different?
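
A quick way to surface such cases is to count how often each ID occurs. Here is a minimal sketch in Python; the record layout and field names are illustrative assumptions, not a prescribed schema:

```python
from collections import Counter

# Illustrative records; in practice these would be read from your database.
records = [
    {"id": "C001", "name": "John Smith", "city": "Boston"},
    {"id": "C002", "name": "Jane Doe",   "city": "Albany"},
    {"id": "C001", "name": "Jon Smith",  "city": "Boston"},  # same ID, slightly different name
]

# Count how often each (supposedly unique) ID occurs.
id_counts = Counter(r["id"] for r in records)

# IDs that occur more than once deserve closer inspection:
# are these true duplicates, or different customers sharing an ID by mistake?
suspect_ids = {i for i, n in id_counts.items() if n > 1}
print(suspect_ids)  # {'C001'}
```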

What we are interested in is records that are suspiciously similar. Barring a “golden” match field, namely the ID, you need to look at several fields, typically in conjunction with each other. That is the interesting challenge we’ll discuss in this article.

The task of determining match/no-match needs to be automated because manual checking is too expensive (and error prone) for most practical applications. Additionally, we’d like to get some measure of “success” in finding duplicates. Ideally you want the process to output empirical estimates of the number of type I and type II errors. That means that we get a reliable (hopefully unbiased) estimate of the number of false matches and the number of undetected duplicates remaining.

Obviously, when two records are not exactly identical, there is always a risk that the assessment of which records are duplicates turns out wrong. These are the classical type I and type II errors: either you decide that two records are duplicates when in reality they are not, or you fail to detect duplicates that should have been identified as such. Matching algorithms should ideally take both risks, and their associated costs, into account when assessing trade-offs.
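
As a minimal bookkeeping sketch, both error types can be tallied from a manually reviewed sample of candidate pairs; the input format here is an assumption, not a prescribed interface:

```python
def error_counts(reviewed):
    """
    Tally type I and type II errors from a manually reviewed sample.

    `reviewed` holds (called_duplicate, truly_duplicate) booleans, one per
    record pair that was checked by hand.
    """
    false_matches = sum(1 for called, truly in reviewed if called and not truly)  # type I
    missed_dupes = sum(1 for called, truly in reviewed if not called and truly)   # type II
    return false_matches, missed_dupes

# Illustrative usage with made-up review results:
sample = [(True, True), (True, False), (False, True), (False, False)]
print(error_counts(sample))  # (1, 1)
```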

Detecting Duplicates: The Computational Challenge

In a list with N rows, there are N(N – 1)/2 possible pairs of records to compare. Since this number grows quadratically with the size of the list, we are looking at a computational challenge. When you identify a likely duplicate, some manual process is often required to confirm a positive duplicate and decide on permanent removal of the entry. In order to manage these processes, efficiency and reliable estimates are key.

To deal with the computational complexity we partition the list into mutually exclusive and exhaustive blocks. The purpose of this partitioning is to concentrate duplicates within a block and, at the same time, decrease the number of record pairs that need to be compared. Of course the question then arises: how do you decide on “the best” blocking variables? What makes an attribute (or set of attributes) a good blocking variable? Bear in mind that the value of the blocking field itself could be wrong. Since we only compare records within the subset that shares an identical value for this field, any duplicate pair will go undetected if the blocking value is wrong (or inadvertently missing) for one of its records. Records that disagree on the blocking variable will always be classified as non-matches. Therefore, the first requirement for a potential blocking field is that it be accurate.
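
To make the idea concrete, here is a minimal blocking sketch in Python; the field name (“zip”) and the choice of key function are illustrative assumptions, and the actual pairwise comparison is left out:

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, blocking_key):
    """Partition records into mutually exclusive, exhaustive blocks by a key function."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def candidate_pairs(records, blocking_key):
    """Yield only the record pairs that share a blocking value."""
    for block in block_records(records, blocking_key).values():
        # Pairs are compared within a block only; records in different
        # blocks are implicitly classified as non-matches.
        yield from combinations(block, 2)

# Illustrative usage: block on ZIP code (assumed field name "zip").
# pairs = list(candidate_pairs(records, blocking_key=lambda r: r.get("zip")))
```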

A second requirement for a good blocking variable is that it should have a large number of possible values that, preferably, are fairly uniformly distributed as well. When a list has 10,000 records, there are 49,995,000 (10,000 × 9,999 / 2) comparisons to make. If half were men and half were women and we used gender as our blocking variable, there would still be 2 × 12,497,500 (5,000 × 4,999 / 2) = 24,995,000 comparisons to make. If these 10,000 records were evenly distributed across 50 states, there would be only 50 × 19,900 (200 × 199 / 2) = 995,000 comparisons to make. Even better would be a ZIP code or some other variable (telephone number, etc.) with many unique values.
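
The arithmetic behind these figures is simply the pairwise count N(N – 1)/2 summed over the blocks, as this small sketch reproduces:

```python
def n_pairs(n):
    """Number of distinct pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

print(n_pairs(10_000))     # 49,995,000 comparisons without blocking
print(2 * n_pairs(5_000))  # 24,995,000 when blocking on gender (two equal blocks)
print(50 * n_pairs(200))   #    995,000 when blocking on 50 equally sized states
```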

Phonetic coding systems derive a standardized code based on the way a name is pronounced. These are quite useful in dealing with spelling errors. One of their most powerful applications is as blocking fields. After a phonetic code has been generated, you still retain the original name field to verify (in conjunction with other fields) whether you have hit a true match.
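
As an illustration, here is a compact sketch of classic American Soundex, one of the oldest phonetic coding schemes; more refined codes exist (Metaphone, NYSIIS), but the blocking principle is the same:

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev_code = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":                 # H and W do not break a run of equal codes
            continue
        code = codes.get(ch, "")       # vowels map to "" and reset prev_code
        if code and code != prev_code:
            result += code
        prev_code = code
    return (result + "000")[:4]

print(soundex("Smith") == soundex("Smyth"))  # True: both code to 'S530'
```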

Detecting Duplicates: The Statistical Challenge

One practical way to estimate the total number of duplicates in a list is to employ capture-recapture techniques, essentially adopted from methods used to estimate wildlife populations. You capture some animals, earmark them before release, and on a second capture count the number of earmarked animals in that second sample. That number is then used to estimate the total population. By the same token, you draw a sample and count the number of duplicates. Then you draw another sample and identify duplicates again. By matching these two lists you get a count, so to speak, of the number of “earmarked animals.” The Lincoln-Petersen capture-recapture estimator then allows you to arrive at an estimate for the total number of duplicates in the list using the two samples.
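
In its simplest form the estimate is N̂ = (n1 × n2) / m, where n1 and n2 are the numbers of duplicates found in the two samples and m is the number found in both. A minimal sketch, with made-up numbers for illustration:

```python
def lincoln_petersen(n1, n2, m):
    """
    Estimate the total number of duplicates in a list from two samples.

    n1 -- duplicates found in the first sample
    n2 -- duplicates found in the second sample
    m  -- duplicates found in *both* samples (the "earmarked" ones)
    """
    if m == 0:
        raise ValueError("No overlap between the samples; the estimate is undefined.")
    return n1 * n2 / m

# Made-up numbers: 40 duplicates in the first sample, 50 in the second,
# of which 20 had already been found in the first.
print(lincoln_petersen(40, 50, 20))  # 100.0 estimated duplicates in total
```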

For cases where you have more than two samples, a log-linear model can be employed. Note that both the Lincoln-Petersen estimator and log-linear models assume independent samples. This can be a bit tricky, as even “innocent” combinations like gender and age can turn out to be correlated, for instance because women tend to live longer.

By employing an intelligent blocking strategy, you arrive at (much) higher proportions of duplicates than a random sample would provide. This is both very efficient in terms of resources and yields more reliable statistical estimates.

Iteratively Removing Duplicates

A benefit of using multiple iterations with several different blocking variables is that you give yourself the best possible chance to acquire empirical data on where “true” duplicates are being identified, and where “false” matches occur. This helps you direct your (manual) resources accordingly. With the Lincoln-Petersen formula, you get an estimate of the total number of duplicates in the list. Another benefit of iteratively removing duplicates is that, once you have this total, you can decide after each pass how much more effort you are willing to put in to find the last N remaining duplicates. That effort tends to go up considerably for the last few duplicates, and often at the risk of identifying false matches as true duplicates.

This is where the power of an iterative blocking strategy has a chance to shine. Each pass through the data, using different (sets of) blocking fields, allows you to focus on different fields to establish an automated definite match. Once duplicates are removed, you switch blocking strategies again. For each set of blocking fields, you employ different sets of matching fields to decide on a positive match (see next section). Needless to say, these choices are governed by the number of type I and type II errors encountered during (manual) checks.

A great place to start is the name field itself, used as a blocking variable. To cope with possible misspellings, you derive a phonetic code from the name and block on that. Because names take so many different values, this often provides an excellent, highly efficient first pass. Having caught a great many “easy” duplicates, you can then concentrate your effort on the “hard” records.
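
A minimal sketch of such an iterative strategy is shown below; the blocking keys, field names, and `is_match` rules are illustrative assumptions, and in practice each pass would feed its decisions through a (partly manual) review step:

```python
from collections import defaultdict

def dedupe_pass(records, blocking_key, is_match):
    """One pass: compare records within blocks and keep only the first of each match."""
    kept_by_block = defaultdict(list)
    kept, removed = [], []
    for rec in records:
        block = kept_by_block[blocking_key(rec)]
        if any(is_match(rec, other) for other in block):
            removed.append(rec)        # judged a duplicate of an earlier record
        else:
            block.append(rec)
            kept.append(rec)
    return kept, removed

# Illustrative multi-pass strategy: first block on a phonetic name code
# (e.g. the soundex() sketch above), then on ZIP code, then on phone number.
# The match_on_* functions stand in for hypothetical matching rules per pass.
# passes = [
#     (lambda r: soundex(r["last_name"]), match_on_name_and_address),
#     (lambda r: r.get("zip"),            match_on_name_and_phone),
#     (lambda r: r.get("phone"),          match_on_name_only),
# ]
# for blocking_key, is_match in passes:
#     records, removed = dedupe_pass(records, blocking_key, is_match)
```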

Which Variables to Use for Deduplicating?

Since most large lists are the result of merging multiple sources, it can be helpful to use the record source code. This carries meaning: it can tell you, for instance, which fields are expected to be missing, simply because they weren’t available in that particular list. But even when the source of a record is known, it is non-trivial to decide which fields must provide a positive match before two records are called a duplicate. Here the input from domain experts is of paramount importance (as is good metadata), since they are most likely familiar with the procedures used for collecting the data and compiling the lists.

The choice of variables for deduplicating is closely related to the blocking strategy. When you use more fields, you will find fewer matches. If you can, try to standardize address fields by parsing them into multiple columns, each with its own content. This will allow for a more flexible blocking strategy.
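
As an illustration of what such parsing might look like, here is a deliberately naive sketch that splits a simple US-style street address into separate columns; real address standardization, especially for international lists, requires dedicated parsers and reference data:

```python
import re

# Naive pattern for addresses like "123 N Main St Apt 4B"; purely illustrative.
ADDRESS_RE = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>.+?)(?:\s+(?:Apt|Unit|#)\s*(?P<unit>\S+))?\s*$",
    re.IGNORECASE,
)

def parse_address(raw):
    """Split a raw address string into house number, street, and unit columns."""
    match = ADDRESS_RE.match(raw)
    if not match:
        return {"number": None, "street": raw.strip(), "unit": None}
    return match.groupdict()

print(parse_address("123 N Main St Apt 4B"))
# {'number': '123', 'street': 'N Main St', 'unit': '4B'}
```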

Another tactic is to use string comparator metrics like Jaro or others (typically on a 0-1 scale) and to set a threshold above which the text field (usually name or address) is considered a match. If you combine name with many other fields, you can lower this threshold; if you combine name with only a few fields, you will need to be more restrictive in order to avoid calling two records a “match” when they really are distinct.
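
To illustrate, here is a compact sketch of the Jaro similarity together with an assumed threshold rule; production work would typically rely on a well-tested library implementation, or on the Jaro-Winkler variant that rewards matching prefixes:

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, on a 0-1 scale."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)

    # Count characters of s1 that also occur in s2 within the allowed window.
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count transpositions among the matched characters.
    s1_matched = [c for i, c in enumerate(s1) if matched1[i]]
    s2_matched = [c for j, c in enumerate(s2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(s1_matched, s2_matched)) // 2

    m = matches
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3

# Assumed decision rule: treat the name field as a match above a chosen threshold.
print(round(jaro("MARTHA", "MARHTA"), 3))  # 0.944
print(jaro("SMITH", "SMYTH") > 0.85)       # True -> name considered a match
```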

Managing (limiting) the number of duplicates in your lists is driven both by cost considerations (eliminating waste) and by reputation risk. Not only is it a shame to send the same (or worse: conflicting) message more than once to the same customer, it also doesn’t look very professional. In some cases, contacting the “wrong” customer can even have legal or compliance implications.

Detecting (and erasing) duplicate records is both art and science. The “art” has to do with relating your blocking strategy to your choice of matching fields. Another part of the fine art is carefully managing where you decide to employ your manual resources. To efficiently manage large lists, you need to minimize manual labor. You will want to employ resources to get an efficient assessment of the number of type I and type II classification errors, as these are key numbers for the business to know.

Because the classification problem grows quadratically with the number of records, there is a computational challenge you need to manage. The statistical challenge has to do with estimating the total number of duplicates in your list, and doing this accurately yet efficiently. This will allow you to guide your resource allocation, because as always, an 80/20 rule applies. The first duplicates are easiest to detect, and as you progress it becomes more difficult and more expensive.

An iterative approach to finding your duplicates works so well because it allows you to assess at each pass how many additional records you found, what the classification risks are (removing records that weren’t real duplicates), and how much it will cost to go after the next ones. Using a sound methodology and relying on statistics, it is possible to accompany your deduplication work with solid estimates of costs and benefits. Because, as interesting and fascinating as this work is, in the end all the business wants to know is: “what’s in it for me?” It is up to us as data quality professionals to answer that question with valid, accurate, business-driven assessments.
