Blog: David Loshin

David Loshin

Welcome to my BeyeNETWORK Blog. This is going to be the place for us to exchange thoughts, ideas and opinions on all aspects of the information quality and data integration world. I intend this to be a forum for discussing changes in the industry, as well as how external forces influence the way we treat our information asset. The value of the blog will be greatly enhanced by your participation! I intend to introduce controversial topics here, and I fully expect that reader input will "spice it up." Here we will share ideas, vendor and client updates, problems, questions and, most importantly, your reactions. So keep coming back each week to see what is new on our Blog!

About the author

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

August 2005 Archives

Believe it or not, even a savvy person like me is subject to fraud once in a while. Perusing our most recent credit card bill, I came across three charges that were clearly not ours - small charges made at gas stations in another state. When we called the credit card company, they determined that our card had been duplicated, since the fraudulent charges had been swiped through a card reader. Apparently, at some point our credit card data was double-swiped through a magnetic card reader and then transferred to a duplicate card.

The duplicate card was then used for small-ticket purchases at innocuous locations, intended to evade the bank's fraud detection algorithms. The pattern is that the fraudsters pilot a couple of small charges, and if the account holder doesn't shut off the card, much larger charges follow.

Once the bank was made aware of the situation, they immediately cancelled the card and removed the charges. When asked whether they would investigate the fraud, the customer service representative (CSR) said that the bank doesn't bother with these kinds of small amounts; it just writes them off.

Ever wonder how much money is lost due to small-scale fraud? The CSR told us that $20,000,000.00 is written off each quarter! I think, though, that it would be possible to use BI techniques to track down this illegal behavior...

The kinds of charges that appeared were interesting: $33.10, $45.00, and $70.00. Of the three charges, only one was not a round-dollar amount, and the second and third were made at the same location at almost the same time.

Back in April, I wrote an article for B-EYE-Network on the use of Benford's Law for Information Analysis. In that article, I described a digital analysis phenomenon regarding the distribution of numeric digits in large data sets. The truth is, Benford analysis has been used primarily as an auditing technique to look for fraudulent behavior, and I am confident that (with a little thought) a reasonable use of the technique could help in identifying transaction patterns involving different duplicated credit cards.
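
To make the idea concrete, here is a minimal sketch (in Python, with made-up transaction amounts and function names of my own) of how one might compare the leading digits of a batch of charges against the distribution Benford's Law predicts, using a simple chi-square statistic as a red flag. It is an illustration of the technique, not any particular fraud detection product.

import math
from collections import Counter

def benford_expected(digit):
    """Proportion of amounts expected to start with `digit` under Benford's Law."""
    return math.log10(1 + 1.0 / digit)

def leading_digit(amount):
    """First significant digit of a positive dollar amount."""
    digits = f"{amount:.2f}".replace(".", "").lstrip("0")
    return int(digits[0])

def benford_chi_square(amounts):
    """Chi-square statistic comparing observed leading digits to Benford expectations.

    A large value suggests the amounts deviate from the "natural" distribution
    and may deserve an auditor's closer look. (With only a handful of charges
    the statistic is merely suggestive; it becomes meaningful on larger samples.)
    """
    counts = Counter(leading_digit(a) for a in amounts if a > 0)
    total = sum(counts.values())
    chi2 = 0.0
    for d in range(1, 10):
        expected = total * benford_expected(d)
        observed = counts.get(d, 0)
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up batch of charge amounts, including the three from my statement.
charges = [33.10, 45.00, 70.00, 12.37, 19.99, 45.00, 70.00]
print(benford_chi_square(charges))

Grouping suspicious batches by merchant or geography would be the natural next step.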

Individuals are likely to repeat their bad behavior, and even if they think they are creating random sequences of dollar amounts, each sequence may carry a particular signature that identifies the perpetrator. Analyzing the geographic density of the illicit charge locations could then pinpoint a reasonable place to start tracking down the offenders.

Anyone out there with experience in fraud detection using Benford's Law? What do you guys think?


Posted August 26, 2005 2:29 PM
Permalink | 37 Comments |

Is it better to clean data on intake or after it has been processed?

Let's say you have a data entry process in which names and addresses are input into a system. At some point within your processing, that same data (name and address) will be forwarded to an application performing a business process, such as printing a shipping label. However, there is no guarantee that the individual whose name and address were input will ever be sent anything.

You want to maintain clean data, and you are now faced with two options: cleanse the data at intake, or cleanse it when it is used. There are arguments for each option...

On the one hand, a number of data quality experts advocate ensuring that the data is clean when it enters your system, which would support the decision to cleanse the data at the intake location.

On the other hand, since not all names and addresses input to the system are used, cleansing them may turn out to be additional work that was not necessary. Instead, cleansing on use would limit your work to what is needed by the business process.

Here is a hybrid idea: cleanse the data to determine its standard form, but don't actually modify the input data. The reason is that a variation in a name or address provides extra knowledge about the individual - perhaps a nickname, or a variation in spelling that may occur in other situations. Reducing each occurrence of a variation into a single form removes knowledge associated with potential aliases, which ultimately reduces your global knowledge of the individual. But if you can determine that the input data is just a variation of one (or more) records that you already know, storing the entered version linked to its cleansed form will provide greater knowledge moving forward.
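
To illustrate the hybrid approach, here is a minimal sketch in Python, with a stand-in standardization routine of my own invention (a real implementation would call out to a proper name and address cleansing service). It leaves the entered value untouched and simply links it to its cleansed form.

def standardize(raw_name):
    """Stand-in for a real name-standardization service: collapse whitespace and title-case."""
    return " ".join(raw_name.split()).title()

class AliasStore:
    """Keep raw input untouched; link each entered variant to its cleansed form."""

    def __init__(self):
        self._by_standard = {}   # cleansed form -> set of raw variants seen

    def ingest(self, raw_name):
        standard = standardize(raw_name)
        self._by_standard.setdefault(standard, set()).add(raw_name)
        return standard

store = AliasStore()
store.ingest("dave  loshin")
store.ingest("DAVE LOSHIN")
print(store._by_standard["Dave Loshin"])   # both raw variants preserved as aliases

The business process (printing the label, say) can use the standard form, while the aliases remain available for matching and analysis later.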


Posted August 23, 2005 6:22 AM
Permalink | No Comments |

I am confident that, when properly distilled out of the masses of data, the essence of knowledge lies embedded within an organization's metadata. In fact, conversations with clients often center on different aspects of what we call metadata, often buried within topics like "data dictionary," "data standards," and "XML" - yet the meaning of corporate knowledge always boils down to its metadata.

I have recently been involved in advising on the formation of a new professional organization, the Meta-Data Professional Organization, which focuses on establishing a community of practice for metadata practitioners.

The intention of the MPO is to be a primary resource for exchanging ideas and advice on best practices in metadata management and implementation. I hope that this organization will be the kind of group in which individuals share their knowledge and experience in a way that can benefit others, especially when it comes to some of the more challenging aspects of metadata, such as clearly articulating the business benefits of a metadata management program, assembling a believable business case, and developing a project plan for assessment and implementation.

If you check out the board members, you will probably see some names familiar to you from other venues, such as TDWI or DAMA.

If you have any interest in metadata, it would be worthwhile to consider how you could contribute to this organization!


Posted August 23, 2005 6:04 AM
Permalink | 1 Comment |

Do the structures described within XML schemas correspond to classes and objects described in Java or C++, or to entity relationship models? There seems to be a little bit of a debate on the topic. In practice, there does seem to be a correlation, which leads to the ability to automate the generation of Java classes that mimic XML schemas (see The Sun Java XML Binding Compiler for details).
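
As a rough illustration of that correspondence (sketched in Python rather than Java, and with a schema fragment I made up for the purpose), a simple complex type maps quite naturally onto a class with typed fields and a factory method - essentially what a binding compiler generates for you.

from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, paraphrased:
#   <customer> contains <name> (xs:string) and <accountId> (xs:int)

@dataclass
class Customer:
    """The sort of class a binding tool might generate for the element above."""
    name: str
    account_id: int

    @classmethod
    def from_xml(cls, xml_text):
        root = ET.fromstring(xml_text)
        return cls(name=root.findtext("name"),
                   account_id=int(root.findtext("accountId")))

doc = "<customer><name>Jane Doe</name><accountId>42</accountId></customer>"
print(Customer.from_xml(doc))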

On the other hand, the flexibility in defining schemas allows a clever (albeit, devious) practitioner to define structures that would challenge any object-oriented programmer.

I am currently looking at a project where we are reviewing how XML schemas can be defined in a way that eases the design of the supporting software. I have some definite ideas about this, but I'm interested in hearing some ideas from you readers. I will follow up on this topic, perhaps in an upcoming article.


Posted August 16, 2005 9:07 PM
Permalink | No Comments |

We hear a lot about open source software and its potential benefits to the marketplace. How about the concept of open source data? The idea is to create a repository of data that is readily available, can be configured for business benefit, and is collectively supported by a development community.

One place to start is with public data, such as what is available from the US Census Bureau.

Every 10 years, the US Census Bureau conducts a census, and as part of that process collects a huge amount of demographic data at a very granular geographic level. They then spend the next 5+ years analyzing the data and preparing it for release, while at the same time preparing for the next decennial census.

The problem is that sometimes, by the time the decennial data is released it no longer accurately reflects an area's demographics. For example, consider how rapidly real estate prices have risen in the past 5 years - yet 2005 home prices are not captured in Census 2000 data. Similarly, the Tiger/Line data that contains information about street addresses is occasionally updated, but new streets and subdivisions are constantly being built, so it is likely that there are omissions in the Census data set.

There are many other public domain, public records, or generally available data sets that are of great interest to the BI community. So here is the challenge: tell me how you feel about a project to take on a publicly available data set and create an "open source" approach to maintaining and presenting that data. One example might be taking the Census decennial data and formulating it into a relational data structure mapped across the geographic Tiger/Line data. Pose your ideas as comments to this entry...
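
As a sketch of what that might look like (Python and SQLite here, with table and column names I invented for illustration - the real Census summary files and Tiger/Line layouts are far richer), the decennial demographics could be keyed to the Tiger/Line geography through a shared tract identifier.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tiger_geography (
    tract_id TEXT PRIMARY KEY,   -- shared geographic identifier
    county   TEXT,
    state    TEXT
);
CREATE TABLE census_demographics (
    tract_id          TEXT REFERENCES tiger_geography(tract_id),
    total_population  INTEGER,
    median_home_value INTEGER
);
""")
conn.execute("INSERT INTO tiger_geography VALUES ('24031700101', 'Montgomery', 'MD')")
conn.execute("INSERT INTO census_demographics VALUES ('24031700101', 4200, 210000)")

# A community-maintained layer of newer estimates could sit alongside the
# decennial baseline without discarding the original data.
query = """
    SELECT g.state, g.county, d.total_population, d.median_home_value
    FROM tiger_geography g JOIN census_demographics d USING (tract_id)
"""
for row in conn.execute(query):
    print(row)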


Posted August 14, 2005 1:21 PM
Permalink | No Comments |

We are bound by the relationships we make (and keep, or ignore) over our lifetime. Today I had the occasion to review four different relationships, and it made me think not just about the existence of the link I have (or had) with each of these people, but, in the abstract, about the value of an established link within a business intelligence framework.

Why was I thinking about the business value of a link? Because a consequence of the convenience of the World Wide Web and free services such as Yahoo Groups is people's inadvertent willingness to trade away knowledge about their relationships.

So who were the four people that triggered this thought process? First, as I was exiting the DC Metro, I was passed by a person who reminded me of a childhood friend with whom I had briefly reestablished a connection a few years back when I tracked down his contact information via Google.

Second, I have been exchanging voice mails with my friend Greg Elin, who is a really bright guy. Greg and I worked together on a few projects, one involving a (now questionable) idea for a web-based service that archived banner ads.

Third, I got an email from a friend and data quality colleague who had been tasked with evolving a solution to a rather sticky (and most likely intractable) Customer Data Integration project.

Fourth, there is a mailing list for the employees of the company where I worked my first job, Compass (also known as Massachusetts Computer Associates). Recent emails on that list suggested migrating it to Yahoo Groups, and there seems to be some agreement on doing so.

But when you take a look at Yahoo's Privacy Policy, you will see that Yahoo collects information about its members' transactions and uses that information for research, personalization, targeting, and aggregation. One can extrapolate and assume that Yahoo tracks the relationships and micro-communities associated with the individuals who subscribe to a group (a weak link) or participate in it (a stronger link). Determining the strength of the various links, the "spheres of influence," and learning "who knows who" can add a significant amount of value to an ad targeting strategy.

For example, if a particular person clicks through a specific banner ad, one might assume that other individuals within that person's sphere of influence also have a predilection to respond to a similar ad. Of course, this can be used to refine ad targeting, increase response rates, and consequently increase the charge for ad placement.
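
Here is a minimal sketch (plain Python, with entirely made-up members, link types, and weights) of the kind of scoring I have in mind: weight each link by how it was formed, then rank a clicker's neighbors as candidates for the same ad.

# Hypothetical link weights: merely subscribing to the same group is a weak
# link; actually replying to someone is a stronger one.
LINK_WEIGHTS = {"subscribed_same_group": 0.2, "replied_to": 0.8}

# Edges observed from group activity: (member, other_member, link_type)
edges = [
    ("alice", "bob", "replied_to"),
    ("alice", "carol", "subscribed_same_group"),
    ("bob", "carol", "replied_to"),
]

def sphere_of_influence(member, edges):
    """Score each neighbor of `member` by the combined strength of their links."""
    scores = {}
    for a, b, link_type in edges:
        if member in (a, b):
            other = b if a == member else a
            scores[other] = scores.get(other, 0.0) + LINK_WEIGHTS[link_type]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# If "alice" clicks a banner ad, her strongest links become the next
# candidates for the same placement.
print(sphere_of_influence("alice", edges))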

Okay, so what does this have to do with my walk down memory lane today? The fact is that I can qualify each of those relationships with different kinds of attributes, and those attributes (and their value sets, magnitudes, etc.) might contribute even more to analyzing their BI value. More on this topic to follow...


Posted August 11, 2005 7:32 PM
Permalink | No Comments |

Occasionally, I attend meetings on behalf of one of my government clients. Today I was at one with a set of mixed topics, although the typical agendas focus on metadata. The usual meeting attendees are individuals involved in deploying metadata registries based on the ISO/IEC 11179 standard on Metadata Registries, but today's meeting featured a presentation by Mike Daconta, who is spearheading the effort to refine the Federal Enterprise Architecture's Data Reference Model.

What is interesting about the FEA DRM, as it is called, is that it is the last piece of the Federal Enterprise Architecture to be put in place. Confused that an enterprise architecture can be defined without focusing on data first? Join the crowd...

In fact, only recently has there been any real movement in the Data Reference Model arena, and what has been released so far is an XML model intended to represent a way to register data sets for the purposes of data exchange. Now, considering that Mr. Daconta comes from the Department of Homeland Security, it is not surprising that the desire to effectively share information among government agencies is helping to drive the effort forward. One consideration, though, is that the rest of the FEA focuses more on assessing and documenting how Federal agencies invest in technology infrastructure and how that infrastructure is put to use to improve the way those agencies manage their IT. Because of this, my perception is that there is a bit of a disconnect between the rest of the FEA and the Data Reference Model.

Luckily, our friends from the Federal Data Registry Users Group are pretty smart and were able to direct some good questions to those defining the DRM, and Mike Daconta spent some time today addressing them. The conclusion was very inspiring: there may be a good opportunity to inject some good ideas and guidance on the use and value of metadata into a process that affects most, if not all, agencies in the Federal Government.

For more information on the continuing saga of the FEA DRM, see the project's Wiki.


Posted August 9, 2005 7:22 PM
Permalink | 1 Comment |

I had a conversation the other day with a prospective client (the president of a company) where we discussed the value of embarking on a program to improve data analytics. I was actually impressed with the conversation - this gentleman had a background similar to mine (Computer Science degrees, time spent in the financial sector, entrepreneurial), and I could tell that he had a good understanding of what could be done when data is presented properly for analysis. Yet at the end of the conversation, I felt that we hadn't gotten anywhere...

Basically, the prospect already had a good grasp of data analytics; they already did some rudimentary analysis, segmenting their customer base qualitatively and using predictive analysis of customer/product lifecycles to indicate continued customer loyalty or attrition. They were interested in doing more of this kind of exploration, yet their staff is working to capacity, so it is the president who actually does all the analysis.

The real question turned out to be: what would his company's return on investment be were they to engage us to build an analytical platform? The impasse I saw was that I find it hard to project an ROI when I don't have much of an idea of what is buried within their data asset, nor can I predict what kinds of questions the organization needs answered, nor, most importantly, how the organization will make the answers actionable.

At the end of the conversation, we were both asking what the next steps would be. I proposed a longer conversation to determine business analysis criteria, and he demurred, citing a busy schedule and the fact that he was not exactly overwhelmed by the potential of what I was describing. In other words, because I could not project actionable knowledge with a calculable ROI before deploying the system, there was no business case to build the system.

This highlights a gap we often see in client BI engagements: the technology provider is building a capability that allows the client to ask the right questions, while the client expects the provider to provide the right answers. I believe that the most successful programs are the ones where expectations are "level-set": the providers and the clients team to determine how the right questions will improve the business, there is an understanding of how the technology enables asking the right questions, and the client has a plan to exploit actionable discoveries.


Posted August 7, 2005 7:49 PM
Permalink | 2 Comments |