Blog: David Loshin

David Loshin

Welcome to my BeyeNETWORK Blog. This is going to be the place for us to exchange thoughts, ideas and opinions on all aspects of the information quality and data integration world. I intend this to be a forum for discussing changes in the industry, as well as how external forces influence the way we treat our information asset. The value of the blog will be greatly enhanced by your participation! I intend to introduce controversial topics here, and I fully expect that reader input will "spice it up." Here we will share ideas, vendor and client updates, problems, questions and, most importantly, your reactions. So keep coming back each week to see what is new on our Blog!

About the author

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions, including information quality consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

Recently in Challenge Category

There is an oft-quoted statistic about the growth rate of data volumes that I wanted to use in some context, and I started searching for a source. I googled "data volumes" +"double every" to see what I could find and, to my surprise, got lots of hits, though it is difficult to pin down the exact parameters. Lots of folks are using the statistic:

"Data doubles every year"
"The amount of stored data from corporations nearly doubles every year"
"...the amount of data stored by businesses doubles every year to 18 months."
"In his book “Simplicity,” business management expert and author Bill Jensen indicates that the most conservative estimates show business information doubling every three years, while some estimates say data doubles every year. "
"Unstructured data doubles every three months"

I am still following links from the first page of results, and we are doubling our data every 3 to 18 months.
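To put those ranges in perspective, it may help to convert each doubling period into the annual growth factor it implies. A quick sketch (the periods are the ones quoted above; the math is just compounding):

```python
# Implied annual growth factor for a given doubling period:
# growth = 2 ** (12 / doubling_period_months)
for months in (3, 12, 18):
    annual_factor = 2 ** (12 / months)
    print(f"doubles every {months:>2} months -> {annual_factor:.1f}x per year")
```

So "doubles every three months" means sixteen-fold growth per year, while "doubles every 18 months" is less than 1.6x per year: a tenfold spread hiding inside one casually quoted statistic.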

"Reed's Law states that the volume of data doubles every 12 months. "

OK, so there is actually a law about it. Hold on a second: according to Wikipedia, this law is about the utility of (social) networks, so perhaps the law doesn't apply in all jurisdictions.

Anyway, these may all be references to a UC Berkeley study on the growth of data, which said that the amount of information stored on media such as hard disk drives doubled between 2000 and 2003.

So let's look at this a little more carefully - we have a scientific study that looks not at the creation of data, but rather the use of storage media to hold what is out there. And out there is a lot of stuff needing a lot of storage, like images, music, videos, etc. Things that contain information, yet from which it is still a challenge to extract data. Also, consider that for each thing out there, there are likely to be a lot of copies! I am sure that a scan of all the TiVos in the country would demonstrate that lots of people are still catching up on older episodes of 24 and American Idol.

I need to refine my question a little bit, then, but I am afraid it will be difficult to track down defensible sources for it. I am more interested in knowing about the growth rate for data that can be integrated into an actionable information environment. I may not care about the bits comprising that specific episode of 24 that is sitting on millions of DVRs, but as an advertiser, I might be interested in profiling which households have watched which episodes and at what kind of time shift.

Anyone have any ideas?

Posted January 23, 2008 10:48 AM
Permalink | No Comments |

I had a conversation the other day with one of my former colleagues, and I asked him his opinion about whether approximate matching and semantic techniques would be integrated into search engines. His response surprised me: he told me that he had read that over 90% of Google searches involve a single word, and that in the absence of more information, the engine didn't have that much to work with. Therefore, was it really worth it to add this increased functionality if, for the most part, it would add computation time but only benefit a small number of searchers?

That, of course, shocked me, but maybe it shouldn't have. I thought I was pretty good at googling, mostly because I was able to get pretty good results as a by-product of the feedback from each search. For example, you start with a phrase in quotes, and that may be sufficient. If not, you can scan the short results coming back to seek out better phrases to include (or exclude) from the search. Others are much more comprehensive in their searching, using qualifiers and key tokens to enhance their search (e.g., Johnny Long, who will be a keynote speaker at the upcoming Data Governance conference in San Francisco, at which I will also be speaking, by the way).

But perhaps the general computer user is not so sophisticated, and may need some suggestions. Anyone want to contribute their favorite search strategies?

Posted May 25, 2007 6:33 AM
Permalink | No Comments |

Even top management at open source BI companies seems to feel that the costs associated with deploying open source projects are roughly the same as going the traditional route. On the other hand, since the costs for deploying an open source solution are largely on the back-end (e.g., paying people to do things) instead of the front-end (e.g., software licensing fees), there might be a greater ability to start a BI project using open source tools than trying to justify the costs just to get to the starting gate.

However, in terms of innovation, open source projects often trail the traditional commercial tool vendors. Open source projects grow by community participation, in which lots of contributors make things happen, or through acquisition, in which components are added to the mix through negotiated deals. So while there are some benefits to starting with open source, I suspect the general process might be to migrate over time to a traditional commercial product.

Here is the challenge: I am interested in experiences using open source Business Intelligence software, good, bad, ugly, or beautiful. Feel free to email me or post directly to the blog. I am looking forward to some responses!

Posted November 15, 2006 10:34 AM
Permalink | 1 Comment |

At times, our consulting practice is faced with a conundrum: the evolution of certain technologies and practices for enhanced information exploitation suggests changing business operations in a way that might reduce, or even eliminate, some participants' roles. In other words, implementing technical changes to benefit the organization can simultaneously have a detrimental impact on individuals within the organization.

In terms of self-preservation, it is not in the best interests of these individuals to support new technical initiatives that might result in their own termination. Yet in order to do their job the right way, they are obliged to do what is right for the organization, right? This situation resembles the game theory concept of a zero-sum game, in which moves that benefit one player equally have a negative impact on another player.
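The zero-sum analogy can be made concrete with a tiny payoff matrix. In a zero-sum game, every cell of one player's payoff matrix is the negation of the other's, so the payoffs in each cell sum to zero. The numbers below are purely illustrative, not drawn from any real engagement:

```python
# A zero-sum game: player B's payoff matrix is the cell-by-cell
# negation of player A's, so every outcome sums to zero.
# Illustrative numbers only.
payoff_a = [[ 2, -1],
            [-3,  0]]
payoff_b = [[-a for a in row] for row in payoff_a]

def is_zero_sum(a, b):
    """True if every corresponding pair of payoffs sums to zero."""
    return all(x + y == 0
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

print(is_zero_sum(payoff_a, payoff_b))  # -> True
```

The point of the paragraph that follows is precisely to escape this structure: by evolving roles rather than eliminating them, the organizational "game" stops being zero-sum.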

The challenge, then, is to determine how to socialize the evolution of the program in a way that demonstrates how any individual impacts or displacements will be mitigated. For example, when suggesting an action whose side effects include the elimination of a specific person's role, seek ways to evolve that person's responsibilities to support the change process and the long-term maintenance of the technical evolution. Doing so will finesse the "zero-sum" situation and will provide new challenges for both staff training and organizational improvement.

Posted September 26, 2006 11:21 AM
Permalink | 2 Comments |

I have written extensively about the value of developing a data standards program as part of a data governance framework, and so far we have convinced a number of clients as well. In fact, Knowledge Integrity is looking for a motivated individual to join our Data Standards team at one of our client sites. Click here for more information.

Posted July 12, 2006 10:45 AM
Permalink | No Comments |

While I was doing some random web searching, I came across an interesting web page that provides some training on finding MP3s using Google. Not that I am suggesting that search engines be used for unacceptable behavior, but my curiosity is piqued by the more general concept of "getting around the rules," and how that concept relates to the more piquant topic of compliance.

There are two approaches to compliance: the first is doing what you need to do to comply; the second is seeing how much you can do to avoid being compliant.

Here is a quick, although probably dated example: During the 1980s and 1990s, police would set up speed traps employing radar systems to determine how fast cars were traveling. As the goal was to identify (and punish) drivers exceeding the posted speed limit, this reflects a simple model of compliance. Drivers who were inclined to speed could react in one of two ways. The first (for the "compliers") was to drive slower (become compliant). The second (for the "avoiders") was to purchase some technology (a radar detector) that would notify the driver when the radar monitoring was taking place and allow the driver to slow down during the monitoring phase, but then resume the noncompliant behavior when there was limited risk of being caught.

Do organizations opt for one or the other of these approaches? What is the risk/reward model? To look at our example, those who became compliant were penalized to some extent by having to reduce their speed and get to where they were going more slowly. There was some monetary investment on behalf of the avoiders (the cost of the radar detector), but otherwise they were rewarded for their noncompliance, since they still got to their destinations more quickly, with some limited risk of getting caught nonetheless.
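The risk/reward tradeoff can be sketched as a back-of-the-envelope expected-cost comparison. All of the numbers below are hypothetical, chosen only to show the structure of the decision, not to estimate real fines or probabilities:

```python
# Expected per-trip cost for the speed-trap example.
# Every figure here is a made-up illustration.
fine = 150.00         # cost of a ticket if caught
p_caught = 0.02       # avoider's chance of being caught on a given trip
detector_cost = 0.50  # radar detector price amortized per trip
time_value = 4.00     # value of the extra time a complier spends driving slower

complier_cost = time_value
avoider_cost = detector_cost + p_caught * fine

print(f"complier: ${complier_cost:.2f}/trip")
print(f"avoider:  ${avoider_cost:.2f}/trip")
```

With these particular numbers the avoider comes out ahead ($3.50 versus $4.00 per trip), which is exactly why avoidance persists; raise the fine or the detection probability enough and the comparison flips.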

Is it better to be a complier or an avoider? How does an organization determine its approach, and then communicate that approach to the individuals within the organization? And lastly, I wonder whether there is some middle ground between these two options. Any comments?

Posted May 8, 2006 10:33 AM
Permalink | 1 Comment |

This morning, I read a very interesting story about apparent interactions between government agencies and printer manufacturers that resulted in the embedding of encoded information on printed pages. This message, embedded as a series of yellow dots visible only with a magnifying glass and blue light, was determined to be a digital signature used by the US Secret Service to "prevent illegal activity" (probably counterfeiting).

From a privacy point of view, it is always jarring to hear about ways that activity is being tracked without the target's awareness, but those of us in the Business Intelligence world know that there are many ways that individual activity may be (and probably is) being tracked. And sometimes, people even are happy to "be tracked," if it results in money savings or better efficiency. I am sure that there are many sideline privacy "activists" that participate in supermarket "clubs" or frequent flyer programs.

The question I want to throw out to the blogspace is: where is the line between beneficial tracking and invasive tracking?

Posted October 19, 2005 7:56 PM
Permalink | 1 Comment |

Here is my latest challenge to you readers: I had a heck of a time trying to explain to some colleagues the value of presenting the results of measured metrics tied to business performance, and I need some help in figuring out a good way to do it.

Consider the scenario: An organization has the ability to provide some basic reporting statistics on the technical (and some of the operational) aspects of their applications. But it is not necessarily clear what business value is being provided by these statistics, and whether these metrics are relevant to achieving business objectives.

Perhaps because I eat, live, and breathe data and business intelligence, the definition and use of business performance metrics tied to business data is obvious. In this case, the questions revolve around performance improvement. But how can you improve a process if you can't measure how successful you are at it in the first place?

Here is what I want to be able to convey:
1) Before improving a process, determine what the success criteria are
2) Success criteria are communicated via meeting or beating a threshold of expected performance
3) Expected performance is quantified by measuring key performance indicators
4) Key performance indicators are measurable aspects of the outputs of your process
5) Continued monitoring of your key performance indicators helps in alerting the business manager when a process does not perform within expected control limits
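Step 5 above can be sketched as a simple control-limit check. This is a minimal illustration that assumes control limits of the baseline mean plus or minus three standard deviations; real statistical process control charts are set up with more care, but the alerting idea is the same:

```python
import statistics

# Flag KPI readings that fall outside simple control limits
# (baseline mean +/- k standard deviations). A sketch only.
def out_of_control(baseline, readings, k=3.0):
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    lo, hi = mean - k * sd, mean + k * sd
    return [r for r in readings if not (lo <= r <= hi)]

# Hypothetical KPI history and new readings.
baseline = [98, 101, 99, 100, 102, 100, 99, 101]
print(out_of_control(baseline, [100, 97, 112]))  # -> [112]
```

The business manager never sees the statistics; they just get an alert when a reading like 112 lands outside the expected band, which is the whole value proposition in one function.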

That is what I wanted to say, but I think it kind of came out like this:

"... bla bla bla performance indicators bla bla ... process improvement ... bla bla bla ... control limits ... bla bla bla"

Suddenly, I am gripped with fear that I have transformed into a buzz-phrase spewing robot. (Don't worry, I got over it pretty quickly.)

But here is the challenge: How do you effectively communicate the value of systemic thinking and reporting to an audience largely experienced in procedural and operational processing?

Posted September 19, 2005 7:40 AM
Permalink | No Comments |

We hear a lot about open source software and its potential benefits to the marketplace. How about the concept of open source data? The idea is to create a repository of data that is readily available, can be configured for business benefit, and is collectively supported by a development community.

One place to start is with public data, such as what is available from the US Census Bureau.

Every 10 years, the US Census Bureau conducts a census and, as part of that process, collects a huge amount of demographic data at a very granular geographic level. They then spend the next 5+ years analyzing the data and preparing it for release, while at the same time preparing for the next decennial census.

The problem is that sometimes, by the time the decennial data is released, it no longer accurately reflects an area's demographics. For example, consider how rapidly real estate prices have risen in the past 5 years, yet 2005 home prices are not captured in Census 2000 data. Similarly, the Tiger/Line data that contains information about street addresses is occasionally updated, but new streets and subdivisions are constantly being built, so it is likely that there are omissions in the Census data set.

There are many other public domain, public records, or generally available data sets that are of great interest to the BI community. So here is the challenge: Tell me how you feel about a project to take on a publicly available data set and create an "Open Source" approach to maintaining and presenting that data. One example might be taking the Census decennial data and formulating it into a relational data structure mapped onto the geographic Tiger/Line data. Pose your ideas as comments to this entry...
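As one sketch of what that relational structure might look like, here is a minimal in-memory example using Python's sqlite3 module: demographic figures keyed to a geographic identifier. The table and column names (geography, demographics, geo_id) and the sample values are my own invention for illustration, not the Census Bureau's or Tiger/Line's actual schema:

```python
import sqlite3

# Sketch of an "open source data" layout: decennial demographic
# figures joined to geographic identifiers. Schema and values are
# illustrative only, not the real Census or TIGER/Line format.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE geography (
    geo_id TEXT PRIMARY KEY,  -- e.g., a census-tract identifier
    name   TEXT,
    state  TEXT
);
CREATE TABLE demographics (
    geo_id      TEXT REFERENCES geography(geo_id),
    census_year INTEGER,
    population  INTEGER,
    households  INTEGER
);
""")
db.execute("INSERT INTO geography VALUES ('24031700101', 'Tract 7001.01', 'MD')")
db.execute("INSERT INTO demographics VALUES ('24031700101', 2000, 4523, 1712)")

row = db.execute("""
    SELECT g.name, d.population
    FROM demographics d JOIN geography g ON g.geo_id = d.geo_id
    WHERE d.census_year = 2000
""").fetchone()
print(row)  # -> ('Tract 7001.01', 4523)
```

The appeal of the "open source" angle is that a community could maintain the loaders and the geographic mappings collectively, so each new Census or Tiger/Line release updates everyone's copy at once.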

Posted August 14, 2005 1:21 PM
Permalink | No Comments |