Blog: Mark Madsen Subscribe to this blog's RSS feed!

Mark Madsen

Open source is becoming a required option for consideration in many enterprise software evaluations, and business intelligence (BI) isn't exempt. This blog is the interactive part of my Open Source expert channel for the Business Intelligence Network where you can suggest and discuss news and events. The focus is on open source as it relates to analytics, business intelligence, data integration and data warehousing. If you would like to suggest an article or link, send an e-mail to me at

About the author >

Mark, President of Third Nature, is a former CTO and CIO with experience working in both IT and vendors, including a stint at a company used as a Harvard Business School case study. Over the past decade, Mark has received awards for his work in data warehousing, business intelligence and data integration from the American Productivity & Quality Center, the Smithsonian Institute and TDWI. He is co-author of Clickstream Data Warehousing and lectures and writes about data integration, business intelligence and emerging technology.


Last week I presented at the Big Data Summit and attended Hadoop World in New York. Both events focused on the use of Hadoop and MapReduce for the processing and analyzing of very large amounts of data.

The Big Data Summit was organized by Aster Data and sponsored by Informatica and Microstrategy. Given that the summit was in the same hotel as that used for Hadoop World the following day, it would be reasonable to expect that most of the attendees would be attending both events. This was not entirely the case. Many of the summit attendees came from enterprise IT backgrounds and these folks were clearly interested in the role of Hadoop in enterprise systems. Whereas many of them were knowledgeable about Hadoop, an equal number were not.

The message coming out of the event was that Hadoop is a powerful tool for the batch processing of huge quantities of data, but coexistence with existing enterprise systems is fundamental to success. This is why Aster Data decided to use the event to launch their Hadoop Data Connector, which uses Aster's SQL-MapReduce (SQL-MR) capabilities to support the bi-directional exchange of data between Aster's analytical database system and the Hadoop Distributed File System (HDFS). One important use of Hadoop is to preprocess, filter, and transform vast quantities of semi-structured and unstructured data for loading into a data warehouse. This can be thought of as Hadoop ETL. Good load performance in this environment is critical.

Case studies from Comscore and LinkedIn demonstrated the power MapReduce in processing pedabytes of data. In the case of Comscore they are aiming to manage and analyze 3 months of detailed records (160 billion records) using Aster SQL/MR. LinkedIn, on the other hand is using a combination of Hadoop and Aster's MapReduce capabilities and moving data between the two environments. Performance and parallel processing is important for efficiently managing this exchange of data. This latter message was repeated by several other case studies at both events.

Hadoop World had a much more open source and developer feel to it. It was organized by Cloudera and had about 500 attendees. About half the audience was using Amazon Web Services and clearly experienced in Hadoop. Sponsors included Amazon Web Services, IBM, Facebook and Yahoo, all of whom gave keynotes. These keynotes were great for big numbers. Yahoo, for example, has 25,000 nodes running Hadoop (the biggest cluster has 4,000 nodes). Floor space and power consumption become major issues when deploying this level of commodity hardware. Yahoo processes 490 terabytes of data to construct its web index. This index takes 73 hours to build and has experienced a 50% growth in a year. This highlights the issues facing many web-based companies today, and potentially other organizations in the future.  

Although the event was clearly designed to evangelize the benefits of Hadoop, all of the keynotes emphasized interoperability with, rather than replacement of, existing systems. Two relational DBMS connectors were presented at the event including Sqoop from Cloudera and support for the Cloudera DBInputFormat interface from Vertica. Cloudera also took the opportunity of announcing it was evolving from a Hadoop services company to being a developer of Hadoop software.

The track sessions were grassroots Hadoop-related presentations. There was a strong focus on improving the usability of Hadoop and adding database and SQL query features to the system. I felt on several occasions many people were trying to reinvent the wheel and trying to solve problems that had already been solved by both open source and commercial database products. There is a clear danger of trying to expand Hadoop and MapReduce from being an excellent system for the batch processing of vast quantities of information to being a more generalized DBMS.  

The only real attack on existing database systems came surprisingly from the J. P. Morgan financial services company. The presentation started off by denigrating current systems and presenting Hadoop as an open source solution that solved everyone's problems at a much lower cost. When it came to use cases, however, the speakers positioned Hadoop as suitable for processing large amounts of unstructured data with high data latency. They also listed a number of "must have" features for the use of Hadoop in traditional enterprise situations: improved SQL interfaces, enhanced security, support for a relational container, reduced data latency, better management and monitoring tools, and an easier to use developer programming model. Sounds like a relational DBMS to me. Somehow the rhetoric at the beginning of the session didn't match the more practical perspectives of the latter part of the presentation.

In summary, it is clear that Hadoop and MapReduce have an important role to play in data warehousing and analytical processing. They will not replace existing environments, but will interoperate with them when traditional systems are incapable of processing big data and when certain sectors of an organization use Hadoop to mine and explore the vast data mountain that exists both inside and outside of organizations. This makes the current trend toward hybrid RDBMS SQL and MR solutions from companies such as Aster Data, Greenplum and Vertica an interesting proposition. It is important to point out, however, that each of these vendors takes a different approach to providing this hybrid support and it is essential that potential users match the hybrid solution to application requirements and developer skills. It is also important to note that Hadoop is more than simply MapReduce.   

If you want to get up to speed on all things Hadoop, read some case studies, and gain an understanding of its pros and cons versus existing systems then get Tom White's (I am not related!) excellent new book "Hadoop: The Definitive Guide" published by O'Reilly.

Posted October 6, 2009 1:37 PM
Permalink | No Comments |

Leave a comment