Building Data Lakes in the Real World

Data lakes are here to stay. In a recent lab validation brief, IDC stated that “data lakes should be a part of every big data workflow in the enterprise.” And in a recent blog post, EMC’s own Hadoop champion Paul Maritz proposed three ways a business can view data lakes and offered up “What CIOs Need to Know about Data Lakes.” As with most things in the real world, however, a one-size-fits-all approach seldom works, and implementing data lakes in your organization is no exception.

The movement toward a more elastic, service-consumption model for IT, often built on or around open source technologies, continues to gain ground. IT seems to have gotten the message: it’s time to stop focusing purely on control and stability. Instead, enable rapid innovation or get out of the way!

For some businesses, instituting the data lake concept means using intelligent software-defined storage resource management to store petabytes of data efficiently and make that data available with multiprotocol access. For others, it means a hyper-converged data lake, complete with apps, compute resources, and networks, delivered as an integrated appliance. In either case, the decision comes down to the unique challenges a business faces in delivering performance, managing growth, and gaining insight from its data.
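To make multiprotocol access concrete, here is a minimal sketch. It assumes a hypothetical storage pool that is both mounted over NFS at /mnt/lake and exposed through an HDFS-compatible interface at lake.example.com:8020; the mount point, hostname, and file names are illustrative only, not a reference to any specific product configuration.

```python
# Minimal sketch of multiprotocol access: one copy of the data, two protocols.
# The NFS mount point, HDFS endpoint, and file names below are hypothetical.
import csv
from pyspark import SparkConf, SparkContext

# 1) An application writes a file through the NFS mount, as plain POSIX I/O.
with open("/mnt/lake/web/clicks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "page"])
    writer.writerow(["42", "/products"])

# 2) A Hadoop-style job reads the very same file over HDFS, with no copy step.
sc = SparkContext(conf=SparkConf().setAppName("multiprotocol-demo"))
lines = sc.textFile("hdfs://lake.example.com:8020/web/clicks.csv")
print(lines.count())  # same data, reached through a second protocol
sc.stop()
```

The point of the sketch is that the file lands once on the shared storage pool and is then visible to every consumer, whether that consumer speaks NFS, SMB, or HDFS.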

Creating an Analytics Data Lake in Less than Two Weeks

Adobe Systems offers an excellent example of implementing a data lake. The company needed an analytics data lake infrastructure that delivered the performance and scalability its Hadoop-as-a-Service initiative demanded, and it needed that infrastructure quickly. Together we developed a semi-virtual data lake using virtual compute resources and shared storage, and the project moved from concept to a Hadoop-based proof-of-concept data lake in just a week and a half!

Adobe Systems shares its experience in developing the requirements that drove the architecture, including capacity planning, computational needs, and networking components, in “Virtualizing Hadoop in Large-Scale Infrastructures.”

Apache Hadoop has become a prime tool for analyzing Big Data and helping organizations improve strategic decision-making. Adobe’s Digital Marketing organization, which runs analytics jobs at petabyte scale, leveraged EMC Isilon’s analytics-ready capabilities to implement a virtualized data lake for its Hadoop-based applications. Not only was the project completed in record time, it also chalked up a 65 TB Hadoop job, one of the industry’s largest in a virtualized environment.

In this case, an Isilon-based data lake enables Adobe to run analytics queries against entire data sets without expensive data copying or translation, and that capability adds up to a competitive advantage.
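To illustrate what querying data in place can look like, here is a minimal sketch using PySpark against a hypothetical HDFS-compatible endpoint on shared storage; the endpoint, paths, and schema are assumptions for illustration and do not describe Adobe’s actual pipeline.

```python
# Sketch of querying a data set in place on shared storage, rather than
# copying it into a separate Hadoop cluster first. The endpoint, paths,
# and field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-in-place").getOrCreate()

# Point the job directly at the shared, HDFS-compatible storage; there is
# no distcp or ETL copy step before the analysis can start.
events = spark.read.json("hdfs://lake.example.com:8020/marketing/events/")

events.createOrReplaceTempView("events")
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM events
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
spark.stop()
```

The analysis starts where the data already lives, which is exactly the cost the copy-then-analyze pattern avoids paying here.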

This is the true value of a data lake in today’s business world.

About the Author: Nick Kirsch