EMC Hadoop Starter Kit: Creating a Smarter Data Lake

Pivotal HD offers a wide variety of data processing technologies for Hadoop – real-time, interactive, and batch. Add EMC Isilon scale-out NAS to Pivotal HD as integrated data storage and you have a shared data repository with multi-protocol support, including HDFS, to service a wide variety of data processing requests. This smells like a Data Lake to me – a general-purpose data storage and processing resource center where Big Data applications can develop and evolve. Add EMC ViPR software defined storage to the mix and you have the smartest Data Lake in town, one that supports additional protocols and hardware and automatically adapts to changing workload demands to optimize application performance.

EMC Hadoop Starter Kit, ViPR Edition, now makes it easier to deploy this ‘smart’ Data Lake with Pivotal HD and other Hadoop distributions such as Cloudera and Hortonworks. Simply download this step-by-step guide and you can quickly deploy a Hadoop or Big Data analytics environment, configuring Hadoop to utilize ViPR for HDFS, with Isilon hosting the Object/HDFS data service. Although in this guide Isilon is the storage array that ViPR deploys objects to, other storage platforms are also supported – EMC VNX, NetApp, OpenStack Swift and Amazon S3.

I asked the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, to explain why every organization should use this starter kit to optimize their IT infrastructure for Hadoop deployments.

1.  The original EMC Hadoop Starter Kit released last year was a huge success.  Why did you create ViPR Edition?

Organizations that deploy Hadoop as dedicated environments are creating more data silos. This guide enables customers to minimize data silos by deploying any of the three most popular Hadoop distributions (Pivotal, Cloudera, Hortonworks) on EMC ViPR software defined storage, so organizations can leverage their existing investments in storage platforms and infrastructure for Big Data analytics. Massive amounts of data already live in storage platforms, and ViPR will ‘analytics’ enable those storage arrays without the need to create a separate dedicated Hadoop environment.

2.  What are the best use cases for HSK ViPR Edition?

First, you can instantly deploy a Big Data repository by using existing enterprise storage capacity as “Data Lakes” on which to run analytics.

Second, you can reduce the growth of dedicated Hadoop environments, since large volumes of unstructured data already living in EMC storage or in third-party arrays such as NetApp can now be exploited through Hadoop programs.

Third, you can eliminate the need for multiple copies of the same data for different types of applications, thanks to ViPR’s support for multiple protocols and mixed workloads. ViPR enables dual-mode access to the data under its management, so object-based workloads and analytics applications can manipulate the same data: ViPR provides S3, Swift and Atmos API support as well as HDFS API access.

3.  So what are the prerequisites for HSK ViPR Edition?

The guides are designed to enable the use of ViPR as a Hadoop-compatible file system that resides as object storage on top of an existing ViPR-supported file storage array. So to start, you need a file system array that you can deploy ViPR data services in front of. For the compute side, you need either physical or virtual machines to run the Hadoop cluster; anywhere from one to many can be used. The guides walk you through the automated deployment tools available through each distribution and show how to use the native management tools to integrate ViPR HDFS services.
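
To make that last integration step a little more concrete, here is a minimal sketch in Python of the kind of change involved. It is not the guide's actual procedure. Hadoop reads its default file system from the `fs.defaultFS` property in core-site.xml and can map a URI scheme to an implementation class through an `fs.<scheme>.impl` property. The scheme `examplefs`, the URI and the class name below are hypothetical placeholders; the real ViPR values come from the guide for your distribution.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Render a dict of Hadoop properties as core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical placeholders, NOT the real ViPR scheme, URI or class name.
SCHEME = "examplefs"
properties = {
    # Default file system that Hadoop jobs resolve relative paths against.
    "fs.defaultFS": f"{SCHEME}://bucket.namespace.site/",
    # Java class that implements the file system for this URI scheme.
    f"fs.{SCHEME}.impl": "com.example.hadoop.ExampleFileSystem",
}

core_site_xml = build_core_site(properties)
print(core_site_xml)
```

In practice you would make this change through each distribution's own management tool, as the guides describe, rather than generating the file by hand.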

About the Author: Mona Patel