Want to Explore Hadoop, But No Tour Guide?

Are you a VMware vSphere customer? Do you also own EMC Isilon? If you said yes to both, I have great news for you – you have all the ingredients for the EMC Hadoop Starter Kit (HSK). In just a few short hours you can spin up a virtualized Hadoop cluster by downloading and following the HSK step-by-step guide. Watch the demo below of HSK being used to deploy Hadoop:

Now you don’t have to imagine what Hadoop tastes like because this starter kit is designed to help you execute and discover the potential of Hadoop within your organization. Whether you are new to Hadoop or an experienced Hadoop user, you will want to take advantage of this turnkey solution for the following reasons:

-Rapid provisioning – From the creation of virtual Hadoop nodes to starting up the Hadoop services on the cluster, much of the Hadoop cluster deployment can be automated, requiring little expertise on the user’s part.

-High availability – HA protection can be provided through the virtualization platform to protect the single points of failure in the Hadoop system, such as NameNode and JobTracker Virtual Machines.

-Elasticity – Hadoop capacity can be scaled up and down on demand in a virtual environment, thus allowing the same physical infrastructure to be shared among Hadoop and other applications.

-Multi-tenancy – Different tenants running Hadoop can be isolated in separate VMs, providing stronger VM-grade resource and security isolation.

-Portability – Use any Hadoop distribution throughout the Big Data application lifecycle with zero data migration – Apache Open Source, Pivotal HD, Cloudera, Hortonworks.

I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, and asked him to explain why every organization that uses VMware vSphere and EMC Isilon should use this starter kit for Big Data projects.

1.  Why did you create the starter kit and what are the best use cases for this starter kit?

Some of the barriers to Hadoop deployment are lack of equipment and expertise, but we found that our Isilon customers with virtualized environments have an advantage. Through some simple downloads of free software and documented configuration steps, we can actually help customers deploy Hadoop for either sandbox or enterprise production environments. Here are some of the use cases we have seen for HSK:

  1. Moving off of Amazon and bringing Hadoop in house. Business units that cannot wait for IT to bring Hadoop into the enterprise will go to Amazon or another cloud provider. With HSK, IT can easily offer Hadoop as a service to meet business objectives for the enterprise, or create an environment for migrating off of external Hadoop providers.
  2. Initial Hadoop projects or sandbox environments that simply want to start experimenting with data.
  3. Large volumes of unstructured data already living in Isilon that can now be exploited through Hadoop MapReduce programs.

2.   At a high level, can you walk us through HSK?

For existing Isilon and vSphere customers, HSK aims to automate the deployment of virtualized Hadoop clusters using native HDFS integration with Isilon. HSK walks you through acquiring all of the needed software and license components, followed by the configuration steps for deploying Big Data Extensions, HDFS, and Hadoop clusters. Once deployed, you can access Hadoop in a virtualized environment with native HDFS access to Isilon. To get your feet wet with Hadoop and see how it works, we then walk you through a sample word count application that only requires downloading some text files.
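If you have never seen a word count job, here is a rough idea of what it computes. This is only a minimal Python sketch of the map and reduce steps running in a single process; it is not the sample application from the HSK guide, and the function names (`map_words`, `reduce_counts`, `word_count`) are made up for illustration.

```python
import re
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

On a real cluster, Hadoop runs the map step in parallel across the nodes and groups the pairs by word before the reduce step, but the result is the same kind of word-to-count table.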

3.  So what are the prerequisites?

All the software is free. All you need is an existing EMC Isilon cluster and a VMware vSphere 5.2 environment you can utilize.

4. Sounds foolproof. Can anything go wrong?

It is a pretty straightforward and simple install, and we tested HSK with several users. The only issue that popped up during our testing was a connection error, which occurred because SSO was not getting the correct time from the deployed vApp. This was easily rectified by either manually setting the time on the management server or using NTP on the ESX hosts.
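To see why a wrong clock can surface as a connection error, consider a generic sign-on token that is only valid for a limited time window. The Python sketch below is an assumption-laden illustration of that general idea, not the actual logic of VMware SSO; `token_accepted` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def token_accepted(issued_at, lifetime, verifier_now, tolerance=timedelta(0)):
    """Return True if the verifier's clock falls inside the token's validity window."""
    not_before = issued_at - tolerance
    not_after = issued_at + lifetime + tolerance
    return not_before <= verifier_now <= not_after


if __name__ == "__main__":
    issued = datetime(2013, 1, 1, 12, 0, tzinfo=timezone.utc)
    lifetime = timedelta(minutes=5)

    # Clocks in sync: the verifier sees the token one minute after issue.
    print(token_accepted(issued, lifetime, issued + timedelta(minutes=1)))

    # Verifier clock one hour behind: the token appears to come from the future.
    print(token_accepted(issued, lifetime, issued - timedelta(hours=1)))
```

Keeping every component on the same time source, for example through NTP, removes this class of failure.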

5. What value do you get from virtualizing Hadoop over traditional Hadoop deployments?

There are many, but let me start with ease of management through Big Data Extensions from VMware. You don’t have to be a Hadoop expert since the management of Hadoop clusters is automated. Another value proposition is simplicity: you have a central data repository on Isilon, which enables the use of different Hadoop distributions against a shared data service. You can spin up Hadoop clusters based on Apache Open Source, Pivotal HD, Cloudera, or Hortonworks to access the same data repository, eliminating the need to ingest the data multiple times through HDFS. In fact, if you already have data living in Isilon that is being accessed through protocols such as NFS or SMB, you don’t even need to ingest the data through HDFS, since Isilon serves the same data over all of these protocols.

About the Author: Mona Patel