20 Node Hadoop Cluster, With Hive, No Pig Please

We can all agree that Hadoop is a key component of a Big Data strategy, but without a simple and fast way to stand up a complex Hadoop system, IT cannot deliver the value promised by Big Data. The good news is that VMware has addressed the challenge with Project Serengeti, enabling enterprises to quickly deploy, manage, and scale Apache Hadoop in virtual and cloud environments. Available for free download under the Apache 2.0 license, Serengeti allows enterprises to leverage the VMware vSphere platform to deploy a Hadoop cluster in minutes, including common Hadoop components such as HDFS, MapReduce, Pig, and Hive.

EMC further accelerates the Hadoop deployment process through the only end-to-end Big Data storage and analytics solution that leverages the power of VMware Serengeti. By combining Serengeti with EMC Isilon scale-out NAS and EMC Greenplum HD (a 100 percent open-source, certified, and supported version of the Apache Hadoop stack), you can deploy and configure a Hadoop analytics solution faster and more cost-effectively than with any other deployment option. Watch a quick demonstration of how this powerful Hadoop analytics solution is deployed in minutes.

Want more details? Fausto Ibarra, Sr. Director of Product Management, Data and Analytics at VMware, explains the value proposition of Serengeti.

1.  First things first: how can Serengeti deploy a standardized Hadoop cluster with a single command in under 10 minutes?

It is a very easy process. You install the Serengeti virtual appliance and then download the Hadoop distribution of your choice into a virtual machine. You simply give Serengeti a few parameters, such as the number of nodes you would like in your Hadoop cluster and the amount of memory and storage, and Serengeti will rapidly provision a VM for each node in your Hadoop cluster. For example, if you want to create a 20-node Greenplum HD cluster, Serengeti tells vSphere to create 20 VMs, each containing a configured Greenplum HD node, plus a few additional VMs for the Hadoop master and client nodes.
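As a rough illustration of that workflow, here is a minimal Python sketch that writes a cluster spec and hands it to the Serengeti CLI in one command. The JSON field names, role names, and CLI flags below are assumptions modeled on the node-group workflow just described, not a documented schema, and the sketch assumes the CLI is on the PATH and accepts commands non-interactively.

```python
import json
import subprocess

# Hypothetical 20-node Greenplum HD spec using Serengeti-style node groups.
# Field and role names are illustrative assumptions, not the official schema.
spec = {
    "nodeGroups": [
        {"name": "master", "roles": ["hadoop_namenode", "hadoop_jobtracker"],
         "instanceNum": 1, "memCapacityMB": 8192},
        {"name": "worker", "roles": ["hadoop_datanode", "hadoop_tasktracker"],
         "instanceNum": 20, "memCapacityMB": 4096},
    ]
}

with open("gphd-cluster.json", "w") as handle:
    json.dump(spec, handle, indent=2)

# A single command hands the spec to Serengeti, which asks vSphere to
# provision and configure a VM for every node in the spec.
subprocess.check_call(
    ["serengeti", "cluster", "create",
     "--name", "gphd20", "--specFile", "gphd-cluster.json"])
```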

2.  It was already possible to run Hadoop in a virtualized environment. How is Serengeti adding value?

There are several areas where Serengeti adds value and makes for a better Hadoop deployment strategy.

First is ease of deployment. Serengeti automates the entire process of creating the VMs and installing and configuring the Hadoop master and worker nodes.

Second is high availability. Serengeti has made Hadoop ‘virtualization-aware’, which is important not only for optimized performance but also for the intelligent data replication needed for high availability. Serengeti creates an additional layer called ‘node groups’ to identify nodes that are physically grouped, so Hadoop can replicate data across node groups to ensure high availability (see the topology-script sketch after this list).

Third is elasticity. You can easily grow and shrink the Hadoop cluster as needed, which is especially valuable in a multi-tenant environment (see the resize sketch after this list).

Last but not least is flexibility. Serengeti supports all the major Hadoop distributions, including Apache, Greenplum, Hortonworks, and Cloudera.
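To make the ‘node groups’ idea concrete, here is a minimal sketch of a Hadoop network-topology script, the standard mechanism (configured via the topology.script.file.name property) that Hadoop consults when placing block replicas. Giving every VM on the same physical host the same node-group path keeps Hadoop from storing all copies of a block on one host. The IP-to-group mapping below is hard-coded for illustration; a real script would derive it from inventory data.

```python
#!/usr/bin/env python
# Sketch of a Hadoop topology script: Hadoop invokes it with host IPs as
# arguments and reads one network path per host from stdout. The node-group
# layer in each path marks VMs that share a physical host.
import sys

NODE_GROUPS = {
    "10.0.1.11": "/rack1/esx-host-a",
    "10.0.1.12": "/rack1/esx-host-a",  # a VM on the same physical host as .11
    "10.0.1.13": "/rack1/esx-host-b",
    "10.0.1.14": "/rack2/esx-host-c",
}

for ip in sys.argv[1:]:
    print(NODE_GROUPS.get(ip, "/default-rack"))
```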
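And to illustrate the elasticity point, here is a hypothetical Python wrapper around a Serengeti cluster resize command. The command name and flags are assumptions based on the workflow described in this interview, not a documented interface.

```python
import subprocess

def resize_worker_group(cluster, count):
    """Grow or shrink the worker node group to the requested size."""
    subprocess.check_call(
        ["serengeti", "cluster", "resize",
         "--name", cluster, "--nodeGroup", "worker",
         "--instanceNum", str(count)])

# Borrow ten extra nodes for a nightly batch window, then hand the
# capacity back to other tenants once the jobs finish.
resize_worker_group("gphd20", 30)
resize_worker_group("gphd20", 20)
```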

3.  What are the performance implications when running Hadoop in a virtualized environment?

One of the advantages of virtualizing Hadoop is that there is virtually no performance impact. We have seen only single-digit performance differences, up or down, depending on the application. We have worked with our Hadoop distribution partners and the open source community to optimize Hadoop performance. We have also published several benchmarks and found that with a single VM per host in a 7-node cluster, the average increase in elapsed time over the native configuration was 4%. This is a small price to pay for all the other advantages offered by virtualization. Running two or four smaller VMs on each host actually resulted in average performance better than the corresponding native configuration, in some cases up to 14% faster.

4. What are the advantages of using EMC Isilon scale-out NAS with virtualized Hadoop compared to local disk, SAN, or other storage options?

Isilon supports HDFS natively, so it is a great deployment strategy: you gain all the benefits of scale-out NAS in a virtualized Hadoop environment, including incremental scalability, throughput and performance, high availability, and data protection. In fact, Serengeti enables you to quickly deploy a complete and powerful Big Data storage and analytics solution by deploying Greenplum HD with Isilon storage as the HDFS file system.
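As a minimal sketch of what using Isilon as the HDFS file system means in practice, the snippet below writes a core-site.xml whose default file system points at an Isilon SmartConnect zone instead of a local NameNode. The hostname and port are placeholders for your environment, and the fs.default.name property reflects the Hadoop 1.x generation discussed here.

```python
# Write a minimal core-site.xml that makes Isilon the cluster's HDFS layer.
# Hostname and port below are illustrative placeholders.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://isilon-smartconnect.example.com:8020</value>
  </property>
</configuration>
"""

with open("core-site.xml", "w") as handle:
    handle.write(CORE_SITE)
```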

5. What is the advantage of using Greenplum HD over the other supported Hadoop distributions?

Greenplum HD is optimized for Isilon, so, as I mentioned earlier, the combination is a powerful Big Data deployment strategy. Also, we are working closely with the Greenplum team at EMC to incorporate all of our contributions to the Apache Hadoop project into Greenplum HD, which helps optimize Greenplum HD to run better in a virtualized environment. In fact, in a few weeks, Greenplum HD 1.2 will be one of the first distributions to include the Hadoop Virtualization Extensions (HVE) that VMware has contributed to the Apache Hadoop community.

6. With new product developments and acquisitions, one can conclude that VMware is striving to create a leading platform for “big, fast and flexible data in the cloud”. VMware released Spring Hadoop to help developers create applications with Apache Hadoop, acquired online analytics provider Cetas, and unveiled its new in-memory database, SQLFire. What’s next in VMware’s Big Data ambitions?

VMware is transforming IT into a software-defined data center. We are already virtualizing compute, storage, and networking. When it comes to Big Data, our goal is to continue to make vSphere the best platform to run Big Data workloads such as Hadoop, HBase, and real-time analytics. This will accelerate the adoption of Hadoop and enable the creation of many new Big Data applications across many industries.

To download and contribute to Project Serengeti, please visit  

About the Author: Mona Patel