Traditionally customers deploy Hadoop on dedicated servers with a dedicated Hadoop file system on top of direct attached storage. Loading data into this Hadoop silo can be a challenge and time consuming. This is an outdated approach and dedicated hardware is not needed anymore.
With VMware you already have a server infrastructure to run Hadoop and Isilon allows you to leverage HDFS directly on your scale-out NAS.
I had the chance to test this setup at University of Fribourg and run a couple of Hadoop jobs on a small 4 Node ESX Cluster (VSPEX Blue) pointed to a 3 Node Isilon system for shared HDFS.
To get started you need to invest 0$. Just download the Software;
- VMware Big Data Extension 2.3
- existing Isilon NAS or IsilonSD (Software Isilon for ESX)
- Hortonworks, Cloudera or PivotalHD
- EMC Isilon Hadoop Starter Kit (documentation and scripts)
VMware Big Data Extension
VMware Big Data Extension helps to quickly roll out Hadoop clusters. BDE is a virtual appliance based on Serengenti and integrated as a plug-in to vCenter. https://www.vmware.com/support/pubs/vsphere-big-data-extensions-pubs.html
The vApp deploys a management VM and a template to clone Hadoop VMs from;
One can manage BDE through CLI or in vCenter. For consistency it’s recommended to do either CLI or GUI management only. From the new management-server VM you can start Serengeti CLI and connect it to vCenter. Add networks and datastores to be used by BDE and you’re ready to roll out the first cluster:
This command creates a cluster with 1 master node and 16 workers. A name node is not needed. It runs as a clustered service directly on all Isilon nodes. The vspex-bde.json file contains the configuration for the cluster nodes. Like how many nodes, memory and vCPU per node, master / worker node, etc.
In vCenter the cluster is named accordingly and the VMs are visible;
Now you have basically 17 naked VMs for Hadoop without a distribution installed. This is important to understand. BDE only automates the deployment of VMs and makes it easy to adjust vCPU, Memory count and add or remove worker VMs.
In the next step you can start to install your preferred Hadoop distribution. I tested Hortonworks and Cloudera utilizing an existing 3 Node Isilon Cluster as HDFS Store.
HDFS on Isilon scale-out NAS
For HDFS we have an Isilon which is a multiprotocol NAS platform. This means the data can be stored through any protocol like NFS, CIFS and directly analyzed by Hadoop nodes through HDFS as a protocol. If you don’t have an Isilon cluster, you can download the software only version for free use. www.emc.com/getisilon
With HDFS on Isilon one can benefit from all Enterprise NAS features like replication, snapshots and backup integration. Plus the usable capacity on Isilon is 80% of raw. Instead of < 33% with HDFS on DAS.
Quote from Cloudera; ”In the decade and a half since Google originally designed that architecture, though, the industry has made some progress. A few years back we worked with EMC to integrate the Isilon storage system with Hadoop, and we have customers running Cloudera today against large-scale Isilon stores. Separating the storage grid from the compute grid turns out to have some administrative and cost advantages for enterprise deployment at scale.”
EMC Hadoop Start Kits
Before installing your preferred Hadoop distribution, have a look at EMC Hadoop start kits. It includes scripts to prepare your VMs for Hadoop deployment. Like scripts to enable SSH, mount a common NFS share as repository and install prerequisite packages on all VMs
The start kit is available for Pivotal, Cloudera and Hortonworks
As preparation for Hortonworks we need to enable the Ambari agent on Isilon and create the required users. During HDP Setup, when registering your hosts, just add the name of the Isilon cluster as one of the hosts; and check to use Isilon as the DataNode only;
The Setup with Cloudera Manager is even straighter forward. In the list of services to install one can just choose Isilon as the HDFS Layer;
With the Hadoop cluster ready it’s finally time for some performance tests.
The compute nodes are four nodes with an E5-2620 each all in one 2U chassis and I’ve deployed 16 VMs as Hadoop worker nodes. Each worker VM has 2 vCPU and 16GByte of Memory.
The HDFS Layer is running on a small Isilon Cluster with 3 physical Nodes.
Teragen is not a compute intensive workload and the bottleneck is typically the storage layer. I’ve generated 100GByte of random data and wrote it to HDFS on Isilon.
30 tasks, about 2 tasks per VM, have shown to give decent performance.
[[email protected] ~]# time hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000000000 /hadoop/teragen/cloudera/30task-100GB
Isilon GUI shows that inbound throughput jumps to 15-19Gbit/s. That’s a pretty decent number for writes on 3*23 Disks protected with FEC on a distributed files system. Each Isilon Node processes about 700-800 MByte/s ingest.
The teragen job of 100GByte completed in 1 minute and 11 seconds.
Next up is the terasort benchmark which is more a compute bound job. For testing sake I run the terasort against the 100GB of random data generated in the last test;
[[email protected] ~]# time hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /hadoop/teragen/cloudera/30task-100GB /hadoop/teragen/cloudera/30task-100GB-out
Terasort by default does mapping first and starts the reduce tasks when mapping is past 85%. After 7 minutes all data was mapped and reduce was at 25%… After 10 minutes the job completed.
Interestingly the load on Isilon was very little during terasort and the utilization of the nodes remained <10%. Only between 0-500 MByte/s were read (while the 3 Isilon Nodes could easily deliver >3GByte/s of reads.) This is because the bottleneck was on the compute side as witnessed in Cloudera Manager and in vCenter;
At the beginning of the terasort job, the cluster CPU utilization jumped to 100%;
And in vCenter alerts of vCPU usage popped up showing all 16VMs consuming around 50GHz. That’s about half of the power available to the 4 ESX hosts.
This POC and many other customer examples have shown that with today’s virtualization, networking and storage technology also a Hadoop cluster can run on shared infrastructure leveraging enterprise grade availability, efficiency and reliability. The use of Isilon provides the unique capability to analyze the data directly where it was created on the multi-protocol scale-out NAS platform. Virtualizing Hadoop means increased efficiency; you can start very small and offer Hadoop as a service. Multi tenancy can be implemented end-to-end.
A BIG THANK YOU to www.datastore.ch to provide the equipment and support for setting up this POC @ University of Fribourg.
What’s the next hot thing?
If you need more, much more performance and plan to do real-time analytics, check out the new release of DSSD D5.
Cloudera is also excited about it, check out their view; https://vision.cloudera.com/when-andy-bechtolsheim-starts-a-company-you-pay-attention/