Hadoop-as-a-Service: An On-Premise Promise?

Hadoop-as-a-Service (HaaS) generally refers to Hadoop in the cloud, a handy alternative to on-premise Hadoop deployments for organizations whose overwhelmed data center administrators need to incorporate Hadoop but don’t have the resources to do so. What if there were also a promising option to successfully build and maintain Hadoop clusters on-premise, one that could also be referred to as HaaS? The EMC Hybrid Cloud (EHC) enables just this: Hadoop in the hybrid cloud.

EHC, announced at EMC World 2014, is a new end-to-end reference architecture based on a Software-Defined Data Center architecture. It comprises technologies from across the EMC federation of companies: EMC II storage and data protection, Pivotal CF Platform-as-a-Service (PaaS) and the Pivotal Big Data Suite, VMware cloud management and virtualization solutions, and VMware vCloud Hybrid Service. EHC’s Hadoop-as-a-Service, the underpinnings of a Virtual Data Lake, was demonstrated at last week’s VMworld 2014 San Francisco.

EHC leverages these tight integrations across the Federation so that customers can extend their existing investments for automated provisioning and self-service, automated monitoring, secure multi-tenancy, chargeback, and elasticity to address the requirements of IT, developers, and lines of business. I asked Ian Breitner, Global Solutions Marketing Director for Big Data, to explain why EMC’s approach to HaaS should be considered over other Hadoop cloud offerings.

1.  In your opinion, what are the key characteristics of HaaS?

Before we delve into this, I want to define what I mean by Hadoop for this post. Hadoop means the original framework for large-scale data processing on a cluster of commodity components. Originally it comprised a set of utilities and tools, a file system (HDFS), a resource scheduler (YARN), and an analytics engine (MapReduce) designed to process large amounts of unstructured data efficiently.

For me the key to providing anything ‘aaS’ is to provide it as a utility. As a consumer of Hadoop, I want the service rapidly provisioned, with access available when I want it, and I want to pay only for what I consume. And, by the way, it needs to be relatively inexpensive. A number of things need to be in place before I can consume Hadoop. As a consumer I don’t care about them or need to know about them, but they are important to the organization providing me the Hadoop utility: a self-service portal, metering and chargeback mechanisms, tenant isolation, a policy management framework, and management and monitoring tools.

2.  What is the value of HaaS over bare-metal deployment?

Having a HaaS model means that I, as the consumer of Hadoop, can purchase what I need, when I need it, and only for the duration of its use. This is far more attractive than going down the “bare metal” route. There are also benefits to having an ‘aaS’ model where the equipment being used can be re-allocated to other workloads when not being used by Hadoop workloads.

Deploying Hadoop on bare metal (perhaps it is more accurate to say on dedicated hardware) requires capital investment, datacenter floor space, HVAC, power, and a variety of technical skills (meaning additional staff). As a consumer of Hadoop, I now have to worry about managing these additional items. If I need to grow my Hadoop cluster, I have to invest additional funds to expand the cluster and its associated items, and there is a high likelihood of under-utilizing the hardware.

3.  EMC first introduced a methodology for HaaS with the EMC Hadoop Starter Kit (HSK). How does EHC provide a more complete solution for HaaS?

HSK allows you to get started with a HaaS offering using the VMware Big Data Extensions to create virtualized Hadoop deployments. But many of the parts required to provide this in a utility model are missing. EHC, however, is another animal (see diagram below). EHC includes all the components to create a utility model and provide ‘aaS’ offerings. One of the items that comes with EHC is the set of vCAC blueprints required to deploy Hadoop, and these can be used to create the service catalog that allows a self-service model to be deployed.

[Diagram: EPC2.0 solution guide overview]

4.  The term HaaS is still evolving, but the industry generally refers to HaaS as a replacement for on-premise Hadoop, with providers such as Amazon Web Services accounting for nearly 85% of global HaaS market revenue in 2013. What makes EHC a better choice than HaaS providers such as AWS?

There are a number of items that I would like to address here. The first is the perception that an ‘aaS’ offering must come from a Service Provider. IT departments are perfectly capable of providing a similar utility model, especially with the EHC solutions available from EMC. The major issue for IT was, and still is, the budget constraints within which they need to operate. They could not afford the skilled staff required to create the infrastructure, and they also had capital constraints. This meant that organizations like sales and marketing needed to find other ways to achieve their goals, and offerings like AWS EMR were and are attractive.

The issues in using AWS for Hadoop workloads come once these workloads go from prototype and test to production and the data sets grow. With this growth come increasing costs, and eventually the marketing or sales organization will say to IT, “go and run this for us”. Now what? By choosing to use EHC and running HaaS, the consumers have access to a utility computing model that can meet their needs, while IT gets the infrastructure to deliver the services. And as a bonus, it is also possible to expand elastically into a Service Provider offering for those occasional workloads needing additional temporary capacity.

5.  Who are the ideal candidates for EHC? HSK?

Organizations that want to run a Hadoop POC, or learn how they might apply this new analytics model to their unstructured or semi-structured data, are ideal candidates for the HSK. This is especially true if they already own Isilon, since expanding the existing platform is easy and transparent.

Organizations that want to provide HaaS to their internal customers are ideal candidates for EHC; typically these would be Enterprise customers. With EHC, IT organizations can broker services from private and public clouds, enabling visibility and control over the best location to run business applications. For example, you can push your EHC HaaS deployment to vCloud Air with ease when needed.

Organizations that have started to use, or are already using, AWS EMR are also candidates for running HaaS on EHC.

About the Author: Mona Patel