I am really a sci-fi geek at heart. The second Hobbit movie in the new trilogy came out last December, and I couldn’t wait. Being a huge Lord of the Rings fan, I am reminded of the opening line of the Lord of the Rings film trilogy:
“I amar prestar aen.”
The world has changed. The same can be said of Cloud Computing and Big Data Analytics. As datasets become larger and more distributed, companies are increasingly turning to the cloud to store them. At the same time, the business potential of Big Data is growing. Gartner recently released a study showing that the financial value of big data is increasing across verticals – retail could see a 60%+ increase in net margin, US healthcare could grow to $300 billion per year, and the personal mobile data market could realize as much as $100 billion per year for service providers.
Enter EMC ViPR Data Services
EMC ViPR has been designed to address the increasingly ubiquitous nature of data storage by transforming existing heterogeneous physical storage into a simple, extensible, and open virtual storage platform. It abstracts storage from physical arrays into a single pool of virtual storage while maintaining the unique capabilities of the underlying arrays. The net result is that data can live in the traditional enterprise outside of traditional file system boundaries, which is exactly what big data requires. This enables a single object platform spanning multiple storage arrays, and even multiple data centers, upon which higher-level data services and global applications can act. A deeper dive on the ViPR data services platform can be found here. The following architecture diagram describes the ViPR stack, turning physical storage into a ubiquitous object storage platform.
Hadoop and EMC Pivotal
Hadoop is an open source Apache project that provides a Java-based programming framework supporting the processing of large data sets in a distributed computing environment. One of the main benefits of this architecture is that it provides a low-cost, scale-out, and natively fault-tolerant infrastructure on which to run higher-level applications. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper. EMC Pivotal instantiates this architecture as follows:
MapReduce is a computational framework that allows developers to process massive amounts of unstructured data without having to worry about where the data resides. A job is divided into a “map” and a “reduce” phase, and the workload is distributed among commodity servers, often called “worker nodes”. For a great primer on Hadoop, see this video by Dr. Patricia Florissi of the EMC ASD organization.
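The classic illustration of the two phases is wordcount, the same job run later in this post. A minimal in-process sketch of the idea (plain Python, not tied to any Hadoop API) might look like this:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in text.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by key and sum the counts per word
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

counts = reduce_phase(map_phase("call me ishmael call me"))
print(counts)  # {'call': 2, 'me': 2, 'ishmael': 1}
```

In a real Hadoop cluster the map phase runs in parallel on the worker nodes holding the input splits, and the framework handles the shuffle between the two phases.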
ViPRFS HDFS Integration
ViPR Data Services creates an object layer over heterogeneous storage through an abstraction layer. Applications connect via a data path REST protocol using the Amazon S3, EMC Atmos, or Swift API. The latest release of ViPR integrates this data services API with Hadoop by installing a Java client jar file on each worker node as a Hadoop Compatible File System (HCFS). The jar file is simply placed anywhere in the Hadoop classpath on each node. The following diagram depicts the ViPRFS client running on each of the data, or worker, nodes.
By modifying core-site.xml to add the ViPRFS Java classes, define the ViPRFS URI, and set a few other configuration properties, we point the Pivotal cluster at the ViPR HDFS implementation. The MapReduce framework can then read and write data directly on the ViPRFS file system, and standard HDFS CLI commands work transparently as well.
Below is a listing of the core-site.xml changes. Note that ViPRFS is completely transparent to a cluster that also runs Hadoop applications on a standard HDFS filesystem; one simply specifies the full ViPRFS URI on the MapReduce job. In the example below, “viprfs://hdfs1.namespace.ViPR” is the HDFS-compatible URI for this cluster.
core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<!-- Begin ViPRFS Changes -->
<property>
<name>fs.defaultFS</name>
<value>viprfs://hdfs1.namespace.ViPR</value>
</property>
<property>
<name>fs.vipr.installations</name>
<value>ViPR</value>
</property>
<property>
<name>fs.vipr.installation.ViPR.hosts</name>
<value>ViPR data node IP address(es)</value>
</property>
<property>
<name>fs.viprfs.impl</name>
<value>com.emc.hadoop.fs.vipr.ViPRFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.viprfs.impl</name>
<value>com.emc.hadoop.fs.vipr.ViPRAbstractFileSystem</value>
</property>
<!-- End ViPRFS Changes -->
</configuration>
We then put the entire stack together end-to-end, and come up with the following architecture below.
Running an HDFS CLI command against ViPRFS shows the HDFS-compatible URI. The command below lists the contents of the root bucket specified on ViPR; in this case the object bucket name is ‘hdfs1’. Note that a job can also call out the ViPRFS URI explicitly, in which case the native HDFS can be left as the ‘fs.defaultFS’ parameter.
$ hdfs dfs -ls /
INFO vipr.ViPRFileSystem: Initialized ViPRFS for viprfs://hdfs1.namespace.ViPR
Found 4 items
drwx------ - root 0 2013-12-06 23:14 /input
drwx------ - root 0 2013-12-06 23:18 /output
drwx------ - root 0 2013-12-06 23:08 /user
drwx------ - root 0 2013-12-06 23:08 /yarn
Putting it All Together
Pairing an enterprise scale-out data services architecture with the scale-out compute infrastructure of EMC Pivotal leads to some really cool use cases. For example, using a simple S3Fox Firefox browser plug-in, I can upload datasets from any browser into a ViPR data store, and then act upon those datasets from my Pivotal Hadoop environment. The screenshots below show the upload from the browser, reading the dataset (moby-dick.txt), and then running a MapReduce wordcount job on the file, counting the number of times each word appears in the book Moby Dick.
# hdfs dfs -ls viprfs://hdfs1.ns.MSW/
13/12/13 22:20:37 INFO vipr.ViPRFileSystem: Initialized ViPRFS for viprfs://hdfs1.ns.MSW/
Found 3 items
-rw------- 1 root 274 2013-12-13 11:07 viprfs://hdfs1.ns.MSW:9040/10-words.txt
drwx------ - root 0 2013-12-13 11:04 viprfs://hdfs1.ns.MSW:9040/input
-rwx---r-- 1 root 1234589 2013-12-13 22:04 viprfs://hdfs1.ns.MSW:9040/moby-dick.txt
And here’s the really cool part…
We ran the MapReduce job using the ViPRFS URI object bucket as the input, and specified the native local cluster HDFS as the output.
$ hadoop jar hadoop-mapreduce-examples.jar wordcount viprfs://hdfs1.ns.MSW:9040/moby-dick.txt hdfs://lviprk128.lss.com:8020/output2 1
13/12/14 05:22:35 INFO vipr.ViPRFileSystem: Initialized ViPRFS (atom 1.0.0.0.771) for viprfs://hdfs1.ns.MSW:9040/moby-dick.txt
13/12/14 05:22:35 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/12/14 05:22:35 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/12/14 05:22:36 INFO input.FileInputFormat: Total input paths to process : 1
13/12/14 05:22:36 INFO mapreduce.JobSubmitter: number of splits:1
<…>
13/12/14 05:22:48 INFO mapreduce.Job: map 0% reduce 0%
13/12/14 05:22:58 INFO mapreduce.Job: map 100% reduce 0%
13/12/14 05:23:07 INFO mapreduce.Job: map 100% reduce 100%
13/12/14 05:23:07 INFO mapreduce.Job: Job job_1386903646746_0004 completed successfully
$ hdfs dfs -ls hdfs://lviprk128.lss.com:8020/output2
Found 2 items
-rw-rw-rw- 1 gpadmin hadoop 0 2013-12-14 05:23 hdfs://lviprk128.lss.com:8020/output2/_SUCCESS
-rw-rw-rw- 1 gpadmin hadoop 366601 2013-12-14 05:23 hdfs://lviprk128.lss.com:8020/output2/part-r-00000
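The part-r-00000 file the job leaves behind is plain text, one tab-separated word/count pair per line, so the results are easy to consume downstream. As a minimal sketch (the sample words and counts below are illustrative, not taken from the actual job output):

```python
def parse_wordcount(lines):
    # Each line of a wordcount part file is "<word>\t<count>"
    counts = {}
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        counts[word] = int(count)
    return counts

# Illustrative sample lines in the part-file format
sample = ["whale\t45\n", "ahab\t12\n"]
print(parse_wordcount(sample))  # {'whale': 45, 'ahab': 12}
```

In practice one would feed the function the output of `hdfs dfs -cat .../output2/part-r-00000`.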
The world has definitely changed.