7 best practices for setting up Hadoop on an EMC Isilon cluster

If you’re considering adding an Apache™ Hadoop® workflow to your EMC® Isilon® cluster, you’re probably wondering how to set it up. The new white paper “EMC Isilon Best Practices for Hadoop Data Storage” provides useful information for deploying Hadoop in your Isilon cluster environment.

The white paper also introduces the unique approach that Isilon took to Hadoop deployments. In a typical Hadoop deployment, large unstructured data sets are ingested from storage repositories to a Hadoop cluster based on the Hadoop distributed file system (HDFS). Data is mapped to the Hadoop DataNodes of the cluster and a single NameNode controls the metadata. The MapReduce software framework manages jobs for data analysis. MapReduce and HDFS use the same hardware resources for both data analysis and storage. Analysis results are then stored in HDFS or exported to other infrastructures.

Traditionl Hadoop Deployment

In an EMC Isilon Hadoop deployment, the HDFS is integrated as a protocol into the Isilon distributed OneFS® operating system. This approach gives users direct access through the HDFS to data stored on the Isilon cluster using standard protocols such as SMB, NFS, HTTP, and FTP. MapReduce processing and data storage are separated, allowing you to independently scale compute and data storage resources as needed.

EMC Isilon Hadoop Deployment

Every node in the Isilon cluster acts as the NameNode and DataNode. Compute clients running MapReduce jobs can connect to any node in the cluster. Data analysis results can be accessed by Hadoop users through standard protocols without the need to export results.

To learn more about the benefits of Hadoop on Isilon scale-out network attached storage (NAS), read “Hadoop on EMC Isilon Scale-Out NAS” and “EMC Isilon Scale-Out NAS for In-Place Hadoop Data Analytics.”

Best practices for deploying Hadoop to your Isilon cluster

You can connect Apache Hadoop or an enterprise-friendly Hadoop distribution, such as Pivotal HD or Cloudera, to your Isilon cluster.

First, you’ll need to turn on the HDFS protocol in OneFS. Contact your account representative to complete this step. Next, follow these best practices:

  1. Review the EMC Hadoop Start Kit 2.0. Visit the EMC Hadoop Starter Kit (HSK) 2.0 for step-by-step guides on how to connect a Hadoop distribution to your Isilon cluster. HSK guides are available for Apache Hadoop, Pivotal HD, Cloudera, and Hortonworks. A video demonstration for Pivotal HD is also available.
  2. Find your Isilon cluster’s optimal point to help determine the number of nodes that will best serve your Hadoop workflow and compute grid. The optimal point is the point at which it scales in processing MapReduce jobs and reduces run times in relation to other systems for the same workload. Contact your account representative to help you determine this information.
  3. Create directories and set permissions. OneFS controls access to directories and files with POSIX mode bits and access control lists (ACLs). Make sure directories and files are set up with the correct permissions to ensure that your Hadoop users can access their files.
  4. Don’t run NameNode and DataNode services on clients. Because the Isilon cluster acts as the NameNode and DataNodes for the HDFS, these services should only run on the cluster and not on compute clients. On compute clients, you should only run MapReduce processes.
  5. Increase the HDFS block size from the default 64 MB to 128 MB to optimize performance. Boosting the block size lets Isilon nodes read and write HDFS data in larger blocks. The result is an increase in performance of MapReduce jobs.
  6. Store intermediate jobs on an Isilon cluster. A Hadoop client typically stores its intermediate map results locally. The amount of local storage available on a client affects its ability to run jobs. Storing map results on the cluster can help performance and scalability.
  7. Consult the Isilon best practices white paper for additional tips. You can find more details about some of these best practices in “EMC Isilon Best Practices for Hadoop Data Storage.” You can also find additional tips for tuning OneFS for HDFS operations, using EMC Isilon SmartConnect™ for HDFS, aligning datasets with storage pools, and securing HDFS connections with Kerberos.

If you have questions related to Hadoop and your Isilon environment, contact your account representative. If you have documentation feedback or want to request new content, email isicontent@emc.com.

[display_rating_result]

About the Author: Kirsten Gantenbein