Hadoop Grows Up: How Enterprises Can Successfully Navigate its Growing Pains

If you’d asked me 10 years ago whether enterprises would be migrating to Hadoop, I would’ve answered with an emphatic no. Slow to entice enterprise customers and named after a toy elephant, at first glance, the framework didn’t suggest it was ready for mass commercialization or adoption.

But the adoption of Hadoop among enterprises has been phenomenal. An open-source software framework, Hadoop gives enterprises the ability to store and process unprecedented volumes of data, a capability today's enterprise sorely needs. It has effectively become the default standard for storing, processing and analyzing massive quantities of data, from hundreds of terabytes up to petabytes.

While the adoption and commercialization of Hadoop are remarkable and, on the whole, a positive move for enterprises hungry for streamlined data storage and processing, enterprises are in for a significant challenge with the migration from Hadoop 2.X to 3.X.

Most aren’t sure what to expect, and few lived through the earlier migration’s pain points. Though Hadoop has “grown up”, in the sense that it is now used by some of the world’s largest enterprises, it still lacks a non-disruptive upgrade path when it jumps major releases.

Arriving in just a few short years, this next migration will have dramatic implications for the storage capabilities of today’s insurance companies, banks and largest corporations. It’s imperative that these organizations begin planning for the change now to ensure that their most valuable asset, their data, remains intact and accessible in an “always on” culture that demands it.

Why the Migration Matters

First, let’s explore the benefits of the migration and why, despite the headaches, the conversion will ultimately pay off for enterprises.

One of the key benefits of Hadoop 3.X is erasure coding, which dramatically decreases the amount of storage needed to protect data. In a traditional Hadoop system, files are replicated multiple times to protect against loss; if one copy is lost or corrupted, a replica can simply stand in for the original.

As you can imagine, replication shields against data failure, but it requires significant storage and is expensive. In fact, the default three-way replication imposes an additional 200 percent in storage space, along with other costs such as network bandwidth when the data is written.

Hadoop 3.X’s move to erasure coding resolves the storage issue while maintaining the same level of fault tolerance. In other words, erasure coding protects data as effectively as traditional replication but takes up far less storage. In fact, erasure coding is estimated to cut the storage cost by 50 percent, a huge financial boon for enterprises moving to Hadoop 3.X. With Hadoop 3.X, enterprises will be able to store twice as much data on the same amount of raw storage hardware.
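To put rough numbers on those claims, here is a minimal sketch comparing the raw storage the two schemes need for the same logical data, assuming the common HDFS-style defaults of three-way replication versus a Reed-Solomon 6-data/3-parity erasure coding policy. The function names and the 100 TB figure are illustrative, not drawn from any particular cluster.

```python
# Rough comparison of raw storage needed to hold 100 TB of logical data
# under 3x replication versus a Reed-Solomon (6 data, 3 parity) erasure
# coding policy. Figures assume common defaults; adjust for your own policy.

def replication_raw_tb(logical_tb, copies=3):
    # Each block is stored 'copies' times, so raw usage is a simple multiple.
    return logical_tb * copies

def erasure_coded_raw_tb(logical_tb, data_units=6, parity_units=3):
    # Each stripe of 'data_units' blocks gains 'parity_units' parity blocks,
    # so the overhead is parity_units / data_units (50% for RS 6+3).
    return logical_tb * (data_units + parity_units) / data_units

logical_tb = 100
rep = replication_raw_tb(logical_tb)   # 300 TB raw, 200% overhead
ec = erasure_coded_raw_tb(logical_tb)  # 150 TB raw, 50% overhead
print(f"3x replication: {rep:.0f} TB raw ({rep / logical_tb - 1:.0%} overhead)")
print(f"RS 6+3 erasure coding: {ec:.0f} TB raw ({ec / logical_tb - 1:.0%} overhead)")
```

For the same 100 TB of logical data, raw usage drops from roughly 300 TB to 150 TB, which is where the 50 percent reduction and the “twice the data on the same hardware” framing come from.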

That said, enterprises upgrading to Hadoop 3.X will face significant roadblocks in ensuring that their data remains accessible and intact during a complicated migration process.

Anticipating Challenges Ahead

For those of us who experienced it, the conversion from Hadoop 1.X to Hadoop 2.X was harrowing, requiring a complete unload of the data in the Hadoop environment and a complete re-load onto the new system. That meant long periods of data inaccessibility and, in some cases, data loss. Take a typical laptop upgrade and multiply the pain points a thousand-fold.

Data loss is no longer a tolerable scenario for today’s enterprises; it can have huge financial, not to mention reputational, implications. However, most enterprises adopted Hadoop after its last revamp, sidestepping the headaches of a major upgrade to their data storage and processing. These enterprises may not anticipate the challenges ahead.

The looming migration can have potentially dire implications for today’s enterprises. A complete unload and re-load of an enterprise’s data will be expensive, painful and fraught with the risk of data loss. Enterprises that don’t anticipate the headaches in store may skip the measures necessary to keep their data accessible, secure and protected.

Navigating the Migration Successfully

The good news is that there is a simple, actionable step enterprises can take to manage the migration and safeguard their data against loss, corruption and inaccessibility.

Enterprises need to determine whether their current system requires a complete unload and reload of their data. Most systems do, so it is crucial that enterprises understand their current system and its capabilities before the next Hadoop migration.

If an enterprise were running Isilon for Hadoop, for example, there would be no need to unload and re-load its data. The enterprise would simply point the newly upgraded compute nodes at Isilon, with limited downtime, no re-load time and no risk of data loss.

Isilon for Hadoop helps enterprises keep their data accessible and protected through the migration to an even stronger, more efficient Hadoop 3.X. While I’m eager for the next revamp of Hadoop and its tremendous storage improvements, today’s enterprises need to take precautionary measures before the jump to protect their data and ensure the transition is as seamless as possible.

About the Author: Keith Manthey

Keith is a CTO at Dell EMC with a passion for high performance computing, financial services and analytics. He brings more than 24 years of experience in identity fraud analytics, high performance computing and financial systems. Keith holds numerous patents in high performance computing and analytics and is an advisory board member of the University of Georgia’s Management of Information Systems School.