EMC and RainStor Optimize Interactive SQL on Hadoop

Pivotal HAWQ was one of the most groundbreaking technologies entering the Hadoop ecosystem last year through its ability to execute complete ANSI SQL on large-scale datasets managed in Pivotal HD. This was great news for SQL users – organizations heavily reliant on SQL applications and common BI tools such as Tableau and MicroStrategy can leverage these investments to access and analyze new data sets managed in Hadoop.

Similarly, RainStor, a leading enterprise database known for its efficient data compression and built-in security, enables organizations to run ANSI SQL queries against highly compressed data in Hadoop. Because extreme compression reduces the data footprint (typically by 90% or more), RainStor lets users run analytics on Hadoop much more efficiently. In many instances, queries run significantly faster thanks to the reduced footprint combined with filtering capabilities that figure out what not to read. This allows customers to minimize infrastructure costs and maximize insight when analyzing larger data sets.
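
The idea of filtering that figures out what not to read can be illustrated with a minimal sketch. It assumes each stored block keeps a small min/max summary for a column, so a query can skip any block whose value range cannot match. The `Block` class, the `scan` function, and the sample values are hypothetical; this is a generic illustration of the technique, not RainStor's actual implementation or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical storage block with a min/max summary for one column."""
    min_value: int
    max_value: int
    rows: List[int]


def scan(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return rows with low <= value <= high and the number of blocks read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip the block when its value range cannot overlap the query range.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.rows if low <= v <= high)
    return matches, blocks_read


# Illustrative data only: three blocks covering disjoint value ranges.
blocks = [
    Block(1, 10, [1, 4, 10]),
    Block(11, 20, [11, 15, 20]),
    Block(21, 30, [21, 25, 30]),
]
rows, blocks_read = scan(blocks, 12, 18)
print(rows, blocks_read)  # prints: [15] 1
```

In this toy example only one of the three blocks is read, which is the kind of saving that compounds with a smaller compressed footprint.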

Serving some of the largest telecommunications and financial services organizations, RainStor enables customers to readily query and analyze petabytes of data instead of archiving data sets to tape and then having to reload them whenever they are needed for analysis. RainStor chose to partner with EMC Isilon scale-out NAS for its storage layer to manage these petabyte-scale data environments even more efficiently. Using Isilon, compute and storage for Hadoop workloads are decoupled, enabling organizations to balance CPU and storage capacity optimally as data volumes and the number of queries grow.

Furthermore, not only can organizations run the Hadoop distribution of their choice with RainStor-Isilon, but they can also run multiple Hadoop distributions against the same compressed data. For example, a single copy of the data managed in RainStor-Isilon can serve Marketing’s Pivotal HD environment, Finance’s Cloudera environment, and HR’s Apache Hadoop environment.

To summarize, by running RainStor and Hadoop on EMC Isilon, you achieve:

  • Flexible Architecture Running Hadoop on NAS and DAS together: Companies leverage DAS local storage for hot data where performance is critical and use Isilon for mass data storage. With RainStor’s compression, you efficiently move more data across the network, essentially creating an I/O multiplier.
  • Built-in Security and Reliability: Data is securely stored with built-in encryption and data masking, in addition to user authentication and authorization. You also benefit from EMC Isilon FlexProtect, which carries very little overhead and provides a reliable, highly available Big Data environment.
  • Improved Query Speed: Data is queried using a variety of tools including standard SQL, BI tools, Hive, Pig and MapReduce. With built-in filtering, queries speed up by a factor of 2 to 10 compared to Hive on HDFS/DAS.
  • Compliant WORM Solution: For absolute retention and protection of business critical data, including stringent SEC 17a-4 requirements, you leverage EMC Isilon’s SmartLock in addition to RainStor’s built-in immutable data retention capabilities.

I asked Jyothi Swaroop, Director of Product Marketing at RainStor, to explain the value of deploying EMC Isilon with RainStor and Hadoop.

1.  RainStor is known in the industry as an enterprise database architected for Big Data. Can you please explain how this technology evolved and what needs it addresses in the market?

The RainStor archiving solution evolved out of work that started over 10 years ago to find novel ways to efficiently compress and retrieve structured data. At the time, the requirement was to get the best space/time performance out of under-powered military hardware (the initial product was funded by the UK Ministry of Defence).

Because this project was so successful, RainStor began extending this solution to enterprises challenged with the rising costs of storing historical data. Today, with the explosion of Big Data, the requirement is similar – squeezing more capability out of commodity hardware, especially if you are a Financial Services, Telco, or Government organization requiring efficient storage of and access to petabytes of structured, historical data.

Netezza, Teradata, and other heavy-duty data warehousing platforms do not address the three things RainStor addresses really well – Cost, Complexity, and Compliance. We already addressed ‘cost’ with RainStor’s efficient data compression. In terms of ‘complexity’, these other data warehousing platforms are complex to manage and are overkill, since historical data sets may not be frequently accessed and analyzed. RainStor is easy to deploy and manage, especially with EMC Isilon scale-out NAS as the storage layer. ‘Compliance’ is addressed right out of the box. RainStor has built-in compliance features, so organizations don’t need to spend time gathering information or hiring compliance consultants each time a new mandate arises.

2.  How do you augment RainStor into an organization’s data environment? How does it work?

RainStor’s archiving solutions first went GA in 2008. Since then we’ve learned a great deal about what is required to be a full-time member of an enterprise’s data ecosystem. The answer is to ensure that we follow open standards in terms of the integration points between RainStor and the rest of the enterprise. Take security, for example: RainStor fully integrates with the LDAP and Active Directory authentication services common in large organizations. From a reporting perspective, RainStor has ODBC and JDBC drivers that ensure that business users can employ the same tools and techniques they’ve always used to access data stored in RainStor.

3.  RainStor is storage platform agnostic – it can be deployed on Hadoop, on EMC Isilon, or as a hybrid of Hadoop with EMC Isilon. Can you please explain the optimal use cases for all three configurations?

RainStor on Isilon is our recommended configuration for all NAS deployments. The scale-out nature of storage here matches RainStor’s scale-out compute approach. For customers wanting to archive massive amounts of structured data for compliance purposes, or as a tape archiving replacement, this is the one to go for. The high storage density and utilization of Isilon, together with RainStor’s data compression and our analytic SQL query capabilities make for a very compelling solution.

When the application has analytics requirements that depend on tools from the Hadoop ecosystem, then we’d recommend matching up RainStor with Hadoop on Isilon. You get all the benefits of the pure RainStor and Isilon solution, but with the added capabilities around being able to access RainStor data via Hive, Pig or MapReduce.

The final hybrid configuration – RainStor on Hadoop DAS with RainStor on Isilon NAS – provides the ability to tier data. This is a good match for existing Hadoop users who want to take advantage of the storage features that are unique to Isilon, such as replication, snapshotting and compliance, in addition to Isilon’s high storage utilization and density. For example, a Financial Services organization may have unstructured social media data on Hadoop DAS and structured sensitive financial data on Isilon NAS.

4.  Hadoop is often criticized for lack of security and privacy. Please explain how RainStor provides multi-layer security to minimize security threats and enhance regulatory compliance.

The flaws in Hadoop’s security model are real and well publicized. RainStor tackles this problem by extending our enterprise-proven security framework to RainStor data on Hadoop. The same encryption, data masking, authentication, role-based security and auditing capabilities our RainStor on Isilon NAS deployments enjoy are made available to RainStor running on Hadoop. This includes MapReduce, Hive and Pig jobs that run through Hadoop over RainStor data.

5.  Can you provide an optimal use case for RainStor with Hadoop and Isilon?

Data warehouse offload is a classic RainStor on Isilon use case, providing a stable and proven online archiving solution over tape archiving. In one recent example, a banking customer offloaded 120TB of data from Netezza into RainStor on Isilon. The footprint of the data on Isilon was only 5TB after RainStor’s compression was applied. Having freed up this space in the source data warehouse by offloading the older records to RainStor, the bank is now able to apply the specialized analytics capabilities of Netezza to its more current data. The bank can still access the older records in RainStor using the same SQL reports it runs on Netezza – indeed, some queries were 3 times faster in RainStor than on the source warehouse. Speed and efficiency are real benefits realized.
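
To put the banking example in numbers, the short sketch below works out the compression ratio and footprint reduction implied by the 120TB and 5TB figures quoted above. The function name `compression_summary` is a hypothetical helper introduced only for this illustration.

```python
from typing import Tuple


def compression_summary(original_tb: float, stored_tb: float) -> Tuple[float, float]:
    """Return (compression ratio, percentage footprint reduction)."""
    ratio = original_tb / stored_tb
    reduction_pct = (1 - stored_tb / original_tb) * 100
    return ratio, reduction_pct


# Figures from the banking example: 120TB offloaded, 5TB stored on Isilon.
ratio, reduction_pct = compression_summary(120, 5)
print(f"{ratio:.0f}:1 ratio, {reduction_pct:.1f}% smaller footprint")
# prints: 24:1 ratio, 95.8% smaller footprint
```

A reduction of roughly 96% is consistent with the "typically 90%+" compression figure cited earlier in this post.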

Analytical archiving with Hadoop is another use case, where data such as Telco CDRs is generated and needs to be analyzed at high speed. With RainStor and Isilon, you are able to quickly capture, compress, and encrypt data for subsequent analysis with Hadoop. Compliance archiving and reviving tape archives are other use cases.

About the Author: Mona Patel