Navigating a Data Lake


Here in Seattle, we have a stunning lake on the edge of our downtown called Lake Union. The lake is home to many houseboats, including the one filmed in “Sleepless in Seattle,” and is a haven for sailboats, kayakers and seaplanes – in short, a true beehive of activity!

Even though the lake can be crowded, Seattle does a great job of managing activity on it. Restrictions on the number of houseboats, designated landing areas for seaplanes, and police patrol boats all work together to keep everything moving in an orderly fashion. I can’t help but think about the parallels between what happens on Lake Union on a daily basis and what is transpiring in the emerging world of what is referred to as a “data lake.”

A data lake is a repository for all kinds of data. Data can be placed in the lake through a variety of means, and that same data can be consumed through different mechanisms without needing to copy or export anything. As a result, data lakes are an order of magnitude more scalable than existing approaches to data warehousing and business analytics. Given the remarkable rate of information growth, however, a data lake must above all be built to scale seamlessly, predictably and efficiently. As businesses learn to harness their information, data lakes and their applications take on strategic importance. The data lake must support existing applications and seamlessly enable new ones. It is also increasingly important to protect and back up the data lake efficiently, to ensure that it integrates with directory and security services, and to make it easy to manage over time.

Within EMC, we at Isilon have been focusing on developing some of these capabilities. Over the last couple of years, we’ve been enhancing the OneFS operating system and collaborating with key partners to ensure that our customers can effectively manage their data lakes. If any of you grew up around lakes, you probably remember finding a solid foundation to dive from, and then once you were comfortable, climbing to higher ground and really taking a deep plunge!  We’re using the same philosophy with our approach to data lakes. We have maintained a strong footing in our traditional offerings around enterprise file applications such as archive, home directories and HPC, while expanding and building new solutions for mobile, cloud, analytics and software-defined storage.

In addition, by natively incorporating the Hadoop Distributed File System (HDFS) protocol into OneFS, we enable companies to bring Hadoop to their Big Data rather than the other way around. This allows enterprises to avoid the CapEx cost of purchasing a separate Hadoop infrastructure and to start getting results faster, because they don’t need to spend time moving petabytes of data. They can also access the home directories and file shares contained in their data lakes from virtually any mobile device using Syncplicity technology.
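The multi-protocol, in-place idea can be sketched with a small stand-alone toy (plain Python standard library only – no Isilon, OneFS or Hadoop APIs are involved): a file is written once via direct filesystem access, and the very same bytes are immediately readable over an HTTP endpoint, WebHDFS-style, with no copy or export step in between. The directory, file name and contents here are invented for illustration.

```python
# Toy sketch of multi-protocol access to a single copy of the data.
# Stand-ins: a local directory plays the "data lake" share, and a tiny
# HTTP server plays the REST-style access path. This is not Isilon code.
import functools
import http.server
import os
import tempfile
import threading
import urllib.request

lake = tempfile.mkdtemp()  # the shared storage, written exactly once

# "NFS-style" write: an application drops a file straight onto the share.
with open(os.path.join(lake, "events.csv"), "w") as f:
    f.write("user,action\nalice,login\n")

# "REST-style" read: serve the same directory over HTTP -- no copy made.
handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=lake)
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

data = urllib.request.urlopen(f"http://127.0.0.1:{port}/events.csv").read().decode()
server.shutdown()
print(data)  # the same bytes the writer produced, read over a second protocol
```

An analytics job pointed at the same share would likewise read the original file in place; the only point of the sketch is that adding an access protocol does not require duplicating the data.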

While other parts of EMC are focusing on complementary capabilities related to data lakes, these are just a few of the areas where we at Isilon are helping folks to successfully realize the possibilities that exist.

  • Santhosh680

    The data lake concept is great, but what about performance issues? I mean, the management part is not a problem, but what about extracting information faster compared to other existing clustered file systems for scale-out NAS? Correct me if I’m wrong; I’m new to this area at the enterprise level.

    • Bill Richter

      We agree! Performance and fast access to data in the EMC Isilon Data Lake are as important as ease of management and the scaling of capacity. As I tried to depict with my description of Lake Union, an Isilon Data Lake is the hub of an enterprise’s unstructured data activity. It’s not a “Lake Placid!” Instead, the Data Lake is an active source of data for today’s file-based applications, as well as for new and emerging workflows in mobility, analytics and cloud. And when it comes to performance, Isilon can scale to over 100 GB/sec of aggregate throughput and an industry-leading 1.6 million SPECsfs ops/sec!

  • cory minton

    Bill is spot on. While dedicated infrastructure for a Hadoop environment might seem to allow faster queries, that view overlooks the more holistic measure of time to value. In most Hadoop environments, a landing zone (staging area, scratch space, what have you…) is required to get data into the environment, or the data serves other purposes via alternative protocols (think data written by a mobile app using a REST API for storage puts that we then want to query against). That data then has to be migrated into the analytical environment before it can be queried. The beauty of Isilon’s scale-out architecture, which leverages a cluster of nodes, is that data can be queried in place using native HDFS protocol access. Data can be sourced and served by the other standard NAS protocols – CIFS, NFS, HTTP, FTP and REST – yet need not be migrated to a separate silo of spinning disks. So, once data lands on Isilon, it is immediately available to Hadoop queries from your favorite flavor or distribution of Hadoop. Compute resources (virtual or physical) can be scaled independently of storage capacity, which aligns costs to needs more effectively. And let’s not forget that we also bring Kerberized authentication, NameNode redundancy, replication, and data protection to Hadoop in ways not currently available in standard DAS deployments. Lots of fun stuff to talk about in the Data Lake…