Pandas, Caviar, and Deep Learning

Analytics is redefining the world. Data is the new oil. Artificial intelligence (AI) is everywhere.

It feels like we hear some variant of these phrases too often. But there is a reason why such statements have become so pervasive. Data has quickly become the most valuable and differentiating asset for many organizations as it opens up new revenue streams and unearths opportunities for improvement in almost every part of an organization.

Innovative companies are devoting an ever-increasing amount of resources to collecting, storing, and analyzing data in order to squeeze every drop of value out of this asset. Companies that haven’t been investing in advanced analytics technologies, like deep learning, expose themselves to major disruption and will be forced to evolve quickly or risk extinction.

As one would expect with any early-stage technology, as organizations prioritize analytics, toolkits and use cases have proliferated. At times, it feels like there’s a new analytics tool or data framework every day. And right now, the area heating up the fastest is AI, specifically deep learning, which takes a more iterative, layered approach to problems like image classification and natural language processing.

Data is at the heart of AI

From what I’ve seen, how an organization utilizes its DATA continues to be the differentiator that separates the leaders from the rest of the pack. To be more precise, it’s the types of data, the depth of data, and the uniqueness of the data available for analysis.

Unstructured data gives organizations the opportunity to significantly move the needle, as it exposes rich, unique data sets consisting of images, video, and streaming IoT data that previously went untapped in the data analytics space. The bad news is that the data requirements of deep learning (DL) models are quite different from the traditional data that organizations are used to dealing with. For example, most DL use cases are image- and video-heavy workloads that are huge in size and virtually incompressible. In my experience, I’ve only been able to get up to 4.6 percent compression using the most extreme lossy algorithm on ImageNet, which doesn’t allow much in the way of savings or efficiency when managing billions or trillions of files. Additionally, the advent of GPUs to handle highly iterative deep learning models from frameworks like TensorFlow has added concurrency as a new infrastructure requirement, as files are often read millions of times by a single layer of a convolutional neural network (CNN).
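To make the concurrency point concrete, here is a minimal sketch (my own illustration, not from the original post) of an image input pipeline using TensorFlow’s tf.data API. The directory path, image size, and batch size are hypothetical placeholders; the point is that the same files are re-read on every epoch and that reads and decodes are issued in parallel to keep the GPUs fed:

```python
# Minimal sketch of an image input pipeline, assuming a hypothetical local
# directory of JPEG files. It illustrates how a framework like TensorFlow
# re-reads the same files every epoch and issues many reads in parallel.
import tensorflow as tf

def decode_image(path):
    # Each file is read back from storage again on every epoch.
    raw = tf.io.read_file(path)
    image = tf.io.decode_jpeg(raw, channels=3)
    image = tf.image.resize(image, [224, 224])
    return image

dataset = (
    tf.data.Dataset.list_files("/data/imagenet/train/*.jpg")  # hypothetical path
    .map(decode_image, num_parallel_calls=tf.data.AUTOTUNE)   # concurrent reads/decodes
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)                                # overlap I/O with GPU compute
    .repeat()                                                  # files re-read every epoch
)
```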

These extreme DL requirements for performance, scale, and flexibility don’t neatly conform to traditional block storage boundaries. Luckily, game-changing innovations in unstructured storage and computing (the rise of the GPU), paired with deep learning networks for image recognition and natural language processing, are finally enabling access to and analysis of data that was hard to draw much insight from a decade ago.

AI infrastructure: Pandas vs. Caviar

As anyone who has taken Andrew Ng’s Deep Learning AI course will tell you, Andrew argues that it’s less about the algorithms and math and more about the data: data curation, data engineering, data labeling, and management of the sprawling infrastructure needed to support hundreds of terabytes (TB) to hundreds of petabytes (PB) of pictures, videos, and streaming sensor data. Thus, one of the biggest challenges in the AI space is managing all that unstructured data with the right infrastructure stack, one that can both persist petabytes of data and feed it to data-hungry AI models.

To quote Andrew Ng again, the infrastructure choices are all about pandas and caviar.

The “raising pandas” approach refers to the fact that adult pandas care for one cub at a time, or in this case, focus on training one model at a time. For example, if you’re a smaller company or you’re just starting on your AI journey, you might only have the infrastructure and computational capacity to train, measure, and tune a single model at a time. This is a great solution if you only have one data scientist, if there aren’t many use cases for deep learning in your organization, or if the model requires only a modest amount of data (<100 TB) to train the algorithm.
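As a rough illustration (my own, not from Ng’s course or the original post), “babysitting” a single model in the panda style often means watching one run’s validation curve and nudging it as it trains. A minimal Keras sketch on placeholder data might look like this:

```python
# Minimal sketch of the "panda" approach: one model, babysat during training
# with callbacks that adjust the learning rate and stop early, rather than
# launching many runs in parallel. The data here is a random placeholder.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 32).astype("float32")    # placeholder features
y_train = np.random.randint(0, 2, size=(1000,))          # placeholder labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# "Babysitting" the single run: lower the learning rate when validation loss
# plateaus, and stop once it no longer improves.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
]
model.fit(x_train, y_train, validation_split=0.2,
          epochs=50, batch_size=64, callbacks=callbacks, verbose=0)
```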

The “caviar” approach refers to laying thousands of eggs: training many models in parallel and then picking the one with the best learning curve. This approach is often required when petabytes of data are involved, as in autonomous driving and fraud detection. Such solutions require a large-scale, complex distributed computing environment to train multiple deep learning models simultaneously and find a best-fit model. In the more advanced cases, these environments can even be automated and managed with a container-based solution like BlueData to provide an elastic “as-a-service” approach.
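By contrast, here is a minimal sketch of the caviar idea (again my own illustration, with placeholder data and hyperparameter ranges): spawn several candidate models with different hyperparameters and keep the one with the best validation curve. The loop below runs the candidates sequentially for brevity; in production each would be its own job scheduled in parallel on separate GPU nodes.

```python
# Minimal sketch of the "caviar" approach: many candidate models, keep the
# one whose validation curve is best. Data and search space are placeholders.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 32).astype("float32")    # placeholder features
y_train = np.random.randint(0, 2, size=(1000,))          # placeholder labels

def build_model(learning_rate, hidden_units):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Random hyperparameter "eggs"; each would normally train as its own parallel run.
candidates = [{"learning_rate": 10 ** np.random.uniform(-4, -2),
               "hidden_units": int(np.random.choice([32, 64, 128]))}
              for _ in range(5)]

results = []
for params in candidates:
    history = build_model(**params).fit(x_train, y_train, validation_split=0.2,
                                        epochs=5, batch_size=64, verbose=0)
    results.append((max(history.history["val_accuracy"]), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(f"Best validation accuracy {best_score:.3f} with {best_params}")
```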

Dell EMC paves the road

Dell EMC is at the forefront of AI, with decades of experience in high performance computing (HPC) and big data that provide the technology, experience, and expertise to deliver best-of-breed products and end-to-end solutions that accelerate AI initiatives. Whether you’re a panda or a caviar organization, deep learning has a diverse set of requirements spanning compute, memory, I/O, and disk capacity profiles, but at the end of the day, it all starts with the data.

Dell EMC Isilon F800 All-Flash Scale-out NAS is uniquely suited for modern deep learning, delivering the flexibility to handle any data type, scalability for datasets ranging into the petabytes, and concurrency to support massive concurrent I/O requests from the GPUs. Isilon’s scale-out architecture eliminates the I/O bottleneck between storage and compute, allowing you to start with tens of TB of data at up to 15 GB/s of bandwidth and then scale out to 68 PB with up to 540 GB/s of performance in a single cluster. This allows Isilon to accelerate AI innovation with faster model training, provide more accurate insights from deeper data sets, and deliver a higher ROI by fully saturating the data requests of up to thousands of GPUs per cluster.

To make this even simpler, Dell EMC has taken the guesswork out of delivering AI stacks with its new Dell EMC Ready Solutions for AI. The Deep Learning with NVIDIA scale-out design features Dell EMC PowerEdge servers with NVIDIA GPU acceleration and Dell EMC Isilon to serve up massive amounts of data for faster, deeper insights. These solutions start small and can then scale non-disruptively by adding compute and/or storage to meet your AI needs.

Summary

Regardless of your data types and how you extract value from your data, remember that the companies with the most innovative and actionable DATA win. As your AI requirements mature from pandas to caviar, the infrastructure decisions you make today will have a major impact on your business tomorrow.

If you’d like to learn more about why Isilon is the right data platform for deep learning, refer to the Isilon Analytics whitepaper for guidance on architecting AI workloads and the results of multiple industry-standard image classification benchmarks using TensorFlow. We’d love to hear how you are forging ahead on this path and finding different ways around these challenges.

About the Author: Keith Manthey

Keith is a CTO for Dell EMC with a passion for high performance computing, financial services, and analytics. He brings more than 24 years of experience in identity fraud analytics, high performance computing, and financial systems. Keith holds numerous patents in high performance computing and analytics and is an advisory board member of the University of Georgia’s Management of Information Systems School.