Moving Beyond Proof of Concept: Optimizing AI Infrastructure for Large-Scale Production Deployments

A few weeks ago I was interviewed by Roger Magoulas, VP of O’Reilly Media, at the O’Reilly Artificial Intelligence Conference in San Jose. Our conversation focused on moving beyond the Artificial Intelligence buzz: how organizations can actually design and deploy the optimal IT infrastructure for different AI use cases as they move their proof-of-concept AI work into real production environments. With initiatives of this nature, it’s important to consider how AI drives demand for higher processing power and throughput. I’ve embedded the video of our interview below.

As I discussed in the video, AI use cases fall into two main categories: Machine Learning (ML) and Deep Learning (DL). Their processing characteristics are quite different, and each requires a specific envelope of compute and storage performance and scale, as highlighted in Figure 1 below.

Figure 1: Compute and Storage requirements of ML and DL use cases

ML pipelines typically process semi-structured data generated by machines (servers, mobile phones, IoT sensors, etc.), with datasets ranging from tens or hundreds of terabytes up to a petabyte or two. ML workloads can be adequately serviced by a few hundred to a few thousand servers. DL is a wholly different scenario. Its datasets are predominantly unstructured data such as images, video, and audio, typically expanding to multiple petabytes and requiring compute clusters of many thousands of nodes, which may justify GPU investment to cut down on data center cost and footprint.
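To make the GPU cost-and-footprint argument concrete, here is a back-of-envelope sketch in Python. Every figure in it (the CPU cluster size, the per-node GPU speedup, the rack density, and the power draw) is an illustrative assumption, not a benchmark:

```python
# Illustrative footprint comparison for a DL training farm. All figures are
# assumptions chosen to make the arithmetic concrete, not measured numbers.

CPU_NODES_REQUIRED = 4_000   # assumed size of an equivalent CPU-only cluster
GPU_SPEEDUP = 20             # assumed: one GPU node does the DL work of ~20 CPU nodes
NODES_PER_RACK = 40          # assumed rack density for both node types
CPU_NODE_KW, GPU_NODE_KW = 0.5, 3.0   # assumed power draw per node, in kW

gpu_nodes = CPU_NODES_REQUIRED / GPU_SPEEDUP
print(f"GPU nodes needed: {gpu_nodes:,.0f}")                       # 200
print(f"Racks: {CPU_NODES_REQUIRED / NODES_PER_RACK:,.0f} CPU vs "
      f"{gpu_nodes / NODES_PER_RACK:,.0f} GPU")                    # 100 vs 5
print(f"Power: {CPU_NODES_REQUIRED * CPU_NODE_KW:,.0f} kW CPU vs "
      f"{gpu_nodes * GPU_NODE_KW:,.0f} kW GPU")                    # 2,000 vs 600
```

Even if the speedup assumption is off by a factor of two in either direction, the rack and power arithmetic explains why multi-petabyte DL workloads often justify the GPU investment.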

The location of an organization’s data must also be considered carefully when designing a production deployment architecture for AI platforms. In general, my recommendation is to build your AI platforms in the same location as your data. If the majority of your data is generated in the cloud, it may make sense to run your AI workflows there too, as you’d likely face substantial data egress charges to move the data on-prem. On the other hand, if your data resides on-prem, deploy your AI platforms on-prem as well. This minimizes the cost of managing the data and the latency of accessing it, as well as the need to run data migration projects as a precursor to data analytics.
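A quick data-gravity calculation shows why. The dataset size and egress rate below are assumptions for illustration; check your provider’s current pricing:

```python
# Rough data-gravity check: what would it cost to move cloud-resident data
# on-prem before analyzing it? Both figures below are assumptions.

DATASET_TB = 500              # assumed size of the cloud-resident dataset
EGRESS_USD_PER_GB = 0.09      # assumed cloud egress rate, in USD per GB

one_time_egress = DATASET_TB * 1_000 * EGRESS_USD_PER_GB
print(f"One-time egress to move {DATASET_TB} TB on-prem: ${one_time_egress:,.0f}")
# 500 TB x 1,000 GB/TB x $0.09/GB = $45,000, before the recurring cost of
# re-syncing newly generated cloud data.
```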

What are other key considerations as you move from PoC to production for your ML and DL workloads? In my work with customers across many industries, I see these common themes:

  1. Data Consolidation – analytics is cumbersome when data is scattered across an organization. It is better practice to consolidate the data, ideally in one location but more practically in just a few.
  2. Decouple compute from storage – an organization’s data may not change significantly over time, but the applications and tools used to analyze it will. It therefore makes sense to separate compute from storage: you can point evolving server-based applications and tools at the data where it lives, without moving it around as your compute needs change (see the sketch after this list).
  3. Storage Scaling – as data ages, its value changes. This is especially true for historical data. Capacity scaling should be designed around this dynamic in order to remain cost-effective.
  4. Data Governance – as AI becomes more prevalent, protecting and securing the data gains importance. Data quality, security, protection, lineage, and metadata tracking (considerations taken for granted in the Business Intelligence world) are just as key in the AI world.
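As a minimal sketch of point 2, the snippet below reads a dataset in place through a URI instead of copying it onto the compute nodes. The mount point and dataset path are hypothetical, and reading s3:// URIs with pandas assumes the optional s3fs package is installed:

```python
import os
import pandas as pd  # with s3fs installed, pandas reads s3:// URIs with the
                     # same call it uses for local or NFS paths

# The data stays put; only this URI changes as compute clusters come and go.
# Both example locations are hypothetical:
#   NFS mount of the storage platform:  /mnt/datalake/clickstream/
#   the same data over an object protocol:  s3://datalake/clickstream/
DATA_URI = os.environ.get("DATA_URI", "/mnt/datalake/clickstream/")

def load_training_frame(uri: str) -> pd.DataFrame:
    """Read the dataset in place, without staging a copy on the compute tier."""
    return pd.read_parquet(uri)

df = load_training_frame(DATA_URI)
print(df.shape)
```

When the analysis tools change next year, only DATA_URI and the code around it change; the data itself never moves.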

Today, data consolidation paradigms have shifted to the concept of a Data Lake. We define a Data Lake as an architectural paradigm that consolidates enterprise data, lets storage scale independently from compute, and supports Analytics and AI applications with varying I/O signatures and performance requirements, with data governance capabilities delivered out of the box. With our industry-leading Dell EMC Isilon scale-out NAS platforms, we’ve been driving this idea for some time as a way to store and manage exploding volumes of unstructured data.

To us, however, the Data Lake is not just a marketing buzzword; we put real architectural structure and requirements behind it. One key requirement is the ability to support multiple access protocols and applications with differing characteristics, whether real-time or batch, and with varying latency needs. Another is the ability to easily and efficiently access data of differing temperatures, whether “hot” or “cold.” To transparently archive data for AI platforms, we also offer Dell EMC ECS, our flagship distributed object store.
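To sketch what multi-protocol access looks like from the application side, the snippet below reads the same object over a file protocol (an NFS mount) and over an S3-compatible object endpoint. The mount point, endpoint URL, bucket, and credentials are hypothetical placeholders; actual protocol support and configuration depend on the storage platform and its version:

```python
import boto3

# One dataset, two access paths; no second copy of the data.

# 1) File protocol: a batch job reads an image over an NFS mount.
nfs_path = "/mnt/datalake/images/train/cat_0001.jpg"   # hypothetical mount
with open(nfs_path, "rb") as f:
    image_bytes_nfs = f.read()

# 2) Object protocol: another application fetches the same image through an
#    S3-compatible endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com:9021",  # assumed endpoint
    aws_access_key_id="ACCESS_KEY",                       # placeholder
    aws_secret_access_key="SECRET_KEY",                   # placeholder
)
obj = s3.get_object(Bucket="datalake", Key="images/train/cat_0001.jpg")
image_bytes_s3 = obj["Body"].read()
```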

Dell Technologies has helped many customers around the world unlock the value of their Data Capital for Digital Transformation. With our broad AI platform portfolio, I’m certain we can assist your organization in its journey to AI.

Want to learn more about key infrastructure decisions to contemplate as you move forward in your AI journey? View the Moor Insights & Strategy report: Enterprise Machine & Deep Learning with Intelligent Storage.

About the Author: Sai Devulapalli

Sai Devulapalli is an accomplished leader with 21 years of broad experience in building, launching, and managing B2B product and solution portfolios and services lines of business, forging strategic partnerships, and aligning portfolio offerings in major acquisitions. His domain expertise spans Data Analytics, the Internet of Things, and Enterprise PaaS in the IT and Telecom industries. Devulapalli is currently responsible for the Analytics and Hadoop lines of business within Dell EMC’s emerging storage portfolio, where he manages the global business, portfolio, and partnerships, along with the integration of former Dell and EMC product offerings with industry-leading Analytics and Hadoop stacks into a cohesive solution portfolio.