XtremIO’s In-Memory Metadata Architecture – Fact and Fiction

During our launch last week we discussed the amazing benefits of XtremIO’s In-Memory Metadata architecture.  Some have seen fit to FUD this approach as risky – what happens to valuable metadata if a controller (or power) fails and memory contents are lost?  Believe it or not, we did think about these things when designing XtremIO.  So let us clear the air – in-memory metadata is a run-time capability of the array that significantly boosts performance. Metadata is not exclusively kept in memory.  It is also journaled, protected, and hardened to SSD and can tolerate any failure event in the array.  We couldn’t cover every detail of this during a one-hour launch event, so here’s what we didn’t have time to say last week.

To set the foundation, let’s briefly review the concept of metadata. In the context of storage systems, metadata is simply useful internal information managed by the array to describe and locate user data.  All modern arrays abstract the physical media and present logical (virtualized) addresses to clients in the form of LUNs.  The mapping between the logical address and physical address is a form of metadata that the array needs to manage.  That’s typically the most common form of metadata for SAN storage systems.  Newer architectures manage additional metadata to implement richer capabilities. For example, snapshots, change tracking for efficient remote replication, deduplication pointers, and compression all involve managing some form of metadata.
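To make the idea concrete, here is a minimal Python sketch of the classic logical-to-physical mapping. The class and field names are purely illustrative and are not XtremIO’s actual data structures:

```python
# Illustrative sketch of the most basic form of array metadata:
# a map from a logical (LUN, LBA) address to a physical location.
class MappingTable:
    def __init__(self):
        # (lun_id, logical_block) -> (ssd_id, physical_block)
        self._map = {}

    def write(self, lun_id, lba, ssd_id, pba):
        # Record where the array actually placed the client's block.
        self._map[(lun_id, lba)] = (ssd_id, pba)

    def resolve(self, lun_id, lba):
        # Translate a client's logical address into a physical one.
        return self._map.get((lun_id, lba))

table = MappingTable()
table.write(lun_id=7, lba=1024, ssd_id=3, pba=55)
assert table.resolve(7, 1024) == (3, 55)
```

Every additional feature (snapshots, deduplication, compression) adds further tables of this general kind, which is why metadata volume grows with array functionality.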

Computers operate on information in memory, so by definition if a system is performing metadata operations then the metadata must be in memory.  So why are we making a big deal about our in-memory metadata architecture?  That’s where the related concept of “working set” comes in.  The working set can be defined as the collection of information the system needs to operate for a given time interval.  In the current context, that’s all the metadata needed to service client data requests within some acceptable response time.  For All-Flash Arrays that “acceptable response time” is typically below one millisecond.

Depending on architecture and implementation, not all metadata may fit into memory.  If some metadata is part of the working set but not in memory, the array must first fetch it from media, then use it to figure out the next step, and finally perform the client data operation.  This adds latency to operations and creates inconsistencies in response times.  It also causes multiple accesses to the back-end array (the flash or SSDs) for every host I/O operation.  Under load, this added latency and extra back-end load can push systems beyond the acceptable response time, or lower their overall performance.  In addition, fetching new metadata bumps some other metadata out of memory.  In the worst case, the working set is larger than available memory and the array is constantly churning data and metadata through memory, issuing double the I/Os: one for metadata and another for user data.
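The double-I/O effect can be sketched with a toy latency model. The numbers below are made up for illustration, not measurements of any real array:

```python
# Illustrative latency model (made-up numbers) showing why a metadata
# miss roughly doubles the back-end work per host I/O.
DRAM_LOOKUP_US = 0.1   # metadata hit in controller RAM
SSD_READ_US = 100.0    # one back-end flash access

def host_read_latency_us(metadata_cached: bool) -> float:
    # Step 1: resolve the logical address (RAM hit or extra SSD read).
    metadata_cost = DRAM_LOOKUP_US if metadata_cached else SSD_READ_US
    # Step 2: fetch the user data itself from flash.
    return metadata_cost + SSD_READ_US

hit = host_read_latency_us(True)    # ~one back-end access
miss = host_read_latency_us(False)  # ~two back-end accesses
print(hit, miss)
```

With these assumed numbers the miss path takes roughly twice as long and consumes twice the back-end bandwidth, which is exactly the churn described above.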

These effects are often masked by typical lab or proof-of-concept testing because working data sets are small.  In other words, if you have 20TB of flash and create 20 x 100GB LUNs to test the array, you’re only hitting 2TB, or 10% of the array’s addressable capacity.  This should easily allow metadata to be cached in the controllers.  But when the array is in production and is running 80% full, with a blended workload from many servers and applications, and little predictability in the I/O pattern (poor locality of reference) – exactly the kinds of conditions for which you invest in a flash array – the controllers will no longer be able to cache the metadata required for fast performance.  This has noticeable and sometimes very significant performance effects.

The problem is exacerbated with dual-controller designs.  Adding more flash capacity without a corresponding increase in memory means that less of the metadata, on a percentage basis, can be held in the controller RAM.  If a 20TB dual-controller flash array that could cache 20% of its metadata is expanded to 40TB, it can only cache 10% of its metadata. At 80TB it can only cache 5% of its metadata.  The larger the array gets, the worse the performance gets.
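This arithmetic is easy to reproduce. The RAM and metadata-per-TB figures below are assumptions chosen purely so the output matches the 20%/10%/5% example; real systems will have different absolute numbers but the same trend:

```python
# Fixed controller RAM vs. growing flash capacity (illustrative figures).
RAM_METADATA_GB = 64  # assumed metadata cache in a dual-controller pair

def cacheable_fraction(flash_tb, metadata_gb_per_tb=16):
    # Assume metadata grows linearly with usable capacity.
    total_metadata_gb = flash_tb * metadata_gb_per_tb
    return RAM_METADATA_GB / total_metadata_gb

for tb in (20, 40, 80):
    print(tb, f"{cacheable_fraction(tb):.0%}")
# 20 TB -> 20%, 40 TB -> 10%, 80 TB -> 5%
```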

EMC XtremIO is architected so that all metadata is always in memory.  This is possible because of three important aspects of the design:

  1. XtremIO is a true scale-out system. Larger capacity arrays also have more controllers, which in turn have more RAM to hold more metadata.
  2. XtremIO’s scale-out is based on N-way active controllers, networked together using Remote Direct Memory Access. Not only do we have the RAM to hold all metadata in memory, but that memory is aggregated and shared among all the controllers in the cluster. Any controller can utilize its own memory or the memory on another controller at incredible speeds. Note that having Infiniband (or any other high bandwidth, low latency network) is necessary but not sufficient for the lowest latencies. Actually implementing RDMA semantics is required to take full advantage of Infiniband capabilities. With XtremIO, we fully leverage RDMA, and Infiniband is not just a fast interconnect.
  3. XtremIO’s brilliant engineers spent a lot of time and effort designing hyper-efficient metadata structures that allow us to keep mounds of granular metadata (for every 4KB stored in the array) in a minimal RAM footprint. This is one of the many patent-pending technologies in the array. Some may think what we’ve done is impossible because they haven’t figured out how to do it.
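The scale-out idea in points 1 and 2 can be sketched as a cluster whose metadata is sharded across the RAM of every controller, so any controller can resolve any address with a single (possibly remote) memory lookup. The hash-based placement below is an illustrative stand-in, not the real system’s placement scheme:

```python
# Sketch: metadata sharded across the aggregated RAM of all controllers.
class Controller:
    def __init__(self, cid):
        self.cid = cid
        self.metadata = {}  # this controller's RAM-resident shard

class Cluster:
    def __init__(self, n_controllers):
        self.controllers = [Controller(i) for i in range(n_controllers)]

    def _owner(self, key):
        # Deterministic placement: every controller can compute the
        # owner and read that memory directly, much as RDMA allows a
        # remote read without involving the peer's CPU.
        return self.controllers[hash(key) % len(self.controllers)]

    def put(self, key, value):
        self._owner(key).metadata[key] = value

    def get(self, key):
        return self._owner(key).metadata.get(key)

cluster = Cluster(n_controllers=4)
cluster.put(("lun7", 1024), ("ssd3", 55))
assert cluster.get(("lun7", 1024)) == ("ssd3", 55)
```

Note how adding controllers adds both shards and RAM, so the cacheable fraction stays at 100% as capacity grows, rather than shrinking as in the dual-controller case.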

With this unique architecture, we never introduce extra metadata latency because of access patterns. This is the advantage of having 100% in-memory metadata and why we’ve been making a big deal about our architecture. If you care about consistent, predictable performance, you’ll think this is a big deal too.

Now let’s take a look at some of the common areas of confusion that have been floating around since our launch…

Perhaps the biggest misconception is that because all our metadata is in memory we don’t also have a copy of it safely stored on SSD. This misconception implies that we only bother to dump metadata to SSD on shutdown and power loss. Seriously?! EMC has been the leading provider of enterprise storage systems for decades and knows about keeping data safe.

Every metadata update made on an XtremIO controller is immediately journaled over RDMA to other controllers in the cluster. These journals are persisted to SSD using an efficient write amortization scheme that coalesces and batches updates to more efficiently use the flash media and avoid write amplification.  Metadata is protected on flash using XDP (XtremIO Data Protection) and other techniques.  This is ultra safe and tolerates any type of failure, not just power outages.
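A write-amortization journal of the kind described above can be sketched as follows. The batching policy and names are illustrative assumptions, not XtremIO’s implementation (and in the real system the journal is also mirrored over RDMA before it is flushed):

```python
# Illustrative write-amortization sketch: metadata updates accumulate
# in an in-memory journal and are flushed to SSD in coalesced batches,
# rather than one small flash write per update.
class MetadataJournal:
    def __init__(self, batch_size, flush_to_ssd):
        self.batch_size = batch_size
        self.flush_to_ssd = flush_to_ssd  # callback persisting a batch
        self.pending = {}  # key -> latest value

    def record(self, key, value):
        # Re-updates of the same key coalesce: only the newest value
        # is written to flash, which reduces write amplification.
        self.pending[key] = value
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_to_ssd(dict(self.pending))
            self.pending.clear()

batches = []
journal = MetadataJournal(batch_size=3, flush_to_ssd=batches.append)
for i in range(5):
    journal.record(("lun", i % 2), i)  # two keys updated repeatedly
journal.flush()
# Five updates collapse into a single two-entry batch on flash.
```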

Which is a good segue to another question some have raised about the need for battery backup units instead of NV-RAM.  About this we have two things to say.  First, battery backup has been used in enterprise storage systems for decades, so we’re not charting any unknown territory here.  Second, while NV-RAM has its merits, there is no NV-RAM on the market that can keep up with the performance of XtremIO. We certainly could have placed an NV-RAM card in our controllers and built a 50,000 IOPS system, but that would be way off the mark in performance.  Some solutions use mirrored SLC SSDs as their NV-RAM device.  Performance is thus bottlenecked by how fast the SLC device can go.  No matter how large the array grows, performance is always choked at that single point in the architecture.  With that approach, battery backup isn’t needed, but you don’t get the flash performance you paid for.  Sometimes super-capacitors in the power supplies are used, which is nifty – but this is only feasible when there is no metadata to protect.  Arrays that use this technique have no thin provisioning, no deduplication, no magic VM cloning – nothing.  They are just a box of flash.

Now, back to In-Memory Metadata on XtremIO.  The way XtremIO works is no different from how other modern arrays handle metadata updates.  Where we’re different is that we only have to read metadata from SSD once: when the storage controller starts.  This means all metadata operations are handled in memory and never require additional round trips to the SSDs.  Fetching metadata from SSD impacts both read and write operations: even when metadata is being updated, it usually must be read first.  Having all XtremIO metadata in memory gives us incredible VMware VAAI performance since all client operations just access memory before being acknowledged.  The result is consistent, predictable performance across all workloads – something that can’t be matched by architectures that need to read metadata from flash as part of client requests.  As more applications are consolidated and virtualized and the “I/O Blender” effect intensifies, this benefit will continue to grow in importance.

Another thing to consider is how the XtremIO journals are implemented.  We can complete mirrored journal writes very quickly thanks to fast DRAM access times over our low latency RDMA cluster fabric.  We considered other options like using NV-RAM (see above), dedicated RAM devices, and even a “faster” class of SSDs (see above) as the journal device.  And we rejected them all.  They introduced too much latency, complexity (risk for bugs), or bottlenecks.  All of these may have been acceptable choices when dealing with spinning disks or even hybrid storage where media access times are measured in multiple milliseconds.  For All-Flash Arrays, keeping as much of the activity in main system memory is the best way to achieve consistent, predictable performance and low latency.

Another misconception is that we use local SSDs in the X-Brick storage controllers to hold our metadata.  That’s mostly incorrect. The persistent storage medium for our metadata is the bank of 25 SSDs in each X-Brick DAE (SSD shelf). We distribute and protect the metadata across all those SSDs using XDP for precisely the same reason that we store user data there: speed and reliability.

In normal operations we read all the metadata only once from the DAE at boot time.  After that, we reliably synchronize changes back to the DAE after first journaling and mirroring the changes across controllers using RDMA.  When the cluster is stopped, it pushes out any final updates to the DAE so that everything will be there when the system is started again.  It’s only in the case of an emergency shutdown that local drives in the storage controllers may be used to persistently save any unsynchronized metadata updates.  Even then, only certain failure scenarios prevent the metadata from being saved to the DAE.  We’ve carefully considered all the various ways that storage systems can fail, and architected XtremIO to provide the performance and reliability EMC customers have come to expect.

About the Author: Dell Technologies