Beyond Robust: Resilient IT (Part 3)

If you have followed my series of posts on how to manage IT systems systematically so that IT benefits from failures, you will find that these approaches are similar to the ideas that Nassim Nicholas Taleb explores in his book, Antifragile: Things That Gain from Disorder. His work applies to IT professionals who want a deeper understanding of how complex, interconnected systems benefit from pressure. I fully acknowledge his ideas and how they relate to the ideas I post here. Using these concepts in IT is part of a journey that began a few years ago in some of the leading IT technology companies; the idea of antifragility (a term Dr. Taleb coined) appears to have evolved naturally in IT, in parallel with his work.

In Part 1 and Part 2, I introduced the concept of robustness and suggested a new IT management philosophy for moving IT beyond robustness to a system that benefits from failure. I presented approaches that evolve the traditional thinking of scripted, top-down testing and change management to create robustness, and complement it with the injection of randomness and failure – perhaps best described as introducing managed chaos into the production IT system.

However, the world of cloud introduces new dimensions that must be considered in the quest to build systems that benefit from failure. Introducing IT management concepts based on reducing fragility is arguably even more important in cloud environments because of the significant abstraction through virtualization that clouds typically rest on. The opacity of underlying IT resources and the separation of control between the provider and the cloud consumer can add fragility, as underlying complexities can cause unplanned outages during state or configuration changes.

Don’t rely solely on an SLA to manage risk

The qualities of public cloud offerings that most appeal to IT leaders are scalability, capital efficiency and rapid deployment delivered as a service, often combined with a penalty if the agreed-upon service level (SLA) is not met. Some IT leaders suggest that having the penalty in place provides enough confidence in the technologies the cloud provider uses. I suggest that IT leaders pay close attention to the architecture so that they know under which circumstances IT services will not be delivered and what inter-dependencies exist on vendors across the stack, such as infrastructure, applications, networks and end-user devices. The assumption I caution IT leaders to avoid is that the environment is protected from risk simply because an SLA is in effect. Ensuring transparency to the cloud consumer on changes of material interest, combined with some ability to request information on resiliency, is important so that decisions that impact end-users are made with transparency and controls.

However, dialogue alone will not uncover gaps in understanding. IT leaders should consider introducing deliberate failures and stress into the environment to verify that it can withstand them. This is particularly important in a provider/consumer relationship like the public cloud, where the IT stack now has multiple spans of control, and the methods for introducing stress differ across those spans. It is important to realize that even the most integrated cloud offerings are built from many interconnected processes and technologies, and failures typically happen at the intersection of technologies, processes and humans.
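To make this concrete, here is a minimal sketch (in Python) of one way to inject faults at such an intersection point – in this case, the boundary between the consumer's code and a provider-hosted service. The call_backend helper, failure rate and delay values are hypothetical placeholders chosen for illustration, not a prescription for any particular provider's API.

```python
import random
import time

# A minimal sketch of fault injection at a service boundary.
# "call_backend" stands in for any call that crosses a span of control,
# e.g. a request to a provider-hosted API; the name is hypothetical.

FAILURE_RATE = 0.05   # inject a fault in roughly 5% of calls
MAX_DELAY_S = 2.0     # worst-case artificial latency in seconds

def call_backend(request):
    # Placeholder for the real provider call.
    return {"status": "ok", "request": request}

def call_with_chaos(request, enabled=True):
    """Wrap a cross-boundary call with random latency and errors."""
    if enabled and random.random() < FAILURE_RATE:
        # Simulate the two failure modes seen most often at intersections:
        # slow responses and outright errors.
        if random.random() < 0.5:
            time.sleep(random.uniform(0.1, MAX_DELAY_S))
        else:
            raise ConnectionError("injected fault: simulated provider outage")
    return call_backend(request)
```

The point of a wrapper like this is that the injected stress lives exactly where the spans of control meet, so both the application's handling and the provider's behavior are exercised together.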

As discussed, public cloud architectures typically span multiple areas of control across multiple providers. Some components of the IT stack are delivered by one or more providers, while others may be provided by the organization's in-house IT staff. It is not uncommon to find that some back-end applications are delivered by a provider as a service while in-house IT is in charge of connecting, implementing and delivering end-point devices such as PCs, tablets and smart phones. Intermixed across this stack are multiple levels of security methods, access-point controls, networking services, business continuity strategies, recovery and audit systems, business-to-business communication, and compliance and regulatory requirements that may be specific to each company and its industry. This is further complicated by modern cloud architectures that run applications segmented from each other to gain isolation. These environments use advanced virtualization technologies built on the principle that efficiency and scale are gained by sharing systems between applications while hiding system state within each technology layer, allowing the provider to emphasize flexibility and malleability.

This emphasis on flexibility and malleability through advanced sharing is the very reason why random stress testing and fault injection across intersection points is a great idea for the public cloud! Some of the largest consumers of public clouds have implemented technologies that deliberately cause failures at various intersection points, and many have even open-sourced their approaches for the broader IT community.
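Perhaps the best known of these open-sourced approaches is Netflix's Chaos Monkey, which randomly terminates instances so that engineers are forced to build services that tolerate the loss. The sketch below illustrates that pattern in a deliberately simplified form; the list_instances and terminate_instance helpers are hypothetical stand-ins for whatever inventory and control APIs your provider or tooling exposes, and the dry-run flag keeps the exercise safe until you decide otherwise.

```python
import random
import datetime

# A minimal sketch of Chaos Monkey-style random instance termination.
# "list_instances" and "terminate_instance" are hypothetical stand-ins
# for the real inventory and control APIs of a given environment.

def list_instances(group):
    # Placeholder: return instance identifiers for an application group.
    return [f"{group}-node-{i}" for i in range(1, 6)]

def terminate_instance(instance_id, dry_run=True):
    if dry_run:
        print(f"[dry run] would terminate {instance_id}")
    else:
        print(f"terminating {instance_id}")  # call the real control API here

def run_chaos_round(group, probability=0.2, dry_run=True):
    """Terminate at most one randomly chosen instance, and only during
    working hours so engineers are present to observe and respond."""
    now = datetime.datetime.now()
    if not (9 <= now.hour < 16 and now.weekday() < 5):
        return
    if random.random() < probability:
        victim = random.choice(list_instances(group))
        terminate_instance(victim, dry_run=dry_run)

run_chaos_round("web-tier")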

The motivation for an IT executive to introduce stress in a public cloud setting with SLA guarantees is that, for most businesses, the SLA penalty is worth far less than the service the IT systems deliver. Typically, a penalty serves as a motivator for the provider to reduce fragility and risk. The intersection between the services a cloud provider offers and how the users of a business access, interact with and connect to IT is a new frontier.

Be aware of all public cloud control points

The key insight about public cloud vs. private cloud lies in the separation of spans of control and the technologies that have been introduced to abstract underlying resources from the consumer of cloud capabilities. Within these technologies and management constructs in the public cloud, thinking must evolve so that fragility is managed higher up in the IT stack than before.

In the past, IT executives could rely on the network routing around problems, but they may no longer have visibility into network state; as a result, they should introduce state changes high up near the application itself to assess whether the infrastructure reacts as intended. For some applications this has led to re-design, as legacy assumptions about how the infrastructure behaves during failure no longer hold true. For instance, some public cloud vendors advocate that applications should design in failure management capabilities. Embedding state transfer capabilities into the application itself differs from private cloud environments, where many applications had clustering capabilities provided by the infrastructure. The basic idea of the capability remains the same, but the control point changes: in the public cloud, the infrastructure offers a set of controls that the application has to integrate against, whereas in the private environment the infrastructure is typically modeled to suit the application and its requirements. This aspect of public cloud relates not only to risk but also illustrates the challenge of assessing the cost of implementing the capability at a defined risk; the intersection points and dependencies that require changes differ significantly across the different types of public cloud providers.
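As an illustration of what designing failure management into the application can look like, here is a minimal sketch of a retry-and-failover pattern across redundant endpoints. The endpoint URLs and the fetch helper are hypothetical placeholders; a real implementation would use the provider's recommended client libraries and controls.

```python
import random
import time

# A minimal sketch of failure handling embedded in the application itself,
# rather than provided by infrastructure clustering. The endpoint URLs and
# the "fetch" helper are hypothetical placeholders.

ENDPOINTS = [
    "https://api.region-a.example.com/orders",
    "https://api.region-b.example.com/orders",
]

def fetch(url, payload):
    # Placeholder for the real HTTP call; raises on failure.
    if random.random() < 0.3:
        raise ConnectionError(f"simulated failure calling {url}")
    return {"url": url, "payload": payload, "status": "ok"}

def resilient_call(payload, retries=3, base_delay=0.5):
    """Try each redundant endpoint, backing off between rounds, before giving up."""
    last_error = None
    for attempt in range(retries):
        for url in ENDPOINTS:
            try:
                return fetch(url, payload)
            except ConnectionError as err:
                last_error = err
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff between rounds
    raise RuntimeError("all endpoints failed") from last_error
```

The design choice worth noting is that the retry, failover and backoff logic now lives in application code the consumer controls, which is exactly the shift in control point described above.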

IT executives should apply the principle of moving IT beyond robustness and evolving IT into a function that benefits from failure by introducing stress systematically, deliberately and with some randomness. This idea applies just as much to the in-house data center as it does to the outsourced IT organization, where there may be even more benefit in ensuring that IT is resilient. The concept is not just relevant to the public cloud but may be essential there, as limited span of control and visibility, combined with invisible state changes and dependencies in the IT infrastructure and its connection points, can lead to new and unintended interactions that make robust systems fragile.

Modern systems are amazingly agile, scalable, resilient and efficient. However, the legacy thinking of pre-deployment testing combined with change management and scripted top-down failure testing is no longer suitable. The implication for technologists and IT leaders is to evolve their thinking on how to move systems beyond static robustness. A bit of chaos with deliberate failures is a good way to get started. Good luck in your journey!

About the Author: Par Botes