Beyond Robust: Resilient IT (Part 2)

In my first post on Resilient IT, I focused on why IT needs to inject random failures into its systems to make them stronger. Now I would like to discuss how to actually start doing it.

Modern IT systems are constantly changing, and these changes introduce risk that isn’t always obvious or visible to the person implementing them. Many management frameworks try to cope with change by introducing maturity models; however, maturity by itself is not enough.

So what do you do? I suggest that the IT organization stop assuming that a system, once tested and deployed, is robust. Our mission as IT leaders is to continuously evolve these systems beyond robustness and into systems that benefit from failure and harm. This principle allows IT to stay flexible and agile while remaining safe and available. For IT to deliver on the promise of agility combined with resiliency, we have to advance ongoing IT system management to treat IT as an organic, continuously evolving process. Our testing and validation practices should not be used only prior to deployment; those same principles must also extend into running operational processes and environments. We need to exert pressure on the system while it is changing, in real time, to deliberately expose holes, fragility and unexpected behavior.

Introducing the same errors and stress into the system over and over won’t be enough to make IT better. We also need to evolve our thinking about how failures occur. By introducing random events, big and small, we can test the interdependencies of systems and the people who operate them, and move beyond scripted, top-down fail-over tests.

I fully believe that our success in building agile and flexible IT system resilience will depend on how well we manage the deliberate introduction of stress into IT: injecting errors in unexpected places and with some degree of randomness. This allows IT to uncover unexpected and hidden dependencies in people, systems, processes and communication paths. Some leading companies are already doing this by promoting their top operations engineers into functions chartered with making the system benefit from failure. These teams are authorized to deliberately cause failures, injecting faults with some degree of randomness to observe impact and resiliency, with the objective of improving the system when it behaves in unexpected ways. These companies choose to do this against running systems, processes, communication paths and people to ensure that they learn from ongoing stress and expose fragility. These tests are always challenging, but they ultimately help build a culture of continuously self-improving resiliency and reliability.
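To make the idea concrete, here is a minimal sketch of one randomized fault-injection round, written in Python. The fault actions and their names are hypothetical placeholders, not any particular company's tooling; a real team would wire them to actual infrastructure (killing a process, dropping a network link, revoking access) and to real observation and alerting systems.

```python
import random

# Hypothetical fault actions: in a real environment these would call out to
# infrastructure. Here they only record what would have been done.
def kill_service(log):
    log.append("killed a randomly chosen service instance")

def partition_network(log):
    log.append("partitioned a network segment")

def exhaust_disk(log):
    log.append("filled a disk on a storage node")

FAULTS = [kill_service, partition_network, exhaust_disk]

def run_chaos_round(log, rng=random):
    """Pick one fault at random, apply it, then observe the running system."""
    fault = rng.choice(FAULTS)
    fault(log)
    log.append("observed system behavior and recorded any surprises")
    return log

if __name__ == "__main__":
    for entry in run_chaos_round([]):
        print(entry)
```

The key design point is the randomness: because neither operators nor the system know which fault will fire, the exercise tests real dependencies rather than a rehearsed script.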

As our IT systems grow ever more interconnected and complex, we tend to think of them as always working, and they almost always do. But when they do not, the consequences are immediate, direct and increasingly felt across society. The best way to ensure that we don’t build IT systems that degenerate from robustness into fragility is to assume that they are fragile in the first place and introduce deliberate stress against them. Many of the largest and best-known companies are already running this type of stress test, and given the role of IT in the modern company, I think any CIO should consider introducing some of these principles. For those considering implementing these philosophies, there are a few additional points to consider. First, use your senior operational staff to stress the system, and message that a failure is a success: after all, failures uncovered are no longer hidden, and once the system is adjusted to cope with them, it ultimately becomes better.

Secondly, introduce a thought process of randomness. Test what happens to systems, people and processes when a key network component fails and a phone switch is out, in combination with a simulated flood at the DR site. Perhaps test how staff cope with a SAN core array failure during a denial-of-service attack. How well does the system cope when building or room access to the company or the data center is also down?
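Compound scenarios like these can be generated rather than hand-picked. The sketch below, a small Python assumption of how such a generator might look, draws its failure pool from the examples in this post; the groupings and names are illustrative only, not a catalog of any real environment.

```python
import random

# Illustrative failure pool grouped by domain; names are assumptions drawn
# from the scenarios discussed above, not a real inventory.
FAILURE_POOL = {
    "network": ["core switch down", "phone switch out"],
    "facilities": ["DR site flooded", "data center badge access down"],
    "storage": ["SAN core array failure"],
    "security": ["denial-of-service attack in progress"],
}

def build_scenario(rng, domains=2):
    """Combine one fault from each of several randomly chosen domains."""
    chosen = rng.sample(sorted(FAILURE_POOL), k=domains)
    return [rng.choice(FAILURE_POOL[d]) for d in chosen]

if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(" + ".join(build_scenario(rng, domains=3)))
```

Sampling across domains is deliberate: it forces combinations (a storage failure during a security incident while facility access is down) that scripted, single-domain fail-over tests rarely cover.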

This exercise will be extremely painful the first few times it is executed, just as untrained muscles ache when tasked with lifting heavy weights. The good news is that the pain is temporary; once this process becomes routine, IT systems will benefit and evolve, and IT staff and their processes will become more resilient to fragility and hidden risk because of it.

About the Author: Par Botes