Beyond Robust: Resilient IT (Part 1)

IT is rapidly becoming an essential resource for corporations and societies. Groundbreaking legislation has recently been introduced in Asia Pacific that requires companies in some industries to ensure that their IT systems function under stress. To ensure compliance, governments mandate that simulations and tests be carried out against the IT infrastructure and that the findings be audited and reported.

Working with several industry leaders and governments, I have studied this problem both to help develop the rules for this new mandate and to help customers and partners prepare their IT environments for the new requirements.

Failure that results in end-user impact is rarely caused by a failed component alone. The root causes are more likely to be found in unexpected interactions between components during a failure event.

For example, a bank was running an older version of firmware in one of its IT systems. An operator received a warning that the network connection to one of the IT components was experiencing intermittent disconnections and decided to replace a cable. The operator followed the procedure for replacing the cable, but made two mistakes. First, the operator used the wrong cable. Second, the procedure was correct for the most recent version of the system's firmware but incorrect for the version the bank was actually running. The procedural error became fatal when combined with the wrong cable. Subsequent events caused a cascading chain of failures that ultimately took down the entire system. The natural response would have been to activate the business continuity plan. However, the bank had not exercised the plan for quite some time, and given the state of the system, it feared that activating the plan would fail to contain the problem and would let it cascade further. With the introduction of a single random event, an architecture that was meant to be robust had become weak.

So how can IT protect itself from these types of failures? Historically, and perhaps still typically, IT professionals have relied on staged and infrequent resiliency tests, such as starting up backup systems. I suggest that we need to evolve our thinking. We should stop treating IT as a static system that, once deployed, must be changed as little as possible, and begin to recognize that modern IT systems more closely resemble an organic, evolving process in which we emphasize flexibility and adaptability over rigidity and control. In this new world of IT, we harden systems by deliberately and consciously exposing them to ongoing degrees of stress and pressure. This lets a system evolve from being static and robust into one that benefits from pressure and random failure, where each failure is itself a benefit because it pushes the system beyond robustness.
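To make the idea of ongoing, deliberate stress concrete, here is a minimal sketch in Python of a recurring failure drill. It is an illustration under simplifying assumptions, not a description of any real product or of the tests a regulator prescribes: the names Replica, serve, and run_drill are hypothetical, and a "failure" is modeled as nothing more than a flag that makes a replica refuse requests. Each round switches off one randomly chosen replica, checks whether a request can still be served, and then restores the replica.

```python
import random


class Replica:
    """A hypothetical service replica that a drill can switch off."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def serve(replicas, request):
    """Try each replica in order and return the first successful answer."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue  # fall through to the next replica
    raise RuntimeError("all replicas failed")


def run_drill(replicas, rounds, rng):
    """Disable one random replica per round and record whether service survives."""
    results = []
    for _ in range(rounds):
        victim = rng.choice(replicas)
        victim.healthy = False  # inject the failure
        try:
            serve(replicas, "ping")
            survived = True
        except RuntimeError:
            survived = False
        finally:
            victim.healthy = True  # always restore the replica after the round
        results.append((victim.name, survived))
    return results


# A seeded generator keeps the drill repeatable, which helps when findings are audited.
replicas = [Replica("primary"), Replica("backup")]
results = run_drill(replicas, rounds=5, rng=random.Random(0))
for name, survived in results:
    print(f"disabled {name}: service {'survived' if survived else 'FAILED'}")
```

With two replicas, every round should report that service survived. Running the same drill against a single replica reports a failure on every round, which is exactly the kind of hidden weakness a frequent drill is meant to surface before a real incident does.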

The static IT systems of thirty years ago are a thing of the past, and the flexible, software-driven IT infrastructure of today needs pressure and stress to uncover the hidden risks posed by random events. Just as the human body grows stronger when exercise stresses the muscles and causes micro-tears, we have found that IT processes and systems benefit from the same principle.

About the Author: Par Botes