Apr 5, 2019

Business Continuity and Disaster Recovery Basics: Testing 101

Paul Painter, Director, Solutions Engineering

Disaster Recovery

“Luck is what happens when preparation meets opportunity.” – Seneca

As I covered in another blog post, the first step to any effective business continuity and disaster recovery program is crafting a thoughtful, achievable plan.

But having a great business continuity and disaster recovery plan on paper doesn’t mean that the work is done. After all, how do you evaluate the efficacy of your plan or make adjustments before you actually need it? The answer: by putting it to the test.

Disaster Recovery Plan Testing

I am fond of saying that managed services are a three-legged stool made up of technology, people and processes. If you lose any one leg, the stool falls over. And since an IT department is essentially offering managed services to the wider organization, IT management should think in terms of the same triad.

Let’s break it down:

Technology: the tool or set of tools to be used
People: trained, knowledgeable staff to operate the technology
Processes: the written instructions for the people to follow when operating the technology. (See another blog I wrote for more information: “6 Processes You Need to Mature Your Managed Services.”)

For a disaster recovery scenario, you need to test the stool to make sure that each leg is ready and that the people know what to do when the time comes. One useful tool for this is a tabletop exercise (TTX). The purpose of the TTX is simply to get people thinking about which technology they touch and which processes are already in place to support their tasks.

Tabletop Exercise Steps

Let’s walk through the stages of a typical TTX.

No. 1: Develop a Narrative

Write a quick narrative for the disaster. Start off assuming all your staff are available, and then work through threats that you may have already identified. Some examples:

Over the weekend, a train derailed, spilling hazardous materials. The fire department has evacuated an area that includes your headquarters, which contains important servers.
Just 10 minutes ago, your firm’s servers were all struck by a ransomware attack.
Heavy rains have occurred, and the server room in the basement is starting to flood.

Now, some questions and prompts for your staff:

What should we do?
How do we communicate during this?
How do we continue to support the business?
What are you doing? Show me! (Pointing isn’t usually polite, but this might be a time to do so.)
How do we communicate the event to clients, customers, users, etc.?

Going through the exercise, you’ll likely find that certain recovery processes are not properly documented or even completely missing. For example, your network administrator might not have a written recovery process. Have them and any other relevant staff produce and formalize the process, ready to be shared at the next TTX.

Continue this way for all the role-players until your team can successfully work through the scenario. You will want to thoroughly test people’s roles, whether in networking, operating systems, applications, end user access or any other area.

No. 2: Insert Some Realism

Unfortunately, we have all seen emergency situations, such as the 9/11 terrorist attacks, where key personnel are missing, incapacitated or even deceased. In less dire scenarios, some staff might not be able to attend to work because their home or family was affected by the disaster. For the purposes of a TTX, you can simply designate someone as being on vacation and unreachable, then have them sit out.

Ask:

Who picks up their duties?
Does the replacement know where to find the documentation?
Can the replacement read and understand the written documentation?
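
If you keep a roster of roles, you can ask the first two questions before the exercise even starts. Here is a minimal sketch in Python, assuming a simple hand-maintained roster; the `Role` fields, the names and the runbook locations are all hypothetical placeholders, not part of any particular tool. It cannot tell you whether the replacement understands the documentation, which is what the TTX itself is for.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str                                    # e.g. "Network"
    primary: str                                 # person who normally does the job
    backups: list = field(default_factory=list)  # people who can step in
    runbook: str = ""                            # where the written process lives; empty = none


def coverage_gaps(roles, unavailable):
    """Return (role name, problem) pairs for roles at risk when these people are out."""
    gaps = []
    for role in roles:
        if role.primary in unavailable:
            # Who picks up their duties?
            available = [p for p in role.backups if p not in unavailable]
            if not available:
                gaps.append((role.name, "no available backup"))
        if not role.runbook:
            # Does the replacement know where to find the documentation?
            gaps.append((role.name, "no written process"))
    return gaps


if __name__ == "__main__":
    # Hypothetical roster for illustration only.
    roster = [
        Role("Network", "Ana", ["Ben"], "wiki/network-recovery"),
        Role("Applications", "Ben", [], ""),
    ]
    # Pretend Ana is on vacation and unreachable.
    for role_name, problem in coverage_gaps(roster, {"Ana"}):
        print(f"{role_name}: {problem}")
```

Run it with a different set of "unavailable" people each time to see which absences leave a role uncovered.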

No. 3: “DIVE, DIVE, DIVE!”—Always Be Prepared

Just like a submarine commander might call a crash dive drill at the most inopportune time, call a TTX drill on your own team to test the plan. For this, someone might actually be on vacation. Use that to your advantage to make sure that the whole team knows how to step in and how to communicate throughout the drill. You might even plan the drill to coincide with a key player’s vacation for added realism.

No. 4: Break Away From the Table

Once you’ve executed your tabletop exercise, it’s time to do a real test! Have your team actually work through all of the steps of the process to fail over to the recovery site.

Again, you will want to test that the servers and applications can all be brought up in the recovery environment. To prevent data islands, make certain that users can successfully access your applications at the recovery site from wherever they would operate during a disaster. Here are some questions for user access testing:

Can users reach the replica site over the internet/VPN?
Can users use remote desktop protocol (RDP) to connect to servers in the replica environment?
If users in an office were displaced, could they reach the replica site from home using an SSL VPN?
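
A quick way to start answering these questions is a basic reachability check run from wherever users would actually be sitting, such as a home connection. The sketch below is one possible approach in Python, not a complete test: the hostnames are hypothetical placeholders, and the ports assume an SSL VPN gateway on TCP 443 and RDP on its default TCP 3389. A successful TCP connection only shows that the service is listening and reachable. It does not prove that a user can authenticate or open an application, so follow it with a real login.

```python
import socket

# Hypothetical endpoints; replace with your own recovery-site hosts and ports.
CHECKS = [
    ("SSL VPN gateway", "vpn.dr.example.com", 443),
    ("RDP to replica server", "app01.dr.example.com", 3389),
]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def run_checks(checks, probe=can_connect):
    """Probe each endpoint and return (label, host, port, reachable) tuples."""
    return [(label, host, port, probe(host, port)) for label, host, port in checks]


if __name__ == "__main__":
    for label, host, port, ok in run_checks(CHECKS):
        status = "OK  " if ok else "FAIL"
        print(f"{status} {label} ({host}:{port})")
```

Running the same script from the office, from home and over the VPN makes it easier to spot a path that only works from one location.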

No. 5: Bring in a Trusted Service Partner

The help you get from an IT service provider doesn’t have to stop with managing your Disaster Recovery as a Service (DRaaS) infrastructure or environment. With every INAP DRaaS solution, you get white glove onboarding and periodic testing to make sure that your plans are as robust as you need them to be. Between scheduled tests, you can also test your failover at will, taking your staff beyond tabletop exercises to evaluate their ability to recover the environment on their own. Staying prepared to handle disaster is a continuous process, and we can be there every step of the way to guide you through it.