The Heart of Your DRBC Plan -- Microsoft Certified Professional Magazine Online

The Heart of Your DRBC Plan

Second in a series: Let's look beyond software and hardware issues at the elements to factor into your business continuity plan. Also, what makes a site warm or hot?

By Bill Heldman
05/02/2006

Hurricane Katrina has caused a lot of people, whether or not they live in New Orleans, to start thinking much more seriously about DRBC than they did before the disaster. Consider the scenario for a moment, putting yourself in the place of Greg Meffert, the city's technology chief. You've got myriad applications and systems to worry about:

Building permits
Police, fire, sheriff and first responder record tracking
Revenue databases
Airport systems
Water and wastewater systems
Parks applications
Environmental health
And others.

Additionally, think of all the servers, network equipment, PCs, printers, telephony systems and other specialized gear the city used on a daily basis…lots of it underwater. Filthy, greasy, sewage-infested water.
Now, donning your Monday-morning quarterback helmet for a minute, try to imagine the things you'd do differently, given a recurring scenario.

What are the systems most vital to a city? Having worked for a large city before, I can tell you anything having to do with revenue generation is right up near the top. (Of course, that's the case in any organization, isn't it?) So you would expect taxation, permitting, and other systems actually bringing in money to be the ones you'd make your highest priority in a DRBC scenario.

But what about the airport? Human services (food stamps, welfare, and housing)? Parks? Was there anyone using the parks immediately after the storm?

How about the emergency services folks like the police, fire, sheriff, ambulance drivers and paramedics? They have radio systems, wireless networks, record-tracking databases, jail systems and a host of other requirements. How about the coroner and environmental protection services? Not to be too grotesque, but from a practical standpoint, someone has to do something with those who perished.

Further, what would have happened had a bunch of city workers been killed? Suppose, for example, you lost a core building housing numerous city workers. Some sort of event killed a couple thousand while they were working away, a transformer explosion or building collapse maybe. What kind of issues would city leaders face in finding people who were savvy about various in-place, highly esoteric production systems? Imagine trying to bring in temp workers to labor on a complex, custom-built revenue and taxation system. Would the tax bills go out accurately and on time?

Doubtful.

This is the very heart of DRBC. And, like the heart, it has two segments to it:

What are the systems that we, without exception, must protect, else our company will go out of business? (Or, our government entity will cease to function, at least for a prolonged period.)
Provided we are able to effectively duplicate those systems, but their normal users are no longer available, how do we come back with at least a minimum of human functionality?

And, there's a third important element that affects both of these questions: How quickly can we recover, given a catastrophic event? The speed element of business continuity is something we'll cover in the next edition of this series.

Laissez-faire Business Continuity

Maybe for you we've entered the "Why do I care?" zone. After all, devastating hurricanes don't happen often, do they? And it was decades ago that Mount Saint Helens erupted in Washington State, making the entire mountain look like something a five-year-old might shape out of a mud puddle.

But consider: On the tail of Katrina, recent devastating wildfires in Texas have led Governor Rick Perry to beg the federal government for more money to manage the devastation Texas has experienced, not just from the fires, but also from Katrina. Alabama and Mississippi are well behind New Orleans in recovering from the same hurricane, and the new hurricane season is just getting started. Mountainous snow storms occurred in the Northeast this year. Floods, wildfires, and landslides have had effects all up and down California. Heat so hot in Phoenix Satan checked out early. Not to mention in just one day in March of this year, the U.S. set a record for the number of tornadoes moving across southern states. One hundred tornadoes, numerous dead, all within a few terrifying early morning hours.

And we're all tired of the constant onslaught of terrorist warnings. It seems as though people are either so jaded they don't want to think about it, or they're resigned to the idea that the other shoe is about to drop at any time. In either case, the result is inaction.

The point is this: Even though you may think nothing will happen, there are fairly substantial odds something will happen. Would you rather have taken the time to think through the possibilities, proactively planning your reaction to them, or would you prefer to be standing by the side of your boss saying "I didn't think something like this would ever happen!"?

It's Risky
In both project management and risk management circles we use a technique called Risk Identification, Response and Mitigation. Let's talk through each of these elements, as they have practical import in our DRBC work.

Risk Identification/Prevention
The idea in risk identification is to get in front of a whiteboard with other decision-makers and stakeholders to try to imagine everything that could possibly happen. You want as comprehensive a list as possible. While going through your whiteboard exercise nothing is too silly -- you can always erase the silly stuff (though our friend Murphy has a way of making the silly stuff turn into the risk that actually happens). Your primary goal is to find ways to prevent risks from occurring in the first place.

The importance of a Risk Identification team cannot be overstated. You cannot possibly think of every possible risk by yourself. The higher up you go in the corporate food chain for input, the clearer the scope of your efforts will become. Risk identification is an exercise requiring input from a lot of stakeholders to gain a clear picture.

Next you and the team prioritize these risks in order of the likelihood you think they might occur. For example, maybe you'll give each risk a rating between 1 and 100. The more likely something is to occur, the higher the rating.

Table 1. Risks should be assessed based on their possibility of occurring in your specific region. Earthquake might get a higher rating if you're in California, for example, than flooding, especially if your company is situated far from dams, the coast and low lying valleys.

Risk	Possibility of Occurrence (rating, 1-100)
Fire	90
Flood	80
Earthquake	70
Landslide	60
...	...
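As a rough sketch of the prioritization step, the ratings can be kept in a simple register and sorted so the most likely risks come first. The dictionary and the `prioritize` function below are hypothetical names for illustration only, and the ratings are just the example values from Table 1; your own team's numbers will differ by region.

```python
# Hypothetical risk register: example ratings from Table 1 (1-100 scale),
# entered in no particular order, as they might come off a whiteboard.
risk_ratings = {
    "Landslide": 60,
    "Fire": 90,
    "Earthquake": 70,
    "Flood": 80,
}


def prioritize(ratings):
    """Return (risk, rating) pairs sorted from most to least likely."""
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


ranked = prioritize(risk_ratings)

# Print the prioritized list, highest rating first.
for risk, rating in ranked:
    print(f"{risk:<12}{rating:>3}")
```

Keeping the register in one place also makes it easy to re-rate and re-sort when the team revisits the list.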

Risk Response
Once done with risk identification, you and your team will hammer out responses to each risk. What do we mean by response? It depends on the risk, of course, but you want to come up with a playbook for each of the risks identified. You want to know exactly how you'll respond, and what each involved party will be doing during the response. Of course, once written up and published, you'll give a copy of these risk responses to everyone involved, along with some training to help each one understand their place should a risk arise. And, you'll practice your risk responses at least once a year.

One tuning element for Risk Response is setting up a practice mechanism where you "tabletop" a risk. Remember in the last article we said a tabletop is a practice exercise where everyone plays as though the event has occurred. The idea behind tabletops isn't to see how fit someone is, or even how ready you are for an event to occur: It is to make everyone think about those things that are needed but were not initially included. You're looking for the Ah-hah factor in your tabletops.

Noteworthy: Risk Response implies buy-in from those above you. It is critical that DRBC planning include sponsorship, buy-in and participation from the top (or as close to the top as you can get). Also, it should be pointed out that leaders need to be ready to put money behind risk mitigation strategies. For example, if you identify a fire risk, your leadership needs to commit the bucks needed to prevent the occurrence from happening. In a case like this, risk mitigation doesn't say "Buy a better extinguisher", it says "Do something about this before anything catches fire." This is a subtle differentiation, one that you should pay attention to with every risk.

Risk Mitigation/Remediation
Finally, should a risk rear its ugly head, you pull out the risk response book and go through the mechanisms to mitigate the risk (i.e., reduce its duration and severity) and remediate the damage done.

Continuity of People Skills

When I was a senior manager working on a DRBC plan at a former place of employment, we were squarely hit with the "lost the people" phenomenon. If we lost just two people who were responsible for the day-to-day financial calculations for the organization, we found that there was no one in a building of 1,200 people who could successfully sit down and re-create the business process these two had done so much they could practically do it in their sleep.

After some research into the issue, we found a couple of people who had previously worked in the position years ago, and who retained some faint recollection about the process. In a pinch we felt we could put together enough of a team to re-create the business process at a hot site -- never mind that it was a lame and feeble re-creation.

We ran into issues with their current supervisors. They did not want their employees taking valuable time off their current job in order to retrain on activities that had only a minuscule chance of being required. While everyone totally understood the hesitancy, we also understood the urgency of the need.

Finalizing the process
Once this process has been accomplished and the risk response books are published (expect this to take days to weeks), you can't just let everything go, never revisiting the process again. Successful risk response teams revisit things annually, updating and modifying as required to make the risk response mechanisms sensible for the current environment.

Warm Site, Hot Site
All this discussion of risk and response and who should be in place leads us to the meat of this article, and back around to our original two questions: What systems, which people?

Once you've identified all mission-critical systems in the enterprise, you must figure out a way to duplicate those systems elsewhere. Recall that in my previous article I said a warm site is simply the data replicated elsewhere for easy retrieval. It's warm because we still have to do something to be able to access the data -- whether "do something" means hiring contractors to set up systems so we can access it, packing equipment over to the warm site to retrieve the data, or some other mechanism.

The point is this: A warm site implies we're half-way there, not all the way. A warm site in no way fully protects your organization! It's like buying a $10,000 term life insurance policy, when you need $500,000 to pay off your house plus make sure your family can continue to live a normal lifestyle in the event of your death.

The Gear
A hot site, on the other hand, is a site that is completely ready to go. The data is being replicated over to this site on a routine basis (timing is everything, as we'll discuss momentarily). But also, computers similar to the ones used in production are set up at the hot site. They have the required applications installed on them, they have been connected to the data, and their functionality has been tested. In the event of a catastrophe, all people need to do is drive to the site, log on and begin working.

This sounds easy of course, but it is incredibly difficult to accomplish. There are numerous considerations:

Will you outsource the hot-site hosting? There are a lot of great providers who can deliver this service to you, for a fee. They will likely not provide similar computers for you. While they may provide space for you on a SAN for your data requirements, you are responsible for providing the equipment to connect data and apps.
Will you develop a hot site at another place in your corporate environment, or with a corporate partner? In this case, how can you be sure that the disk space, servers and computers required will be adequate?
The "similar computers" you provide at your hot site may not be as robust as the computers your users currently have.
Conversely, the host (whether servers or SAN) for your data may not be as robust.

Either case (non-robust client computers, or non-robust servers or data repositories) means reduced computing cycles, less throughput and an overall reduction in business performance.

Most importantly, all of this effort comes at a cost…potentially a tremendous cost. So-called "C-band" folks (CFOs, CIOs) may not be inclined to be hugely generous in developing hot-site deployments.

Noteworthy: The interval between replications to the hot site closely corresponds to the number of hours (or minutes, or seconds) that your company is behind the eight-ball, in terms of catching back up after a disaster. The more often you replicate, the less you have to catch up. If you have your SAN set to copy a snapshot of the transaction deltas over to the hot site once an hour, and workers commit 1,000 transactions every hour, then when disaster strikes and you bring up the hot site, you may be as many as 1,000 transactions behind. Someone will need to somehow re-key or re-enter those transactions into the system.
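The arithmetic behind that note can be sketched in a few lines. This is a back-of-the-envelope estimate, not a sizing tool: it assumes transactions arrive at a steady rate and that disaster strikes just before the next snapshot would have been copied. The function name and parameters are hypothetical.

```python
def worst_case_backlog(transactions_per_hour, replication_interval_minutes):
    """Estimate transactions committed since the last snapshot, worst case.

    Assumes a steady transaction rate and a disaster that hits just
    before the next scheduled replication to the hot site.
    """
    return transactions_per_hour * replication_interval_minutes / 60


# The article's example: 1,000 transactions an hour, hourly snapshots.
hourly = worst_case_backlog(1000, 60)

# Same workload, replicating every 15 minutes instead.
quarter_hourly = worst_case_backlog(1000, 15)

print(f"Hourly snapshots:    up to {hourly:.0f} transactions to re-enter")
print(f"15-minute snapshots: up to {quarter_hourly:.0f} transactions to re-enter")
```

Shortening the interval shrinks the re-keying burden proportionally, though it generally means more replication traffic to pay for.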

The Folks
In a famous Sidney Harris cartoon two scientists are discussing some complex formulas written on the chalkboard. In the middle there is some writing: "Then a miracle occurs". The senior scientist says to the junior: "I think you should be more explicit here in step two." Nowhere is the A Miracle Happens Here expectation more prevalent than in DRBC.

Even though you have a hot site, if the people who had the technical expertise to understand and run the applications are gone, you haven't accomplished a complete DRBC plan. Businesses can't rely on "click-experimentation" (also known as the "by gosh and by golly" method) -- that is, novice application users clicking through various menu options in an effort to figure out how the application works -- for day-to-day operations. You have to have people who can sit down and "drive" the application effectively so the business can rebound from the catastrophe.

You can outsource SQL Server DBAs, but you cannot easily outsource customized enterprise application expertise.

In the next article in this series, we'll explore the development of a hot-site even further, as there are far-reaching ramifications revolving around its development. "Hot site" is easy to say, but quite difficult to develop.
