In-Depth
The Heart of Your DRBC Plan
Second in a series: Let's look beyond software and hardware issues to examine what elements to factor into your business continuity plan. Also, what makes a warm site versus a hot site?
- By Bill Heldman
- 05/02/2006
Whether or not they live in New Orleans, a lot of people have started
thinking much more seriously about DRBC since Hurricane Katrina than they
did prior to the disaster. Consider the scenario for a moment, putting yourself
in the place of Greg Meffert, the city's technology chief. You've got
a myriad of applications and systems you need to be worried about:
- Building permits
- Police, fire, sheriff and first responder record tracking
- Revenue databases
- Airport systems
- Water and wastewater systems
- Parks applications
- Environmental health
- And others.
Additionally, think of all the servers, network equipment, PCs, printers,
telephony systems and other specialized gear the city used on a daily
basis, lots of it underwater. Filthy, greasy, sewage-infested water.
Now, donning your Monday-morning quarterback helmet for a minute, try
to imagine the things you'd do differently, given a recurring scenario.
What are the systems most vital to a city? Having worked for a large
city before, I can tell you anything having to do with revenue generation
is right up near the top. (Of course, that's the case in any organization,
isn't it?) So you would expect taxation, permitting, and other systems
actually bringing in money to be the ones you'd make your highest priority
in a DRBC scenario.
But what about the airport? Human services (food stamps, welfare, and
housing)? Parks? Was there anyone using the parks immediately after the
storm?
How about the emergency services folks like the police, fire, sheriff,
ambulance drivers and paramedics? They have radio systems, wireless networks,
record-tracking databases, jail systems and a host of other requirements.
How about the coroner and environmental protection services? Not to be
too grotesque, but from a practical standpoint, someone has to do something
with those who perished.
Further, what would have happened had a bunch of city workers been killed?
Suppose, for example, you lost a core building housing numerous city workers.
Some sort of event killed a couple thousand while they were working away,
a transformer explosion or building collapse maybe. What kind of issues
would city leaders face in finding people who were savvy about various
in-place, highly esoteric production systems? Imagine trying to bring
in temp workers to labor on a complex, custom-built revenue and taxation
system. Would the tax bills go out accurately and on time?
Doubtful.
This is the very heart of DRBC. And, like the heart, it has two
segments to it:
- What are the systems that we, without exception, must protect, else
our company will go out of business? (Or, our government entity will
cease to function, at least for a prolonged period.)
- Even if we are able to effectively duplicate those systems, if their
normal users are no longer available, how do we come back with at least
a minimum of human functionality?
And, there's a third important element that affects both of these questions:
How quickly can we recover, given a catastrophic event? The speed element
of business continuity is something we'll cover in the next edition of
this series.
Laissez-faire Business Continuity
Maybe for you we've entered the "Why do I care?"
zone. After all, devastating hurricanes don't happen
often, do they? And it was decades ago that Mount Saint
Helens erupted in Washington State, making the entire
mountain look like something a five-year-old might shape
out of a mud puddle.
But consider: On the heels of Katrina, recent devastating
wildfires in Texas have led Governor Rick Perry to beg
the federal government for more money to manage the
devastation Texas has experienced, not just from the
fires, but also from Katrina. Alabama and Mississippi
are well behind New Orleans in recovering from the same
hurricane, and the new hurricane season is just getting
started. Mountainous snow storms occurred in the Northeast
this year. Floods, wildfires, and landslides have had
effects all up and down California. Heat so hot in Phoenix that
Satan checked out early. Not to mention that in just one
day in March of this year, the U.S. set a record for
the number of tornadoes moving across southern states.
One hundred tornadoes, numerous dead, all within a few
terrifying early morning hours.
And we're all tired of the constant onslaught of terrorist
warnings. It seems as though people are either so jaded
they don't want to think about it, or they're resigned to
the shoe dropping at any time. In either case,
the result is inaction.
The point is this: Even though you may think
nothing will happen, there are fairly substantial odds
something will happen. Would you rather have taken the
time to think through the possibilities, proactively
planning your reaction to them, or would you prefer to
be standing by the side of your boss saying "I
didn't think something like this would ever happen!"
It's Risky
In both project management and risk management circles we use a technique
called Risk Identification, Response and Mitigation. Let's talk through
each of these elements, as they have practical import in our DRBC work.
Risk Identification/Prevention
The idea in risk identification is to get in front of a whiteboard
with other decision-makers and stakeholders to try to imagine everything
that could possibly happen. You want as comprehensive a list as possible.
While going through your whiteboard exercise, nothing is too silly -- you
can always erase the silly stuff (though our friend Murphy has a way of
making the silly stuff turn into the risk that actually happens). Your
primary goal is to find ways to prevent risks from occurring in the first
place.
The importance of a Risk Identification team cannot be overstated. You
cannot possibly think of every possible risk by yourself. The higher up
you go in the corporate food chain for input, the clearer the scope
of your efforts will become. Risk identification is an exercise requiring
input from a lot of stakeholders to gain a clear picture.
Next you and the team prioritize these risks in order of the likelihood
you think they might occur. For example, maybe you'll give each risk a
rating between 1 and 100. The more likely something is to occur, the higher
the rating.
Table 1. Risks should be assessed based on their possibility of occurring
in your specific region. An earthquake might get a higher rating than flooding
if you're in California, for example, especially if your company is situated
far from dams, the coast and low-lying valleys.

Risk       | Possibility of Occurrence
-----------|---------------------------
Fire       | 90
Flood      | 80
Earthquake | 70
Landslide  | 60
...        | ...
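To make the prioritization concrete, here is a minimal sketch in Python of how a risk register like Table 1 might be captured and sorted. The Risk class, the names and the ratings below are purely illustrative assumptions, not part of any prescribed DRBC methodology.

    # Minimal sketch of a risk register, using the sample values from Table 1.
    # Ratings are illustrative; real values come from your whiteboard session.
    from dataclasses import dataclass

    @dataclass
    class Risk:
        name: str
        likelihood: int  # 1-100: how likely the risk is to occur in your region

    risk_register = [
        Risk("Fire", 90),
        Risk("Flood", 80),
        Risk("Earthquake", 70),
        Risk("Landslide", 60),
    ]

    # Prioritize: the more likely a risk, the earlier it appears in the list.
    for risk in sorted(risk_register, key=lambda r: r.likelihood, reverse=True):
        print(f"{risk.name:<12} {risk.likelihood}")

However you capture the list, the ranking itself is what matters: it tells you where to spend your prevention and response effort first.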
Risk Response
Once done with risk identification, you and your team will hammer out
responses to each risk. What do we mean by response? It depends on the
risk, of course, but you want to come up with a playbook for each of the
risks identified. You want to know exactly how you'll respond, and what
each involved party will be doing during the response. Of course, once
written up and published, you'll give a copy of these risk responses to
everyone involved, along with some training to help each one understand
their place should a risk arise. And, you'll practice your risk responses
at least once a year.
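To make that playbook idea concrete, one informal way to capture an entry is as simple structured data that can be printed, versioned and handed out along with the training. The sketch below, in Python, is hypothetical; the field names, roles and dates are assumptions for illustration, not a standard format.

    # Hypothetical structure for a single risk-response playbook entry.
    # Every field name, role and date here is an illustrative assumption.
    playbook_entry = {
        "risk": "Fire in the primary data center",
        "trigger": "Smoke or heat alarm in the server room",
        "immediate_actions": [
            "Evacuate staff per the building plan",
            "Notify facilities and the DRBC coordinator",
            "Fail critical systems over to the alternate site",
        ],
        "responsible_parties": {
            "DRBC coordinator": "Declares the event and tracks the timeline",
            "Network team": "Redirects users to the alternate-site systems",
            "Application owners": "Verify data and application integrity",
        },
        "last_reviewed": "2006-05-01",   # revisit at least annually
        "last_tabletop": "2006-03-15",   # practiced at least once a year
    }

Whether the playbook lives in a binder or a file, the point is the same: everyone involved knows exactly what they will do, and when.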
One tuning element for Risk Response is setting up a practice mechanism
where you "tabletop" a risk. Remember in the last article we
said a tabletop is a practice exercise where everyone plays as though
the event has occurred. The idea behind tabletops isn't to see how fit
someone is, or even how ready you are for an event to occur: It is to
make everyone think about those things that are needed but were not initially
included. You're looking for the Ah-hah factor in your tabletops.
Noteworthy: Risk Response implies buy-in from those above
you. It is critical that DRBC planning include sponsorship, buy-in and
participation from the top (or as close to the top as you can get).
Also, it should be pointed out that leaders need to be ready to put
money behind risk mitigation strategies. For example, if you identify
a fire risk, your leadership needs to commit the bucks needed to prevent
a fire from occurring in the first place. In a case like this, risk mitigation
doesn't say "Buy a better extinguisher"; it says "Do
something about this before anything catches fire." This is a subtle
distinction, one that you should pay attention to with every risk.
Risk Mitigation/Remediation
Finally, should a risk rear its ugly head, you pull out the risk response
book and go through the mechanisms to mitigate the event (that is, reduce
its duration and severity) and remediate the damage done.
Continuity of People Skills
When I was a senior manager working on a DRBC plan
at a former place of employment, we were squarely hit
with the "lost the people" phenomenon. If
we lost just the two people who were responsible for the
day-to-day financial calculations for the organization,
there would be no one else in a building of 1,200
people who could successfully sit down and re-create
the business process these two had performed so often they
could practically do it in their sleep.
After some research into the issue, we found a couple
of people who had worked in the position
years ago and who retained some faint recollection
of the process. In a pinch, we felt we could put together
enough of a team to re-create the business process at
a hot site -- never mind that it would be a lame and feeble
re-creation.
We also ran into issues with their current supervisors.
They did not want their employees taking valuable time
away from their current jobs in order to retrain on activities
that had only a minuscule chance of being required.
While everyone totally understood the hesitancy, we
also understood the urgency of the need.
Finalizing the Process
Once this process has been accomplished and the risk response books are
published (expect this to take days to weeks), you can't just let everything
go, never revisiting the process again. Successful risk response teams
revisit things annually, updating and modifying as required to make the
risk response mechanisms sensible for the current environment.
Warm Site, Hot Site
All this discussion of risk and response and who should be in place leads
us to the meat of this article, and back around to our original two questions:
What systems, which people?
Once you've identified all mission-critical systems in the enterprise,
you must now figure out a way to duplicate those systems elsewhere. Recall
in my previous article that I said a warm site is simply the data
replicated elsewhere for easy retrieval. It's warm because we still have
to do something to be able to access the data -- whether "do something"
means hiring contractors to set up systems so we can access it, packing
equipment over to the warm site to retrieve the data, or using some other
mechanism.
The point is this: A warm site implies we're halfway there, not
all the way. A warm site in no way fully protects your organization! It's
like buying a $10,000 term life insurance policy, when you need $500,000
to pay off your house plus make sure your family can continue to live
a normal lifestyle in the event of your death.
The Gear
A hot site, on the other hand, is a site that is completely ready
to go. The data is being replicated over to this site on a routine basis
(timing is everything, as we'll discuss momentarily). But also, computers
similar to the ones used in production are set up at the hot site. They
have the required applications installed on them, they have been connected
to the data, and their functionality has been tested. In the event of
a catastrophe, all people need to do is drive to the site, log on and
begin working.
This sounds easy of course, but it is incredibly difficult to accomplish.
There are numerous considerations:
- Will you outsource the hot-site hosting? There are a lot of great
providers who can deliver this service to you, for a fee. They will
likely not provide similar computers for you. While they may provide
space for you on a SAN for your data requirements, you are responsible
for providing the equipment to connect data and apps.
- Will you develop a hot site at another place in your corporate environment,
or with a corporate partner? In this case, how can you be sure that
the disk space, servers and computers required will be adequate?
- The "similar computers" you provide at your hot site may
not be as robust as the computers your users currently have.
- Conversely, the host (whether servers or SAN) for your data may not
be as robust.
Either case (less robust client computers, or less robust servers and data
repositories) means reduced computing cycles, less throughput and an overall
reduction in business performance.
Most importantly, all of this effort comes at a cost, potentially
a tremendous cost. So-called "C-level" folks (CFOs, CIOs) may
not be inclined to be hugely generous in developing hot-site deployments.
Noteworthy: How frequently you replicate data to the hot site determines
how many hours (or minutes, or seconds) of work your company is behind
the eight-ball, in terms of catching back up after a disaster. If you have
your SAN set to copy a snapshot of the transaction deltas over to the hot
site once an hour, and workers commit 1,000 transactions every hour, then
at the time of a disaster and the bringing up of the hot site you could be
as many as 1,000 transactions behind. Someone will need to somehow re-key
or re-enter those transactions into the system.
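As a back-of-the-envelope sketch of that math, assuming the hourly snapshot and the 1,000-transactions-per-hour rate from the example above (both are just the illustrative figures, not recommendations):

    # Rough illustration of replication lag using the figures from the note above.
    replication_interval_hours = 1.0   # SAN snapshot shipped to the hot site hourly
    transactions_per_hour = 1000       # committed by workers during normal operations

    # Worst case: the disaster strikes just before the next snapshot ships.
    max_lost_transactions = replication_interval_hours * transactions_per_hour
    print(f"Up to {max_lost_transactions:.0f} transactions to re-key after failover")

Shrinking the replication interval shrinks that worst-case backlog; it also raises the cost and load of replication, which is part of the trade-off the budget holders will weigh.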
The Folks
In a famous Sidney
Harris cartoon two scientists are discussing some complex formulas
written on the chalkboard. In the middle there is some writing: "Then
a miracle occurs". The senior scientist says to the junior: "I
think you should be more explicit here in step two." Nowhere is the
A Miracle Happens Here expectation more prevalent than in DRBC.
Even though you have a hot site, if the people who had the technical
expertise to understand and run the applications are gone, you haven't
accomplished a complete DRBC plan. Businesses can't rely on "click-experimentation"
(also known as the "by gosh and by golly" method) -- that is, novice
application users clicking through various menu options in an effort to
figure out how the application works -- for day-to-day operations. You
have to have people who can sit down and "drive" the application
effectively so the business can rebound from the catastrophe.
You can outsource SQL Server DBAs, but you cannot easily outsource customized
enterprise application expertise.
In the next article in this series, we'll explore the development of
a hot site even further, as there are far-reaching ramifications revolving
around its development. "Hot site" is easy to say, but quite
difficult to develop.