Back to Work
Put aside your latest list of one-alarm fires for a few hours and spend some time planning your disaster recovery response.
Despite many predictions outside (and even some inside)
the IT profession, Y2K wasn’t a disaster. Those of you
who sold disaster futures short, congratulations.
But this doesn’t mean disasters won’t strike. It needn’t
be an earth-killing meteor or global plague, but simply
a localized dislocation that directly affects your company’s
information systems. That could range from a broad-based
local problem to one that affects only your company. I
don’t mean a disk failure or controller going out; I’m
talking about the building burning down, getting flooded,
or something equally dramatic in which your entire system
ends up toast. Regardless of the breadth of the problem,
your concern (after your home and family, of course) will
be on your systems—and on what you’re going to do about
it.
Windows 2000 Server has a suite of recovery tools that
includes the Advanced Options menu, the tried and true
Emergency Repair Disk (ERD), and the Recovery Console.
There’s also disk mirroring, RAID, and, of course, the
high-end clustering option. While these services are welcome
indeed, none is an end in itself. Disaster recovery, while
it depends on these and other features, is less a technical
issue than a logistical one.
The first step in getting a handle on disaster recovery
is to have someone in your organization with authority
put a value on the information in your system. A retail
flower shop is going to have a completely different valuation
of its information than a financial institution. Some
companies include the cost of rebuilding the systems and
restoring the information to resume business, while other
companies also look at the lost opportunity costs associated
with a down system.
If You’re Small, It’s Simple
For a small business with a standalone or very small
network, a reasonable disaster recovery plan can be pretty
straightforward. It could simply be a complete hardware
and software inventory—a shopping list that includes everything
necessary to restore a backup. You wouldn’t necessarily
need to buy all of the software again; much of it would
come back with the backup. If disaster strikes, you take the list
to a supplier and begin the tedious task of rebuilding
your system to the point where it can accept the backup
tape or CD that you have religiously been creating and
storing offsite. This is down and dirty, but it works
with two caveats: You’ll lose whatever data was created
after the last backup you have on hand, and it can take
considerable time to rebuild the system.
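As a rough sketch (my own illustration, not something this column prescribes), that shopping list can live as a small, versioned text file that anyone can print and hand to a supplier. The script below assumes a hypothetical inventory.csv with columns for item, vendor, model or version, quantity, and notes.

```python
# Minimal sketch: turn a hardware/software inventory file into a printable
# rebuild checklist. The file name and column names are assumptions for
# illustration only.
import csv

def print_rebuild_checklist(path="inventory.csv"):
    """Print each inventory row as a checklist item for rebuilding the system.

    Expected columns: item, vendor, model_or_version, quantity, notes
    """
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            print(f"[ ] {row['quantity']} x {row['item']} "
                  f"({row['vendor']} {row['model_or_version']}) - {row['notes']}")

if __name__ == "__main__":
    print_rebuild_checklist()
```

Keeping a copy of that file offsite with the backup tapes is the point; the format matters far less than the habit.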
A retail shop’s problems will also be structural—perhaps
rebuilding or finding a new location. But a standalone
service business can be up and running within a day, with
forwarded phones, if the equipment is readily available
from a local supplier.
Obviously, this gets more complicated as you add workstations
and other servers to the scenario. The key to this type
of plan isn’t that it be exhaustively detailed or entirely
comprehensive; it’s that you have a plan at all. When
a crisis is upon you, you need to have steps laid out
to get you out of the situation—and these steps need to
have been decided upon and written out when things were
calm. That’s the only way to ensure you’ll accomplish
what you need to in the most efficient manner possible.
At the other end of the scale is the company that can’t
be down, period. This complicates the plan and adds astronomical
costs; however, in many cases these high costs are still
less than the lost opportunity costs of an information
system failure. An organization that can’t be down must
build and maintain complete parallel systems running concurrently
with periodic data transfers to the backup system, or
even in real time. Very few organizations have this type
of requirement because the cost is prohibitive. Most large
organizations fall in the middle to upper end of disaster
recovery.
These organizations must follow a systematic process
to build a workable written and published plan. The plan
must be detailed enough to allow staff—not necessarily
those who wrote the plan—to get an information system
up, running, and available within the time specified.
Determine the Scope
The first step is to determine the boundaries of what
is considered the information system. Is it only the mainframes
or does it include departmental servers as well? You might
include 100 percent of the users or just a key 25 percent,
or perhaps just certain classes of users, such as accounting.
This isn’t a technical process; it’s a management decision
that needs to be made before the plan is developed. However,
to help the decision process, knowledgeable IT staff like
you should present some choices based on what you know
to be strategic to the business. When you’ve determined
the scope, you’ll use it to develop the options available
to get the system up and running, and to return the entire
system to an acceptable level of availability.
Once the scope is decided upon, you can create a disaster
plan. For example, the objective may be that the mainframes
and departmental servers must be functional within 48
hours, with 95 percent of class A users connected and
25 percent of class B users attached. Within 72 hours,
100 percent of all users must have system access. Obviously,
the real numbers will be determined by management using
a cost-benefit analysis to make decisions. Regardless,
the objectives should be things that can be clearly measured,
in order to calculate the “lost opportunity” cost. An
objective that says “most users will have access to the
systems” can’t be measured, making it useless. If system
downtime costs your company $2 million a day and a loss
of more than $6 million is unacceptable, then the system
must be up within three days, keeping losses under $6 million.
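To make that arithmetic explicit, here is a minimal sketch using the example figures above ($2 million per day, $6 million ceiling); the function name and structure are mine, not part of any formal methodology.

```python
# Minimal sketch of the lost-opportunity arithmetic: how long can the system
# stay down before losses exceed what management has declared acceptable?

def max_allowable_downtime_days(daily_downtime_cost, max_acceptable_loss):
    """Return the longest outage, in days, that keeps losses within the ceiling."""
    return max_acceptable_loss / daily_downtime_cost

if __name__ == "__main__":
    days = max_allowable_downtime_days(daily_downtime_cost=2_000_000,
                                       max_acceptable_loss=6_000_000)
    print(f"System must be restored within {days:.0f} days")  # prints 3 days
```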
This brings up another point. A disaster recovery plan
is a living document. Generally, the lost opportunity
costs grow rather than diminish over time in an organization
dependent upon technology. In addition, you need to keep
an eye on equipment costs to determine the expense of
rebuilding the systems, and the compatibility of the current
software with new machines. In a large company, this alone
can be one person’s reason for being.
Hot Site Service
One alternative to the cost of having a complete redundant
system waiting in the wings is to subscribe to a hot site
service bureau. This is a service organization that maintains
standby backup equipment for several customers in different
geographical locations. Each customer shares in the expense
of maintaining the equipment, thereby spreading the cost.
Another advantage with a hot site service bureau is that
you can stage real-time disaster recovery plan drills
to test the equipment and your procedures.
Regardless of whether you use a hot site service bureau
or maintain your own remote backup location, you also
have to consider user access to the other site. In addition
to equipment redundancy, you need to build a data network
backup system so that users can gain access to the new
system. Again, this can be privately built or subscribed
to through a service provider.
Another component I haven’t addressed here is a voice
network backup plan. You’ll want to consider this as carefully
as your hardware and software backup plan. Where will
your calls be forwarded and how?
Additional Information
MCP Magazine covered the topic of Windows NT, SQL Server, SMS, and SNA disaster recovery in the September 1998 issue, "Prepare for the Worst: What You Need To Know Before Bad Stuff Happens."
The Disaster Recovery Journal at www.drj.com provides sample disaster recovery plans and requests for proposals, along with ongoing editorial on the subject. You'll also find information for the "disaster recovery newbie." You'll have to register for free access, then await a password.
Most large service firms offer recovery and protection services or hot site service. If you purchase your systems from a particular vendor, such as Compaq, IBM, or Dell, check into their offerings.
Avoidance Is Critical
One final thought. As critical as a thorough disaster
recovery plan is to the business, the other important
component is a disaster avoidance plan. Your security
plans should include non-technical measures such as securing
power patch panels, running automatic virus scanning,
locking down workstations and servers, and other obvious
but often overlooked items.
The biggest problem with disaster recovery plans in most
companies is simply that there isn’t one. As a
support professional, take the time to think through what
you’d have to do to completely rebuild your information
system in a different location. After your heart starts
beating again, let management know of your concerns in
writing. Your role as a technical support or design professional
is to outline the implications of each system failure
and then present them in a way that helps management make
a cost evaluation. You can then use that to develop a
plan that addresses the judgments upper management has
made. In other words, it’s a classic CYA situation—with
a professional approach, of course.
About the Author
Michael Chacon, MCSE, MCT, is a directory services architect, focusing on the business and technical issues surrounding identity management in the enterprise. He is the co-author of a new book coming from Sybex Publishing that covers the MCSA 70-218 exam.