Resurrection from the Blue Screen of Death
In an ideal IT world, restoring data to a new server from a dead one wouldn't be a problem: Just rebuild the data and apps from your backup. Welcome to the less-than-perfect world of IT.
- By Jim Richards
During a disaster recovery test, you're almost always dealing with restoring
data from tape to a platform that doesn't contain the same hardware. If
you perform a full system restore and the OS won't boot successfully,
your test fails miserably. You lose precious hours in these situations
while administrators scurry looking for answers. In a real-world disaster
recovery scenario, this could cost your company thousands of dollars in
downtime and could even put you out of business altogether.
A colleague and I were attempting to restore our main application server
and ran into this problem. In this case, we were attempting to restore
a system originally housed on a Compaq ProLiant 6500 with a Smart 2DH
controller, to a Compaq ML570 with a SmartArray 5300. Fortunately, we
attempted the restore on our test network first to document any problems
before the actual upgrade of the system. Our operating system was NT Server
4.0 Enterprise Edition.
One, Two, Three Strikes You're…
Our restore kept blue screening on startup,
with the following error message: STOP: 0x0000007B INACCESSIBLE_BOOT_DEVICE.
We tried everything to get around it. First, we tried booting into the
last known good hardware profile. That would've been too easy. Strike
one. Then we tried booting using an NT boot disk with a boot.ini file
lifted from another Compaq server with a "HAL Recovery Option." Strike
two. Finally, we replaced the controller with a spare Smart 2DH we had,
created a new RAID volume, loaded NT, and ran the restore again. It worked!
Then it hit me…what if? What if this had been a real disaster, and we
were sitting in some room down in Philadelphia with nothing but some foreign
hardware and our tapes? Disaster recovery vendors do a good job of matching
up your equipment, but they don't carry everything. What if they didn't
have the controller we needed? Knowing that our CIO would be asking us
the same question, we decided that we'd better find a workaround.
It turns out we'd been lucky. All the times we had done off-site disaster
recovery testing, we'd never run into this problem. I kept remembering
what our disaster recovery vendor had said the last time we were in Philly:
Other companies typically restore only their data drives. They re-create
their shares and reinstall applications from scratch. What? That sounded
absurd to us. Do these guys know how customized our systems are? That's
when I realized that this problem was affecting other companies as well,
no matter what hardware was being used.
We knew the hardware in the new system was different and this was causing
the problem. We called Compaq and learned that the new RAID controller
was using a similar chipset. This was causing the restored OS to think
the driver it was supposed to load was correct for the device physically
present in the system. Anyone who has done NT builds has experienced the
problem of a system blue screening after installing new hardware (such
as a video controller).
Once we realized what was going on, we attempted (through various support
channels) to find a workaround for the problem, without success. We theorized
that if we performed a parallel installation of NT, we should be able
to mount the registry of the (original) target build and change the relevant
settings. Then we found a TechNet article (Q198859, "Starting Windows
NT from a Replacement SCSI Adapter of a Different Type") that outlines
the steps required to start NT from a replacement SCSI adapter. That was
all we needed.
You can combine the steps from the Q article and the following real-world
example in a disaster recovery situation to restore systems with dissimilar
RAID devices and even overcome other types of hardware conflicts.
The Registry is Key
After you've done your restore from tape, boot the system and make a note
of the .SYS drivers that failed to load when the Blue Screen of Death
appears. These will be specific to your original hardware. The common
drivers for Compaq RAID controllers are outlined in Table 1.
The driver disk supplied by your manufacturer will give you the name
of this driver. The .INF file included with the driver disk can be opened
in Notepad to discover the driver name related to your controller type.
Having good documentation here really helps.
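As a rough illustration of that lookup, the Python sketch below scans the text of a driver disk's .INF file for .SYS file names. The function name find_sys_drivers and the sample .INF fragment are hypothetical, made up for this example. Real .INF layouts vary by vendor, so treat the result as a list of candidates to check against your documentation.

```python
import re

# Matches bare file names ending in .sys, case-insensitively.
SYS_NAME = re.compile(r"[A-Za-z0-9_\-]+\.sys\b", re.IGNORECASE)


def find_sys_drivers(inf_text):
    """Return the distinct .SYS file names mentioned in .INF text, in order."""
    seen = []
    for line in inf_text.splitlines():
        # Everything after a semicolon is a comment in .INF files.
        code = line.split(";", 1)[0]
        for name in SYS_NAME.findall(code):
            lowered = name.lower()
            if lowered not in seen:
                seen.append(lowered)
    return seen


# Hypothetical .INF fragment, not taken from a real driver disk.
SAMPLE_INF = """\
; sample only
[SourceDisksFiles]
newraid.sys = 1
[Files.Driver]
NEWRAID.SYS
readme.txt
"""

if __name__ == "__main__":
    print(find_sys_drivers(SAMPLE_INF))
```

Opening the file in Notepad gets you the same answer; a scan like this only helps when you have several driver disks to sort through.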
Perform a parallel build of NT and leave the current file system intact,
with no changes. Use a directory name like "sos" for the system directory.
You won't need network support. Log into the "sos" build as the administrator
and start regedt32.exe.
The first problem may be getting your parallel build completed. After
popping in your NT Server CD, you may realize that NT Setup has also detected
the RAID controller incorrectly. Setup inevitably fails with a "hard disk
not found" error. (You won't be able to run Compaq's SmartStart program
and load the OS with driver support because it would require a full system
erase, which defeats your purpose.) You need the OS that's already on
the system left intact.
Have the correct drivers for the new RAID controller and make setup load
support for the device by pressing F6 repeatedly when the first NT Setup
(blue) screen appears. This disables the auto detection of devices during
setup. Doing this allows you to specify the driver for your RAID controller
manually. Do so by pressing "S" to specify additional SCSI devices when
prompted. You'll also be prompted to overwrite the NT common files in
C:\Program Files\Common Files. Choose "no to all." This prevents setup
from overwriting your common files that were updated by later service
packs applied to your original installation.
The next step is to save two keys to .REG files:
Drivername is the name of the RAID driver in use by the "sos" build.
It's the driver loaded during the "sos" build. Again, use the chart to
figure out the key name based on your Compaq RAID controller type.
Then load the system hive from the original installation into regedt32.
It should be in the winnt\system32\config directory and will be called
simply "system." Create and restore the two keys you saved in the two
.REG files to their appropriate locations in the loaded "winnt" system
hive. You may have to set the security at the root of this loaded hive
to "everyone" = "full control" in order to perform the restore. This will
give you the keys you need to boot the system.
It's important to disable the original device drivers that are causing
the blue screen. Do this by changing the "Start" value from "0x0" (boot) to "0x4" (disabled)
for these keys:
Old.drivername will be the name of the RAID driver that was in use by
the original "winnt" build. It's also the driver causing the blue screen.
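For reference, the "Start" value of a driver key tells NT when to load that driver. The Python sketch below lists the standard values and shows the change described above as a before-and-after on a plain dictionary. It doesn't touch a real registry; the driver name olddriver and the function disable_driver are hypothetical placeholders.

```python
# Standard meanings of a driver or service key's "Start" value on Windows NT.
START_TYPES = {
    0x0: "Boot",       # loaded by the boot loader
    0x1: "System",     # loaded during kernel initialization
    0x2: "Automatic",  # started by the Service Control Manager at startup
    0x3: "Manual",     # started on demand
    0x4: "Disabled",   # never loaded
}


def disable_driver(drivers, name):
    """Return a copy of `drivers` with the named driver's Start set to 0x4."""
    updated = {key: dict(values) for key, values in drivers.items()}
    updated[name]["Start"] = 0x4
    return updated


# In-memory stand-in for the driver keys in the loaded hive.
# "olddriver" is a hypothetical name, not a real Compaq driver.
drivers = {"olddriver": {"Start": 0x0}}
after = disable_driver(drivers, "olddriver")

if __name__ == "__main__":
    print(START_TYPES[drivers["olddriver"]["Start"]], "->",
          START_TYPES[after["olddriver"]["Start"]])
```

A boot-start driver (0x0) that doesn't match the controller is what stops the system before it can reach the disk, which is why the old driver has to be set to disabled rather than left in place.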
Unload the "winnt" system hive. This will save the new settings. Then
close regedt32. Don't forget to copy the "new.drivername.sys" file from
the "sos" build to the "winnt" build. This is so the "winnt" build can
find the driver specified in the keys you just added. The default location
for these drivers is %systemroot%\system32\drivers\. If you can't see the
.SYS files in Windows NT Explorer, set its view options to "show all files."
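The copy step can be sketched in Python as follows. This is only an illustration and assumes a Python interpreter is available on the machine, which normally won't be true of a bare recovery build; copying the file in Explorer or at a command prompt does the same job. The function name copy_driver is hypothetical.

```python
import shutil
from pathlib import Path


def copy_driver(sos_root, winnt_root, driver_file):
    """Copy one driver file from the parallel build to the original build.

    Both roots are system directories (for example C:\\sos and C:\\winnt);
    the driver is expected under system32\\drivers in each.
    """
    source = Path(sos_root) / "system32" / "drivers" / driver_file
    target_dir = Path(winnt_root) / "system32" / "drivers"
    if not source.is_file():
        raise FileNotFoundError(f"driver not found in parallel build: {source}")
    if not target_dir.is_dir():
        raise NotADirectoryError(f"drivers directory missing: {target_dir}")
    target = target_dir / driver_file
    # copy2 keeps the file's timestamps along with its contents.
    shutil.copy2(source, target)
    return target
```

With the directory names used in this article, the call would look like copy_driver(r"C:\sos", r"C:\winnt", "new.drivername.sys"), where the file name is the placeholder from the text rather than a real driver.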
And that's it. Reboot and specify the "winnt" build when prompted at
start-up. Your system will use the new controller drivers and should boot
successfully. Once you're comfortable that the system's booting properly,
you can remove the "sos" directory and any references to it in the boot.ini.
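Cleaning up boot.ini can be sketched the same way. The Python function below drops the [operating systems] entries that point at a given system directory and refuses to proceed if the default entry still points there. The function name remove_build and the sample file are hypothetical; the ARC paths in the sample are typical of a single-disk system and will differ on yours.

```python
def remove_build(boot_ini_text, directory):
    """Return boot.ini text without entries whose system directory is `directory`."""
    suffix = "\\" + directory.lower()
    kept = []
    for line in boot_ini_text.splitlines():
        # Entries look like  <ARC path>\<dir>="Description" /switches
        key, sep, value = line.partition("=")
        key = key.strip().lower()
        if sep and key == "default" and value.strip().lower().endswith(suffix):
            raise ValueError("default entry still points at the build being removed")
        if sep and key.endswith(suffix):
            continue  # this entry boots the build being removed
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical boot.ini for illustration only.
SAMPLE = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00"
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00 [VGA mode]" /basevideo /sos
multi(0)disk(0)rdisk(0)partition(1)\sos="Parallel build"
"""

cleaned = remove_build(SAMPLE, "sos")

if __name__ == "__main__":
    print(cleaned)
```

Note that the match is made on the path before the equals sign, so the /sos switch on the VGA mode entry (which has nothing to do with the "sos" directory) is left alone.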
Perhaps the most important thing we learned during this process was not
to panic. Nothing in IT is truly new. Somebody out there has experienced
what you're going through at least once before. Thorough, hard-copy documentation
of your equipment resources will serve you well in a crisis. Funny how
the old rules still apply.
Stan Jourdan, network administrator, also contributed to this article.
Jim Richards, MCSE, MCP+Internet, is a network engineer in Boston, Massachusetts. He can be reached at firstname.lastname@example.org.