Tales from the Trenches: All Through the Night

Wherein our hero spends the wee hours staring down servers, rustling up disks, reaching out for help, and pondering his own stupidity.

It was one of those straightforward “I’ve-done-this-a-million-times-before/should-only-take-a-couple-of-hours” kinds of projects. The task at hand: Install industry-standard backup software (ARCserveIT) onto industry-standard servers (Compaq ProLiants) running an industry-standard operating system (Windows NT 4.0, SP 5) equipped with industry-standard tape drives (Exabyte DATs). I had it all figured out. At about 8 p.m. on a Friday evening, I’d head on over to the co-location facility where our e-commerce servers were housed, pop in the ARCserveIT CD, and make a few mouse clicks. Next, I’d make a test backup, configure the backup schedules, and bring our production Web site back up. I told my wife I’d be home by 11.

When I finally walked out of the data center the next morning, almost 12 hours later, and got in my car to drive home, I reflected on the night’s events.

8 p.m., Friday Evening

Sitting in my office, I go over the game plan. Six NT servers run the company’s Internet e-commerce site. Three of them have internal SCSI tape drives and are currently running NT Backup. I have three ARCserveIT Enterprise licenses, along with a few open-file and SQL Server agents. Management has agreed to take our 24x7 site down for a few hours so I can install and configure ARCserveIT.

I collect the license certificates, grab the ARCserveIT CD, and head out the door for the co-location facility.

8:45 p.m.

After proving to the data center tech that I’m authorized to access our equipment, I’m escorted to our server rack. I’ve been here before, but just to take stock of the servers. Tonight will be the first time I actually get my fingernails dirty. I’m pleased that my predecessor, who was primarily a programmer who also wore the hat of a network administrator, chose to use Compaq servers. But as I struggle to connect the monitor, keyboard, and mouse cables to the keyboard/video/mouse switch (KVM) in the rack, I start to wonder if he really knew what he was doing. After finally getting everything connected, I redirect our dot-com address to point to a maintenance page on another server and get to work.

9:30 p.m.

Because I want to minimize the downtime of our primary Web server, I decide to install on it first. I pop in the ARCserveIT CD and launch the setup program. I accept most of the default options, and after just a few minutes the system is restarting. I call my wife and tell her that I’m only 10 minutes behind schedule, and that I’ll see her in a couple of hours.

While it restarts, I start the same installation on a second server. Once the file copy portion has started up there, I switch back to the first server, which is still initializing that RAID controller.

Suddenly, it offers up a lovely rendition of the infamous BSOD—the Blue Screen of Death! A fatal stop error, involving Compaq’s SCSI controller driver. As I scramble for a piece of paper and something to write with so I can record the particulars of the error, the system reboots itself. That’s when I remember that the Automatic Server Recovery (ASR) was set to restart on errors like this.

I watch the screen as the system restarts. It gets to the RAID initialization again and sure enough, the BSOD returns. It now starts its memory dump. Thankfully, it takes a good 45 seconds or so to write the contents of the 512M of RAM installed on this system, giving me enough time to jot down the significant error information.

11 p.m.

I pause for a moment to figure out how I’m going to fix this. Let’s see, what’s the textbook way to approach this kind of problem? Oh, yeah, the Emergency Repair Disk (ERD)! Only one slight problem. There is no ERD. But, hey, I’m an MCSE—I can fix this.

That’s when it hits me. One of the other servers I’m installing ARCserveIT onto has the same configuration as the one that’s crashed. I’ll just make an ERD from that machine, then use it on the first one. I track down the data center tech and ask her if I can borrow a floppy disk. We both forage through desk drawers and file cabinets and eventually locate a precious disk.

As I eject the newly created ERD, the bubble of elation surrounding me bursts. I realize that I’m also going to need an NT Server CD: the repair utility that reads the ERD runs from NT setup, and setup runs from the CD. Another panicked call to the data center tech!

12:15 a.m., Saturday

“Hi, this is Kevin again. You wouldn’t happen to have an NT Server CD I could borrow?”

“Wow, um, I can call the guys upstairs and see if they know where one is. I’ll get back to you in a bit.” The guys upstairs are the CCIE types, the router gurus who keep the backbone of this co-location facility up and running. I’m sure they won’t know anything about an NT Server CD. But 20 minutes later the tech comes in with a CD. She says the only reason they have one on hand is that another customer accidentally left it behind earlier that day after installing some new systems.

I pop the CD in, start up the system, and wait for that RAID controller to initialize. The bubble of elation begins to form again. It gets to the point where it prompts for the ERD. I put it in and restore the Compaq SCSI driver. Just like clockwork.

Another restart, another round of waiting for that RAID controller to initialize, and another BSOD! Another burst bubble.

2 a.m.

I then realize that the best tool available to me at this point is the telephone. Time to call tech support. But which tech support? Computer Associates for ARCserveIT support? Compaq? Microsoft? As I think about this, I call my wife to let her know it’s going to be a late one. Funny, she doesn’t seem the least bit surprised.

I decide to call ARCserveIT support first. I mean, everything worked before I installed that software, right? It takes me at least 30 minutes to find the correct number to reach after-hours tech support. Once I finally have the number, I’m pleasantly surprised at how quickly I get a live voice on the other end. After asking me all the usual stuff about software version, serial number, platform and more, the support rep asks for a callback number. The way it works, she explains, is that an on-call support engineer will be paged with my callback number. I ask her how long it takes them to respond. Her reply: Anywhere from 10 minutes to two hours. I wait about half an hour, then decide to bark up a different tree.

4 a.m.

Now it’s Compaq’s turn. Maybe they can help me bring my server back to life. In less than 10 minutes I’m on the phone with an actual support engineer. Way to go, Compaq! I describe the situation and, after a pregnant pause, I’m put on hold while the techie does some research.

Ten minutes and she’s back. She wants me to run through the startup routine, describing each step as it occurs. Oh, boy! I get to wait while that RAID controller initializes yet another time. Of course, the BSOD appears, and I read off the pertinent information.

More holding time while she researches again. This time it’s only five minutes. She wants me to describe the system configuration again, especially how the various SCSI devices are set up. There’s the built-in SCSI controller, to which the tape drive is connected, and there’s the Smart 2DH RAID controller. Connected to the RAID controller are three 9G hard drives, configured as one logical drive with two partitions. The CD drive is connected to the built-in IDE controller.

Potential good news: She’s fairly confident that if we disable the tape drive, we should be able to get past the BSOD. I remind her that it’s an internal tape drive, and that the server is stuffed into a rack without sliding rack rails. It could take hours to get to the drive. No problem, she says. Just go into the server configuration utility and disable the SCSI controller. Sounds like a plan!

I reboot the server and press F10 to access the system utilities. What’s this? No system partition can be found! Did the programmer-cum-network administrator not use SmartStart to configure these servers? I check the other servers and, sure enough, none of them has a Compaq system partition. And, of course, I don’t happen to have a SmartStart CD with me either. I’m starting to get really mad at the programmer who set these systems up in the first place.

5 a.m.

I ask the Compaq engineer to hold on a moment while I download the system configuration utility from Compaq’s Web site. I switch the KVM to another server, get the file, and then start the process to create the floppy. But it actually needs several floppies (four, to be exact). All I have is the one we managed to scrounge earlier in the evening. And from that earlier search I know we won’t find any more.

I come up with a plan. After the first floppy is finished I eject it, then switch back to the dead server. I boot from the floppy, and when it prompts for disk number 2, I switch back to the other server, then use that same floppy over again for the second disk. I continue this process for disks 3 and 4, and finally the utility is up and running on the dead server.

I disable the SCSI controller and reboot the server. There’s that RAID initialization again…and no BSOD! The GUI loads, and shortly I’m greeted with the logon screen. As I log in, the system advises me that at least one service has failed to load. I check the event logs and see that the tape driver hasn’t loaded, along with all the ARCserveIT services. That’s fine; that’s what’s supposed to happen with the disabled SCSI controller.

6:15 a.m.

Now we can attempt to rectify the root problem. Just then, the ARCserveIT tech calls, finally responding to the page of four hours ago. I have him call back on the land line, then conference him in with the Compaq engineer. We bring him up to date and discuss the next steps.

The ARCserveIT guy wants to know if we’ve tried the generic SCSI driver instead of the Compaq version. The Compaq tech reluctantly informs us that the on-board SCSI controller is an LSI Logic Symbios board and, yes, there is a standard driver available, but not from Compaq. I go to the LSI Logic Web site and download the generic driver.

I try installing the driver on another server configured like the one that crashed—same model ProLiant, same OS, same tape drive. Works! Then I install ARCserveIT. Still works! Of course, Compaq Insight Manager can’t get all the information from the controller now; to top it off, the Compaq tech tells me her company doesn’t support this driver.

So we’ve narrowed the problem down to the particular combination of the Compaq SCSI controller (with the Compaq driver), ARCserveIT, and the Exabyte Digital Audio Tape drive. They just don’t work together. I uninstall ARCserveIT on both machines, reinstall the Compaq SCSI driver, and bring the Web site back up, knowing that I’ll have to come back some other time to enable the SCSI controller again (this time with a SmartStart CD!).

7:30 a.m.

I call my wife and tell her I’m on my way home. On the following Monday, I sit in the weekly development meeting explaining what happened over the weekend. After I hold the group spellbound for 20 minutes, the product VP asks, “What have you learned from all this?”

“Well,” I reply, “two basic truths: Don’t leave home without installation software and blank media, and programmers make lousy network admins!”
