Hard Drive Fall Down, Go Boom!

Further investigation of a hard drive failure revealed one tiny but important tip that now lets one company's e-mail server run problem-free.

In February 2001, we brought our e-mail hosting in-house using Exchange 2000. We installed the application on an IBM server with dual Pentium III 800 MHz processors, 1GB of RAM, and four 10,000 RPM 36GB hard drives in a RAID 5 configuration. This server was more powerful than we needed, but we wanted room for growth.

Two months later we started having some problems with our new e-mail server. For no apparent reason it would stop responding: The monitor would go black, and no combination of keystrokes would bring it back. The only way we could bring the server back was to do a hard reboot. The first time the server stopped responding, we chalked it up to a Windows 2000 glitch. However, when the server started crashing on a regular basis, the level of concern increased exponentially.

At the time we had just 125 e-mail users, so the hardware was more than sufficient to handle the traffic. That wasn’t the problem. The server wasn’t going into sleep mode, so that wasn’t the cause either. The event logs were clear (we were logging not just Windows events but also Exchange events). As for performance monitoring, all the commonly used counters appeared to be well within acceptable ranges.

One Saturday in April our e-mail world collapsed. I got a call from the senior IT director at around 10 a.m. He was attempting to use Outlook Web Access from home and it wouldn’t respond. He decided to go into the office and check out the server. He noticed the e-mail server wasn’t responding, so he did a hard reboot. During the boot process, he received a horrible message: Inaccessible Boot Device.

I ran over to the office. We tried another hard reboot, with no luck, so I immediately got on the phone with IBM support. Because the drives were in a RAID 5 configuration, which is designed to survive the loss of a single drive, we should have been able to get the server back up. We were able to determine which of the hard drives was the problem. However, the IBM technician determined that the parity stripe had become corrupt. Thus, the only thing we could do was replace the drive, reinstall the OS, reinstall Exchange and restore from backups. Since we had 24x7x4-hour support, a new hard drive was in my hands within four hours. By about 4 a.m. Sunday the server was back up, and all key employees had been notified by voice mail of the problem and told they might be missing some mail.
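
For readers wondering why one failed drive is normally survivable in RAID 5 while a corrupt parity stripe is not, here is a minimal Python sketch of XOR parity. It is a simplified illustration, not a description of how the IBM RAID controller works; the helper name xor_blocks and the block contents are made up for the example.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across four drives: three data blocks plus one parity block.
# (Illustrative values only.)
d0, d1, d2 = b"MAIL", b"BOXE", b"S001"
parity = xor_blocks(d0, d1, d2)

# The drive holding d1 fails: its block is rebuilt from the survivors and parity.
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1

# If the parity block is itself corrupt, the same rebuild returns wrong data.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
assert xor_blocks(d0, d2, bad_parity) != d1
```

With good parity, any single missing block can be recomputed from the other three. With bad parity there is nothing trustworthy to recompute from, which is consistent with having to fall back on a reinstall and a restore from backups.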

The server was back up, but we still had no explanation as to why the crash occurred. We needed an answer and needed it fast, in case the problem occurred again. We felt it was absolutely a hardware issue, so we continued to work with IBM support. Finally, an extremely bright IBM technician made a discovery. Evidently, a batch of hard drives had been sent out with bad microcode. We downloaded a tool from IBM to examine the microcode on our hard drives and found that the three “old” hard drives in the e-mail server had the bad microcode (the new hard drive was fine). We updated the microcode on these three drives, and our e-mail server has now been running continuously for over a year without any problems.

About the Author

Christopher M. Roscoe, MCSA, CIW, is the senior network administrator at National Packaging Solutions Group, a manufacturer of corrugated boxes.
