Survive Chaos

How one company applies its understanding of the psychological issues surrounding troubleshooting to make it an efficient and painless process.

Every engineer’s worst nightmare goes something like this: You arrive at work early Monday morning only to discover no fewer than three Post-it notes and five voicemail messages indicating that nobody can access email. To make matters worse, tape backups failed over the weekend, and you’ve just moved into a new physical location, complete with new server equipment in a new and unfamiliar ATM/WAN environment. On top of it all, the two weeks that management promised for equipment burn-in never materialized.

System logs show the failure sequence, but after looking up the error references in TechNet, you’re no closer to determining the cause of the failure. In the first hour, you’ve narrowed the problem down to five possibilities: hardware failure (SCSI, HDD, NIC); internal Exchange database corruption; third-party applications; a WAN-related transaction or infrastructure issue; or any combination of the above.

Another Methodical Approach
For another perspective on a methodical approach to your work, read Thomas Eck’s article “White-Coat Computer Science.”

Fortunately, you have two engineers available to work on the problem. Management has the good sense to give you up to 10 days to resolve the problem, and pledges to keep users off your back as you work hard to fix this thing. (This is quite rare—usually it’s hours, not days.) You also get authorization to contact Microsoft’s Premier Support for help.

True Story

The scenario I just described recently occurred on a client site I was supporting. I consider myself a good troubleshooter, as does the other engineer who worked on the problem. But it took us more than 200 hours to resolve the problem. Why so long? Several reasons, but primarily the following:

  • The problem was the result of a combination of a faulty SCSI cable, a corrupted database due to IMS site connector problems, and the intrusion of a third-party application (InocuLan 4.5).
  • The mental and physical fatigue of “round-the-clock” efforts.
  • The need for 24-hour wait cycles to certify resolution (representing 25 percent of total hours).
  • The added complication of operating in a new network infrastructure (ATM), with new server hardware and connectivity requirements.

As a result of our ordeal, I decided to share some of the techniques we used to survive and work through these issues. I’ll skip the obvious: the importance of system logs and monitoring and TechNet KnowledgeBase queries—these are basic engineering tools that anyone reading MCP Magazine already uses.

Determine Your Resources Before Troubleshooting the Problem

In our case we had two Windows NT engineers, sympathetic management, TechNet, Internet access to newsgroups, Microsoft’s KnowledgeBase, and Microsoft Technical Support’s connectivity engineers. If you have four or more technically competent engineers, you can split into two teams. With only one team working on the problem at a time, you avoid burning out everyone at once from mental stress and fatigue. You can also break up your troubleshooting into stages, using the second team to escalate while the first team recovers.

Be Wary of the Clock

It’s vital that you keep track of the time you spend at every stage of the problem. Signs of fatigue show up like this:

  • You notice that you become more willing to try things that have higher and higher risk, or you become more risk-averse as your confidence in solving the problem erodes.
  • You become careless, running utilities at the wrong time and/or with the wrong switches (potentially fatal).
  • You become impatient.
  • You begin to lose the ability to do basic mathematical calculations, reasoning, and logic.
  • You experience difficulty remembering locations of data, utilities, the next or last step taken, etc.
  • You begin to go in “circles,” revisiting previously attempted efforts.
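Keeping track of time per stage can be as simple as a small stopwatch that the navigator updates whenever the team moves on. The following is a minimal sketch only: the class name `StageClock`, the stage labels, and the 12-hour hand-over threshold are assumptions chosen for illustration and do not come from the engagement described here.

```python
from datetime import datetime, timedelta


class StageClock:
    """Accumulates time spent per troubleshooting stage."""

    def __init__(self, max_shift_hours=12):
        # Assumed threshold; pick whatever limit your team agrees on.
        self.max_shift = timedelta(hours=max_shift_hours)
        self.totals = {}        # stage name -> accumulated timedelta
        self._stage = None      # stage currently being worked on
        self._started = None    # when the current stage began

    def start(self, stage, now):
        """Close the current stage (if any) and begin timing a new one."""
        self.stop(now)
        self._stage, self._started = stage, now

    def stop(self, now):
        """Stop timing and add the elapsed time to the stage's total."""
        if self._stage is not None:
            elapsed = now - self._started
            self.totals[self._stage] = (
                self.totals.get(self._stage, timedelta()) + elapsed
            )
        self._stage = self._started = None

    def hours(self, stage):
        """Total hours recorded against a stage so far."""
        return self.totals.get(stage, timedelta()).total_seconds() / 3600

    def time_to_hand_over(self, shift_started, now):
        """True once the current shift has reached the agreed limit."""
        return now - shift_started >= self.max_shift
```

Passing the current time in explicitly, rather than reading the system clock inside the class, keeps the sketch easy to check by hand and lets you back-fill times from handwritten notes.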

As a general rule, the more time that passes before a solution is discovered, the higher the probability that the problem won’t be resolved. In the scenario previously described, we changed many of the variables involved with the problem: service pack levels, physical disk arrays, and SCSI cabling. We also made modifications specific to NT, Exchange, InocuLan, and ArcServe. These steps were necessary because we couldn’t isolate a single cause for the problem.

Our approach involved verifying the integrity of each component to eliminate as many components as possible from our list of possible causes. Unfortunately, we began this process some 80 hours into the problem. Mental fatigue had already begun to set in, making decision-making difficult and logical thought nearly impossible. What saved us? From the beginning we made and stuck to three critical decisions, which follow.

Document, Document, Document!

This document isn’t meant for management; it’s your own lifeguard. This vital tool contains your checklist, lists your results with dates and times, and answers the questions “What have we done?” and “What do we do next?”

If you find yourself troubleshooting the same problem hour after hour, you’ll consider yourself both lucky and a genius for having had the foresight to create this document. “A Troubleshooting Documentation Sample” shows what we recorded.
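We kept our document as plain text, but the same structure is easy to hold in a few lines of code. The sketch below is one possible shape, not the tool we used: the class `TroubleshootingLog` and its method names are hypothetical, and it simply pairs a checklist with timestamped results so that the two questions above always have an answer.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TroubleshootingLog:
    """One phase of the working document: a checklist plus timestamped notes."""

    phase: str
    steps: list = field(default_factory=list)    # [description, done] pairs
    entries: list = field(default_factory=list)  # (timestamp, note) tuples

    def plan(self, description):
        """Add an agreed-upon step that has not been carried out yet."""
        self.steps.append([description, False])

    def complete(self, description, note, when=None):
        """Tick off a planned step and record its result with the time."""
        for step in self.steps:
            if step[0] == description:
                step[1] = True
                break
        else:
            # Refuse to log work that was never on the checklist.
            raise ValueError(f"step was never planned: {description!r}")
        self.entries.append((when or datetime.now(), note))

    def what_have_we_done(self):
        return [desc for desc, done in self.steps if done]

    def what_do_we_do_next(self):
        return [desc for desc, done in self.steps if not done]

    def render(self):
        """Plain-text view: checkbox list, blank line, then dated results."""
        lines = [self.phase, ""]
        lines += [f"[{'x' if done else ' '}] {desc}" for desc, done in self.steps]
        lines.append("")
        lines += [f"{ts:%Y-%m-%d %H:%M}: {note}" for ts, note in self.entries]
        return "\n".join(lines)
```

Rejecting unplanned steps in `complete` is a deliberate choice: it forces every action onto the checklist before it is run, which is exactly when a tired team is most tempted to skip the paperwork.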

Decide Who’ll Drive and Who’ll Navigate

This is a vital step, no matter how involved the problem is or how many IT resources you have to throw at the problem. Only one person should be “driving,” or executing console/command line parameters, launching utilities, and so on. The so-called navigator makes sure each agreed-upon step is executed, maintains the document as you go, and takes the conservative position during risky phases. In general, you want the more aggressive of the two engineers driving the process. But remember, if there’s no mutual trust, competence, or respect, these roles won’t work.

For example, the driver on our team ran ISINTEG (the Exchange database integrity utility), which reported no errors. The next agreed-upon step was to do an online backup of the “clean” database before re-creating the site connector. The driver asked if we should go ahead and create the connector, but the navigator said no and held us to the planned step of doing the tape backup.

Although both driver and navigator are responsible for keeping the team on course with all established strategies, the burden usually falls on the navigator—and rightly so. If you’re succeeding in your troubleshooting, there’s a tendency to rush to the end steps. Remember, seeing progress doesn’t mean the issue is resolved. Resist taking shortcuts, and execute your plan to the end. It’s the only way to be certain.
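The navigator’s veto can be pictured as a simple rule: the driver may run only the next step in the agreed plan. The sketch below illustrates that rule under assumed names (`AgreedPlan`, `OutOfOrderStep`, and the step labels are hypothetical); in practice the “enforcement” is a colleague saying no, not software.

```python
class OutOfOrderStep(Exception):
    """Raised when the driver proposes anything but the next planned step."""


class AgreedPlan:
    """Ordered list of steps both engineers agreed to before starting."""

    def __init__(self, steps):
        self._steps = list(steps)
        self._position = 0  # index of the next step to execute

    def next_step(self):
        """Return the step that must run next, or None when the plan is done."""
        if self._position < len(self._steps):
            return self._steps[self._position]
        return None

    def navigator_approves(self, proposed):
        """The navigator approves only the next planned step, never a shortcut."""
        return proposed == self.next_step()

    def driver_executes(self, proposed, action):
        """Run `action` only if the navigator signs off on `proposed`."""
        if not self.navigator_approves(proposed):
            raise OutOfOrderStep(
                f"next planned step is {self.next_step()!r}, not {proposed!r}"
            )
        result = action()
        self._position += 1  # advance only after the step actually ran
        return result
```

With a plan of “run ISINTEG”, “online backup”, “create site connector”, a request to create the connector right after ISINTEG is refused until the backup has been done, which mirrors the exchange described above.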

A Troubleshooting Documentation Sample
Phase III Troubleshooting

[x] Removed IMS site connector.
[x] Discussed pros/cons of X.400 vs. MTA site connector. X.400 was
ruled out, since using it would require Exchange to do data
conversion to and from other Exchange sites.
[x] Replication is scheduled at 3:30pm and 4:30pm.
[x] IS Maintenance scheduled to run from noon to 4pm.
[x] MTA site connector built; replication objects were received.
[x] Exchange users were removed from Exchange server access.

[ ] Applied SP3 hotfix—Consulting w/Microsoft as to hotfix vs. SP4
[x] ISINTEG—test alltests run against pub database.
[x] ISINTEG—test alltests run against priv database.
[x] Online backup
[x] All MSExchng services started (to run w/out Inoculan enabled).

Phase IV Troubleshooting

[x] Scenario A - If no errors occur:
      Hardware is eliminated as a cause.
      Inoculan could still be a cause.
      IMS connector could still be a cause.
    [ ] Inoculan is started inbound/outbound.
          If no errors occur: IMS connector is probably the cause.
          If errors occur: Inoculan is probably the cause.

[ ] Scenario B - If errors occur:
      Hardware could still be a cause.
      Problem is independent of Inoculan.
      Problem is possibly outside of Exchange (infrastructure).

8:46am: No errors occurred overnight. ISINTEG verify on the private
and public databases indicated no errors or warnings. ESEUTIL verify
on the DS indicated no errors or warnings.
8:50am: Started Inoculan and stressed Exchange through forced DS
and IS maintenance. Anticipated runtime 2-3 hours.
9:15am: DS replication occurs. Event successful.
9:30am: IS maintenance request occurs. Event successful.
10:00am: Running Exchange Optimizer. Services shut down
cleanly. 15 min.
10:15am: Performed on-line backup
10:30am: Created IMS site connector; checked integrity
11:45am: If no errors have occurred, starting IMS site connector, and
allowing users to work in email. (Exchange is at 100% functionality.)
If no errors occur, problem will have been isolated to the IMS
site connector.
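The Phase IV scenarios in the sample amount to a small decision table: two observations go in, and a shortened list of suspects comes out. As a rough sketch (the function name and return strings are illustrative, and the logic covers only the branches written in the sample), it could be expressed like this:

```python
def phase_iv_conclusions(errors_without_inoculan, errors_with_inoculan=None):
    """Map test outcomes to the suspects the sample log still considers open.

    errors_without_inoculan: did errors occur overnight with Inoculan
        disabled and the IMS site connector removed?
    errors_with_inoculan: did errors occur after Inoculan was restarted?
        None means that follow-up test has not been run yet.
    """
    if errors_without_inoculan:
        # Scenario B: the fault shows up without Inoculan running at all.
        return ["hardware (possible)", "infrastructure outside Exchange (possible)"]

    # Scenario A: hardware is eliminated; two suspects remain.
    if errors_with_inoculan is None:
        return ["Inoculan (possible)", "IMS connector (possible)"]
    if errors_with_inoculan:
        return ["Inoculan (probable)"]
    return ["IMS connector (probable)"]
```

Writing the branches down before running the test is the point: the team agrees in advance what each outcome will mean, so nobody reinterprets the result at 3 a.m.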

Resist Changing More than One Variable or Condition at a Time

Your best chance of resolving a problem is within a static environment, where the problem recurs reliably in the same sequence under the same conditions. Symptoms don’t change on you, and you have a chance to attack the issue logically and eventually resolve it.

The typical solution that vendors give their customers is a patch, upgrade, or service pack. These fixes, when applied intelligently, can be very powerful tools for resolving many technical issues—but you should consider the larger picture outside of the vendor’s product line. You probably have very specific technical requirements in your LAN/WAN environment, third-party or proprietary applications, and infrastructure issues that may require you to be at a certain hardware or software version or patch level.

A crisis isn’t the time to apply a patch or other upgrade. Ideally, that should be done first in a test environment. But since we’re talking about the real world, if you’re going to apply a patch during a troubleshooting process, be sure you can reverse it in the event that it makes matters worse or has no effect whatsoever.
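The discipline of one change at a time, each with a way back, can be sketched as a loop. This is an illustration under stated assumptions, not a deployment tool: `try_changes_one_at_a_time` is a hypothetical helper, and each change is represented by a pair of callables you would supply yourself (for example, “swap the cable” and “put the old cable back”).

```python
def try_changes_one_at_a_time(changes, problem_still_occurs):
    """Apply candidate changes singly, reverting any that do not help.

    changes: list of (name, apply, revert) tuples, where apply and revert
        are zero-argument callables.
    problem_still_occurs: zero-argument callable that re-runs the test and
        returns True while the fault is still present.

    Returns the name of the first change that clears the fault, or None.
    """
    for name, apply, revert in changes:
        apply()
        try:
            still_broken = problem_still_occurs()
        except Exception:
            # The test itself failed; restore the environment before bailing.
            revert()
            raise
        if not still_broken:
            return name  # keep this change in place
        revert()         # no effect: restore the known baseline
    return None
```

Note the limitation: a loop like this finds a single cause. When the fault is a combination, as ours was, no one change clears it and the function returns None, which is itself useful evidence that you are dealing with more than one problem.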

In the end, we spent more than 200 hours troubleshooting these problems. The organization lost five business days’ worth of data and approximately $75,000 to $100,000 in productivity. It could be argued that, had we had that two-week burn-in period that was originally requested, this failure might never have been experienced in the production environment.

Troubleshooting isn’t an exact science. By applying solid, logical approaches and establishing a clear plan of attack, you can work through even the most severe issues.
