Windows 2000: Troubleshooting Shock Troops -- Microsoft Certified Professional Magazine Online

Bow down before troubleshooting's greatest. These Compaq pros dispense their Windows 2000 wisdom to make you an expert on network repair.

By Gary Olsen et al.
08/01/2001

Nobody knows troubleshooting like Compaq. The company's Global Services operation has 15,000 consultants building and maintaining Microsoft-based enterprise solutions globally. More than 3,200 of them are Windows 2000-certified. They're so good at what they do, they supported Microsoft's beta customers for the OS during those companies' deployments.

These guys have seen it all in the course of their work, from bone-headed migration moves (read on for details!) to brilliant and elusive technical mysteries. They share what they know with each other—around the world. When a problem arises, chances are, somebody else in the organization has experienced the same dilemma—and has derived a solution.

And that's why MCP Magazine asked a core group of them to share their best troubleshooting secrets. What they proposed was massive—almost too comprehensive for a single magazine article. So we let them pick out and identify a number of problems—and their solutions—to share with you. These final choices are dilemmas experienced by a large number of people; they're serious enough to warn you about beforehand; or they help resolve a variety of related issues, such as replication problems. We've divided the troubleshooting evils into four categories: setup and installation, Active Directory (AD), networking and clustering. Read and learn.

Setup and Installation

Problem: I've implemented about 50 Remote Installation Services (RIS) servers throughout my organization, but we only have one image. Several of these servers are experiencing problems with insufficient disk space. There's a Single Instance Store (SIS) Common Store directory that has copies of all the files in the image, which seems to be used for multiple images so they can share these files. If I have only one image, can I delete the SIS Common Store and recover the disk space?

Solution: No. Deletion of the SIS Common Store directory will prevent RIS image files and any other application with files that have been converted to reparse points from accessing the backing file containing the data. In short, it'll break RIS and, possibly, other applications installed on that partition.

The function of the SIS Common Store, included when RIS is installed, is to conserve disk space by eliminating duplicate files on an NTFS volume. The two SIS components that RIS installs are SIS filter driver and SIS Groveler. SIS Groveler scans for files that are identical to one or more files on the NTFS volume using signatures and byte-by-byte comparison. It then reports the file to the SIS filter driver that creates the SIS link (NTFS reparse points), copies the file to the SIS Common Store Folder, and renames it with an arbitrary 128-bit globally unique identifier (GUID) with a .SIS extension. The original files are changed to reparse points with a "size on disk" equal to the default cluster size of the disk in most cases. Only files larger than 32KB are processed by SIS Groveler. Therefore, we can now have many instances of a file represented by reparse points link to the actual data for that file stored in the SIS Common Store Folder. The file in the SIS Common Store Folder is also called the "backing file" and contains the data. Figure 1 is an example of how ntoskrnl.exe is copied to SIS Common Store and renamed. The lower box shows the location and contents of the SIS Common Store folder, located at the root level of the drive. It contains the actual files where the reparse points are directed.

Figure 1. How ntoskrnl.exe is copied to SIS Common Store and renamed. The lower box shows the location and contents of the SIS Common Store folder containing the files where the reparse points are directed.
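To make the Groveler's job a little more concrete, here's a minimal Python sketch of the same idea: find files larger than 32KB that share a signature, then confirm each match byte by byte. It's an illustration only, not how SIS is implemented. The SHA-256 signature, the function names and the directory walk are our assumptions, and the sketch only reports duplicates; it doesn't create reparse points or touch the Common Store.

```python
import hashlib
import os
from collections import defaultdict

# The article states that only files larger than 32KB are processed.
MIN_SIZE = 32 * 1024


def file_signature(path, chunk_size=65536):
    """Return a SHA-256 digest of the file (an assumed stand-in for the SIS signature)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b, chunk_size=65536):
    """Confirm a signature match with a byte-by-byte comparison."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:
                return True


def find_duplicates(root):
    """Return groups of byte-identical files larger than MIN_SIZE under root."""
    by_signature = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            size = os.path.getsize(path)
            if size > MIN_SIZE:
                by_signature[(size, file_signature(path))].append(path)
    groups = []
    for paths in by_signature.values():
        first = paths[0]
        matches = [first] + [p for p in paths[1:] if same_bytes(first, p)]
        if len(matches) > 1:
            groups.append(sorted(matches))
    return groups
```

Each group returned is a set of files that, under SIS, could share one backing file, which is why deleting the Common Store breaks every one of them at once.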

One caveat: The backup/restore software must be SIS link-aware. Ntbackup is SIS link-aware and will call SISbkup.dll to back up and restore properly. Third-party backup solutions have to know how to call SISbkup.dll to work properly.

For further information on RIS, see the Windows 2000 Server Resource Kit, "Distributed Systems Guide," Chapter 24.

Problem: I just added a new video card and now my system won't boot. How can I recover without reinstalling Win2K?

Solution: In Windows NT 4.0, there were several answers to this:

Boot to Last Known Good (which sometimes works).
Use the Emergency Repair Disk (which no one ever has available or updated).
Create a "parallel install." Create a new installation of NT on another partition on the disk, boot to that OS and go to the broken configuration and remove the driver.

Fortunately, Win2K gives us some tools to repair this problem without a parallel install.

If Last Known Good doesn't work and there's no system state backup, this can be corrected with either Safe Mode Boot or the Recovery Console.

Safe Mode Boot is much like Safe Mode Boot in Windows 95 or Windows 98. You can start in Safe Mode by pressing F8 at the boot loader screen, then selecting Safe Mode. This boots the system with a minimum set of drivers and services, which allows you to perform tasks such as disabling a driver or service, including the one causing the problem. Options for Safe Mode are basic Safe Mode, which starts the system with basic drivers; Safe Mode with Networking, which is similar to Safe Mode but includes networking services for connectivity; and Safe Mode with Command Prompt, which starts only a command prompt instead of the GUI.

The Recovery Console is a new command-line tool for repairing a system that won't start. You have three options for invoking the Recovery Console: booting from the Win2K CD; booting from the startup floppies; or selecting the Recovery Console from the boot loader screen (assuming it's been installed). Here are the console options:

Copy—Copies files to another location or name.
Del—Deletes files.
Disable—Disables services or drivers.
Fixboot—Writes a new boot sector.
Fixmbr—Repairs Master Boot Record, much like FDISK /MBR in DOS.

The Recovery Console also can be customized. For example, you can install it as part of a large deployment by using winnt32.exe /cmdcons /unattend.

Active Directory

Problem: I've heard that Win2K has a limit of about 250 sites. Our deployment will require more than 1,000 sites. I've read somewhere that if you have that many sites, you should turn off the Knowledge Consistency Checker (KCC), but that seems like a drastic step. What should I do?

Solution: This is a much advertised and much misunderstood issue. During the Win2K beta, Compaq was one of the first to see the problem. The more sites, DCs, and the like you have, the longer it takes the KCC—which by default runs every 15 minutes—to do its job. When it fires up, it takes about 90 percent of the CPU of one processor on every DC (staggered). So the "limit" is whatever you can live with, remembering that you give up 90 percent CPU utilization on the DCs.

We believe that with proper design and implementation, there's no need to turn off the KCC. Doing so would force you to do all the KCC's work manually, including creating transitive links, routing around trouble spots, creating and cleaning up connections, forming the topology using the spanning tree algorithm, adjusting for failed bridgehead servers, and so on. I don't believe this is practical.

Replication Repair Tip

When it comes to replication repair, we've found that it's important to be patient. After making changes, you can try forcing replication (Replication Monitor has the ability to push the changes out to the enterprise), but it's surprising how many issues get resolved by just waiting and letting replication move the changes out naturally.

There are several options you can choose from to get around this limitation. An excellent reference is KB Q244368, "How to Optimize Active Directory Replication in a Large Network," that provides equations to predict the KCC time based on number of sites and domains, as well as good descriptions of workarounds to this problem.

One is to turn off Auto Site Link Bridging. Using the equations in Q244368, if you have 1,000 sites and five domains, the KCC time is about 45 minutes. That means the KCC takes 45 minutes to do its job (eating 90 percent of the DC's CPU), sleeps for 15 minutes, then fires up for another 45 minutes. So out of every hour, your DC gets 15 minutes of CPU to do other things. Not good. However, if you turn Site Link Bridging off, this drops the KCC time to about three minutes, eliminating the problem. This eliminates transitive site links, but in a pure hub and spoke configuration, this isn't usually a problem. You can build some "backup" links if you want some redundancy and don't want the KCC to do it.
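The arithmetic behind that "15 minutes out of every hour" is easy to check. The short Python sketch below assumes, as the description above does, that the KCC runs for its full duration and then sleeps for the 15-minute interval before starting again. The function name is ours, and the run times are simply the two figures quoted above, not values computed from the Q244368 equations.

```python
def kcc_busy_fraction(run_minutes, interval_minutes=15):
    """Fraction of each run-then-sleep cycle that the KCC spends running."""
    return run_minutes / (run_minutes + interval_minutes)


# Run times quoted in the article for 1,000 sites and five domains.
for label, run in (("bridging on", 45), ("bridging off", 3)):
    busy = kcc_busy_fraction(run)
    free_minutes = 60 * (1 - busy)
    print(f"{label}: KCC busy {busy:.0%} of the time, "
          f"about {free_minutes:.0f} free minutes per hour")
```

With bridging on, the KCC is busy 75 percent of the time; with bridging off, that falls to roughly 17 percent, leaving about 50 minutes of every hour for other work.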

Another method is to use Super-Sites, which Compaq employs. Rather than having every location defined as a site, collect several locations into a single site. Because this forces replication in those sites to intra-site parameters (no data compression, urgent replication, and so on), Compaq requires at least a 2MB link between these sites. Even though Compaq has a number of physical locations in Canada and Japan, because of the high-speed links between locations, we only needed to define two Active Directory sites in Canada and two in Japan. Using Super-Sites reduced 700 locations to about 80 sites.

In addition to the design resolutions just noted, there are some technical ways to solve this problem. Schedule the KCC to run at certain times on each DC, thus controlling when the CPU is hit. Load balancing is also an issue when you have more than 100 satellite sites replicating to a single hub site and one Bridgehead Server (BHS). With manual intervention you can configure multiple BHSs to share the load. Because both of these issues are more critical in a branch office environment where locations are connected with VPN links, Microsoft recently published an excellent white paper, "Active Directory Branch Office Planning Guide." It includes a set of scripts and procedures aimed at scheduling the KCC and building connections for load balancing. Tools of this nature are critical if you plan on turning off the KCC. You can download the white paper at www.microsoft.com/WINDOWS2000/techinfo/planning/activedirectory/branchoffice/default.asp.

By the way, Microsoft has promised that Windows 2002 will improve the performance of the KCC significantly, so this problem should go away. [See "Sonic Boom! Windows 2002 Smashes the Barriers" in the July 2001 issue of MCP Magazine for more on this. —Ed.]

Problem: I get Event 1000 and 1001 errors in Application Event Log in five-minute intervals; Group Policy is not taking effect; or \%windir%\sysvol\staging and ...\staging areas folders have large quantities of files.

Solution: This is usually indicative of a File Replication Service (FRS) issue. Note that Event 1000 is associated with a wide variety of descriptions. In this case it's a Userenv event with the error message "The Group Policy client-side extension Security was passed flags (17) and returned a failure status code of (3)." It's also accompanied by Scecli event 1001 with the message "Security policy cannot be propagated. Cannot access the template. Error code = 3."

FRS replication is probably not working. FRS is one of the biggest problem areas in the original release of Win2K, but has been improved in Service Pack 2. It's responsible for, among other things, replicating Group Policy templates (and changes) to all DCs. When changes are made to a GPO and saved, the changed file is copied to the %systemroot%\sysvol\staging\domain and %systemroot%\sysvol\staging areas\compaq.com directories (note that this isn't the sysvol share). The screens in Figure 2 show the result of making changes to a GPO. The file name is NTFRS_CMP_ and is put in both directories.

Figure 2. The two default directories to which changes in GPOs are replicated.

The DC then notifies its partners, which pull it and notify their partners, and so on. These files shouldn't stay in the staging folders longer than about 10 minutes. This happens for every change and for DFS changes as well.

To resolve this problem, back up the group policy files from %systemroot%\sysvol\sysvol\compaq.com\policies. A simple copy to another directory or a network share is fine. You'll be glad you did! Figure 3 shows the Sysvol directory structure. Note that the policies are listed by GUID and exist in the \winnt\sysvol and \winnt\sysvol\sysvol directories. The GPOs in \winnt\sysvol\sysvol\policies are the ones that get edited via the policy editor and are replicated. The gpotool.exe output, gpotool.log, provides a nice mapping of policy name to GUID as shown in Listing 1. Note the policy GUID at the top of the section and the "Friendly name" below it.

Listing 1. This log, created by gpotool.exe, maps the policy name to the GUID.

Policy {168F03D2-9E17-443F-9AE5-7BE43A5FA453}
Policy OK
Details:
DC: mytest.net
Friendly name: New Group Policy
Object Created: 11/15/2000 5:47:38 PM
Changed: 3/16/2001 7:18:45 PM
DS version: 0(user) 2(machine)
Sysvol version: 0(user) 2(machine)
Flags: 0
User extensions: not found
Machine extensions: [{C6DC5466-785A-11D2-84D0-00C04FB169F7}
{942A8E4F-A261-11D1-A760-00C04FB9603F}]
Functionality version: 2
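
If you have dozens of policies, reading the log by eye gets tedious. As a rough sketch, the Python below pulls the GUID-to-name mapping out of text laid out like Listing 1. It assumes every policy section starts with a "Policy {GUID}" line and contains a "Friendly name:" line, as the listing does; the function name and regular expressions are ours, and a real gpotool.log may contain variations this doesn't handle.

```python
import re

# Matches "Policy {GUID}" but not status lines such as "Policy OK".
POLICY_LINE = re.compile(r"^Policy\s+(\{[0-9A-Fa-f-]+\})\s*$")
NAME_LINE = re.compile(r"^Friendly name:\s*(.+?)\s*$")


def map_policy_names(log_text):
    """Return a {GUID: friendly name} dictionary from gpotool-style log text."""
    names = {}
    current_guid = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        policy = POLICY_LINE.match(line)
        if policy:
            current_guid = policy.group(1)
            continue
        name = NAME_LINE.match(line)
        if name and current_guid:
            names[current_guid] = name.group(1)
    return names


# The first lines of Listing 1, used here as sample input.
sample = """Policy {168F03D2-9E17-443F-9AE5-7BE43A5FA453}
Policy OK
Details:
DC: mytest.net
Friendly name: New Group Policy
"""

for guid, friendly in map_policy_names(sample).items():
    print(guid, "=", friendly)
```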

In diagnosing FRS problems, it's critical to install Service Pack 2. If you can't install SP2, install SP1 and hotfix Q272567. If you can't install SP1, just install the hotfix. The hotfix can be installed pre- or post-SP1 and is incorporated in SP2. You must minimally install the hotfix or you may never get to the bottom of your FRS problems.

Other matters to consider:

Resolve any AD replication problems. FRS depends on AD replication, so if AD is broken, FRS won't work either.
Stopping and restarting the File Replication Service on each DC may fix the problem (watch the staging areas—there will be a visible reduction in size).

If these tasks don't fix the problem, follow this procedure, which uses information from KB Q257338, "Troubleshooting Missing SYSVOL and NETLOGON Shares on Windows 2000 Domain Controllers," and our experience:

Stop FRS service on all DCs.
Navigate to the Registry key HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup and set the BurFlags value to D4 on a source DC. This is usually the PDC emulator.
Set the BurFlags value to D2 on all "satellite" DCs in the domain, as shown in Figure 4.
Start the FRS service on the source DC and one other DC and wait for FRS to synchronize. Repeat for every DC in the domain. You should see the size of the staging directories change, and maybe even increase as the files are moved. As long as they're changing size, FRS is working. Be patient and let FRS work it out.
If absolutely necessary, identify the source DC (the one with the most files in the staging directory) and delete the files from the staging areas on the satellite DCs. Then repeat this procedure, turning on FRS on each DC one at a time, until it's synchronized.

Figure 3. The Sysvol directory structure lists policies by GUID.

Figure 4. Setting this Registry value to D2 can help you get FRS working again.

Problem: I get Event 13557 in the FRS Log: "Duplicate Connection Objects."

Solution: This event, like many in Win2K, has a standard troubleshooting procedure. However, this is a quick fix and may not solve the real problem. While I'm a big fan of the abilities of the KCC, it doesn't do a great job of cleaning up old connection objects. The easy answer is to go to the Sites and Services snap-in, find the server logging these errors, and open the NTDS Settings object. There should only be one inbound connection object from any single DC.

Duplicate connection objects will break FRS and AD replication if left unresolved. It's possible that eventually the KCC will clean them up; if not, you'll need to do it manually. KB article Q251250, "NTFRS Event ID 13557 Is Recorded When Duplicate NTDS Connection Objects Exist," is a good reference, but my experience has taught me to create a prioritized list of methods to correct this problem, starting at the top and moving down.

Remove the duplicates. Simply delete the duplicate objects in the Sites and Services snap-in. If they don't come back, you're done. Figure 5 shows duplicate connection objects on Qtest-MDC1 from Qtest-DC2. In this case, you could simply delete one of them to fix the problem.

Figure 5. To remove a duplicate object, simply delete it from the Sites and Services snap-in.

Figure 6. After deleting the duplicate, make sure you have the KCC recheck the connections.

If you see duplicate connections from several DCs and don't know which ones to delete, you can delete all of the connection objects, then right-click on the NTDS settings object and go to All Tasks | Check Replication Topology. In Figure 6 we deleted the duplicate connections from Qtest-DC2 and are ready to "Check the Replication Topology." This will fire up the KCC and make it re-evaluate the connections for that DC. It will create the connection objects needed.

If the duplicate connections get re-created, you need to find out why. The "why" is most likely a DNS misconfiguration or failure. In one case in Compaq's Qtest forest, we noticed a DC in Europe with 2,100 connection objects, inbound from a single DC. We deleted them, but within a few minutes there were 24 more. We found that a DNS server had its IP address changed, breaking the delegation. We corrected the delegation, deleted all the connections, forced the KCC to check the topology, and the duplicate connections ceased.
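
When a DC has hundreds of inbound connections, spotting the duplicates by eye isn't realistic. The rule itself is simple (one inbound connection object from any single DC), and this small Python sketch applies it to a list of source DC names. How you collect that list is up to you; the input format and function name here are purely hypothetical and don't correspond to the output of any particular tool.

```python
from collections import Counter


def duplicate_sources(inbound_connections):
    """Return {source DC: count} for every DC with more than one inbound connection.

    inbound_connections is a list of source DC names, one entry per inbound
    connection object found under a server's NTDS Settings (hypothetical input).
    """
    counts = Counter(name.lower() for name in inbound_connections)
    return {name: count for name, count in counts.items() if count > 1}


# Example modeled on Figure 5: two connections from Qtest-DC2 on Qtest-MDC1.
print(duplicate_sources(["Qtest-DC2", "Qtest-DC2", "Qtest-DC3"]))
```

Names are compared case-insensitively, so "QTEST-DC2" and "Qtest-DC2" count as the same source.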

Problem: When attempting to log on to a Win2K member server or Win2K Pro workstation using a domain account, the following error message appears: "Error: Trust Relationship between this workstation and the Domain Controller Failed."

Solution: This error is usually caused by the secure channel password for the member server or workstation getting out of sync with the DC, but it could be caused by a time-zone shift between the client and the DC. A typical scenario for this problem would be removing a computer from a Win2K domain, A, and joining it to another domain, B, then later moving it back to the original domain, A. Initially, there's a machine account for this client on the A domain. When it's moved to the B domain, it creates a new account on the B domain and synchs the password with the client. When it's moved back into the A domain, the machine account is still there—it doesn't create a new one—but now the passwords don't match, resulting in the error. I've also seen it caused by moving a computer between time zones and not changing the client's time zone information.

To resolve this problem, delete the client's computer account from domain A and let replication in the site occur, which should take a maximum of five to 10 minutes. Then configure the client to join a workgroup and reboot it. This cleans up all the local machine account information. After the reboot, configure the machine into the domain and reboot again. This will create a new account and synch the passwords with the client. The reboot, which is required anyway, will purge the Kerberos tickets so new ones will be created with the new access information.

If the problem still exists, it could be a timing issue. Go to the client, open a command prompt window, and enter this command:

net time \\domaincontroller /set

where "domaincontroller" is a valid DC name that can be used to synchronize time on the client. Remember that Kerberos requires that the time difference between the two systems be less than five minutes.
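
To see why a wrong time zone setting can trip this limit even when the clock on the wall looks right, consider the following Python sketch. It compares two timestamps after converting them to UTC and applies the five-minute tolerance mentioned above. The function name, the fixed UTC offsets and the sample times are illustrative assumptions; this isn't how Kerberos itself performs the check.

```python
from datetime import datetime, timedelta, timezone

# Kerberos tolerance described in the article: less than five minutes.
MAX_SKEW = timedelta(minutes=5)


def within_kerberos_skew(client_time, dc_time, max_skew=MAX_SKEW):
    """True if two timezone-aware timestamps differ by less than max_skew."""
    return abs(client_time - dc_time) < max_skew


# Illustrative fixed offsets; real zones also have daylight saving rules.
zone_a = timezone(timedelta(hours=-5))
zone_b = timezone(timedelta(hours=-8))

dc_time = datetime(2001, 8, 1, 17, 0, tzinfo=timezone.utc)

# The client's clock reads 12:02 and its zone is set correctly to zone_a.
correct_zone = datetime(2001, 8, 1, 12, 2, tzinfo=zone_a)
# Same clock reading, but the machine was moved and still claims zone_b.
wrong_zone = datetime(2001, 8, 1, 12, 2, tzinfo=zone_b)

print(within_kerberos_skew(correct_zone, dc_time))  # two minutes apart
print(within_kerberos_skew(wrong_zone, dc_time))    # three hours apart
```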

Be Resourceful

Microsoft doesn't want you flying blind when troubleshooting. It offers many useful diagnostic and troubleshooting helpmates. Learn them, then use them. They include Support Tools and Resource Kit utilities. Remember to get verbose output—when troubleshooting, more knowledge is better.

Support Tools is found on the Win2K Server and Advanced Server CDs in \Support\Tools. Just run setup to install them. These tools are lifeblood, so much so that they should be installed on every domain controller (DC).

For general AD diagnostics, netdiag.exe and dcdiag.exe are two of the best. They'll generate netdiag.log and dcdiag.log files, which give great information concerning trusts, DNS, NetBIOS names, TCP/IP details and more.

Nltest.exe is a quick way to return network information such as a computer's site, site coverage and a list of DCs in the domain. You can also use it to query the domain trusts.

When it comes to replication issues, Replication Monitor and repadmin.exe are invaluable tools.

One of the best of all resources is Microsoft itself, especially TechNet (www.microsoft.com/technet). If you can't afford the CD version, go to the Web and search the Knowledge Base at http://search.support.microsoft.com/kb/c.asp.

Problem: I just upgraded my NT 4.0 domain to all Win2K DCs and everything is broken. How can I recover my NT 4.0 domain? (By the way, I didn't remove a BDC before the upgrade as Microsoft recommends, and I have no backup!)

Solution: This scenario describes a call I got from a customer. It's absolutely the coolest thing I've done in Win2K troubleshooting. He had a single NT domain with a PDC and two BDCs. He upgraded the BDC first (don't ask me how), then the PDC. In the meantime, the other BDC had a disk crash. The Win2K domain was broken—no user authentication, no replication, no services. He wanted to recover the NT domain, but had no NT 4.0 machines left and no backup. Fortunately, he'd left it in mixed mode, so he still had a copy of the SAM database. In mixed-mode, you should still be able to add an NT 4.0 BDC and get the NT domain back. Since he was "dead" anyway, we had nothing to lose, so we used the following process and it worked! I've never seen this in any Microsoft document or training course. Here's the process:

Pick the healthiest DC to be used as a source.
Transfer all the FSMO roles to this machine if it isn't the FSMO already.
Turn the other DC off.
Pre-create a computer account for a new NT 4.0 BDC in the AD. This can be done by using Win2K's Server Manager (srvmgr.exe) or with the netdom command. Warning: Don't use NT 4.0's version of srvmgr.exe; it won't work. Win2K's version is built in. To use netdom on a Win2K DC, type:

netdom add bdcname /domain:domainname /dc

where bdcname is the name of the new BDC and domainname is the name of the Win2K domain (such as Compaq.com).

Install a computer (we picked the other Win2K machine we just turned off) as the Windows NT 4.0 BDC and join the Win2K domain (using the NetBIOS name, of course). Once this BDC joins the domain, it will sync with the PDC and get the SAM. Now you have the NT 4.0 domain intact on this BDC. Shut down the Win2K DC, leaving only the NT 4.0 BDC.
Promote the NT 4.0 BDC to PDC.
Reinstall the Win2K DC as an NT 4.0 BDC in the recovered NT 4.0 domain so you're back on solid ground. Add a second BDC for safety, let it sync with the others and pull it offline (which should have been done in the first place).
Now do the migration right. Upgrade the NT 4.0 PDC and create the Win2K domain.
Upgrade the BDC to Win2K as a replica DC in the domain.

It took the customer the better part of a day to do that, but it worked. He recovered all his accounts and completed the Win2K upgrade. Note: If the original Win2K domain (the broken one) had been changed to Native mode, none of this would have worked.

Making Active Directory Happy

The two biggest issues with making sure AD is working properly are DNS and replication. If they work, AD's generally happy. Here are some general replication tips to make sure replication's working:

Get comprehensive replication error listings from all DCs in a domain from Replication Monitor/Action Menu/Domain/Search DCs for Replication Errors.
Get a status report from Replication Monitor. Right click on a server icon and select Generate Status Report.
Run repadmin.exe /showreps to look for errors.
In Sites and Services or Replication Monitor, force replication between two DCs.
Force the KCC to regenerate the topology (Sites and Services or Replication Monitor). Look for failures.
To see if the domain naming context is being replicated, create a test user account on a DC, then force replication to another DC. Look at the Users and Computers snap-in on that DC and see if the test user's there.
To see if the Configuration and Schema naming context is being replicated, create a test site, and force replication, then see if the other DC gets the new site.

Networking

Problem: Why is it when I enter a Route Add command, the route doesn't show up in the RRAS list of static routes?

Solution: There's been quite a lot of confusion about the different ways to define static routes in Win2K Server. It started with the introduction of RRAS in NT 4.0, but it's still in the product today. This issue must be understood before any network troubleshooting takes place.

The problem is that Win2K Server allows for two separate ways of adding routes. The best way is to enter the static routes in RRAS. RRAS is a kernel-mode service with sophisticated routing capabilities. The other way, and the result of the ROUTE ADD command, is to enter the routes as a user-mode function. This routing method stems from NT 3.x days and shouldn't be used if you can avoid it. (Microsoft kept it around to avoid breaking existing scripts that customers might have.)

Figure 7. The typical output of a ROUTE PRINT command.

Figure 8. Persistent routes are automatically established when a system comes online.

Figure 9. The Registry can help confirm the persistent routes in your network.

As Figure 7 shows, there are two interfaces in this system. The default gateway points to 216.82.49.33, and there's an internal card with address 10.0.2.1. The second route states that all 10.0.2.0 traffic is directly available to the internal subnet. For our example of the routing confusion, let's introduce a new internal subnet of 11.11.11.0. The old way of doing this is to issue the following command:

route add 11.11.11.0 mask 255.255.255.0 10.0.2.1 -p

The -p option at the end states that this route's persistent and should always exist when the system comes online. Figure 8 shows the result of this command.

Notice that the persistent route is clearly listed in the routing table near the end. Additionally, it's in the Active Routes list.

Since the route exists in the routing entries list, the network works as expected. In fact, a peek into the registry shows the persistent routes list (just like in NT 4.0). The route's listed as expected, in Figure 9.

We've established that the backward compatibility still exists and works in Win2K Server routing. Now let's move forward.

Win2K Server has two new ways to add static routes that allow the RRAS engine to handle the entries. The first way is to simply use the RRAS snap-in (see Figure 10). This has the advantage of being fairly obvious, but if you have more than just a few entries, this process would be too time-consuming.

Win2K also introduces a powerful command shell called NETSH. If you have a number of static routes and you need to create or modify a batch file, use this command. The equivalent command to the ROUTE ADD command we were using is:

netsh routing ip add persistentroute 11.11.11.0 255.255.255.0 "Private" nhop=10.0.2.1

Here you're defining a persistent route, but you must also define the interface that's handling this route and the next hop address. On this server, the internal address is named Private (Network Places | Properties | Interfaces). Because this route is being handled directly by this server instead of passing it off to another router, our next hop is the same interface. Figures 11 and 12 show the ROUTE PRINT result from this command.

As you can see, neither of the backward compatibility areas contain the new route that we've just added. The ROUTE PRINT command lists it in the routing entries, but doesn't know that it's a persistent route. RRAS, however, does (see Figure 13).

As you can imagine, this can cause confusion. If you manage servers performing routing functions and you're using static routes, I'd recommend changing from the user-mode ROUTE ADD command to using RRAS routing. The server will be able to handle more traffic with better performance; all your routing information will be in a unified location; and the router will have more flexibility in the RRAS environment.

When troubleshooting any network or routing issues, it's important to discover the complete picture of the routes applied to a server to fully understand the network details. Make sure that you look in both RRAS and ROUTE PRINT or the Registry list.
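
One way to build that complete picture is to merge what you find in both places into a single list and ask which entry would handle a given destination. The Python sketch below does this for the example server, using the standard ipaddress module and a longest-prefix match. The route lists are typed in by hand for illustration; the /24 mask on 10.0.2.0, the dictionary layout and the function name are our assumptions, and nothing here reads a live routing table.

```python
import ipaddress


def best_route(destination, routes):
    """Return the most specific (longest-prefix) route containing destination."""
    address = ipaddress.ip_address(destination)
    matches = [r for r in routes if address in ipaddress.ip_network(r["network"])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r["network"]).prefixlen)


# Routes as they might be noted down from ROUTE PRINT (mask on 10.0.2.0 assumed).
route_print = [
    {"network": "0.0.0.0/0", "gateway": "216.82.49.33", "source": "ROUTE PRINT"},
    {"network": "10.0.2.0/24", "gateway": "10.0.2.1", "source": "ROUTE PRINT"},
]
# Static route as it might be noted down from the RRAS snap-in.
rras_static = [
    {"network": "11.11.11.0/24", "gateway": "10.0.2.1", "source": "RRAS"},
]

all_routes = route_print + rras_static
for destination in ("11.11.11.7", "10.0.2.50", "192.0.2.10"):
    route = best_route(destination, all_routes)
    print(destination, "->", route["gateway"], "defined in", route["source"])
```

Recording the source alongside each route tells you where to go when an entry needs changing.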

Figure 10. You can add static routes through the GUI shown here, but for more than a few entries, using a command-line utility is better.

Figure 11. This ROUTE PRINT window doesn't show that the just-added 11.11.11.0 route is persistent...

Figure 12. ...Nor does the Registry.

Figure 13. The route is listed in RRAS, however.

Windows 2000—Built on the Rock of DNS

Keep in mind that DNS is the foundation for Windows 2000, especially when you're troubleshooting Win2K. DNS will touch all aspects of the infrastructure. Make sure it's working and error-free before digging any deeper into a problem. An entire article could be written on DNS troubleshooting alone, but here are some basics.

Design the DNS structure. Get help if you don't know how.
- Keep it simple. Unless you have some very slow links to sites, we usually recommend three name servers per domain. You may want more at remote (slow link) sites.
- Work out interoperability with your corporate root name server. There are a number of options here, and Win2K DNS will play nicely with BIND servers if you do it right.
Make sure the DNS server and zone configurations are correct, with delegations, forwarding and name server lists pointing to the right IP addresses.
Make sure DC names and domain names are resolved correctly.
Make sure client DNS configuration is pointing to the right name servers. Assuming a Win2K DNS name server is hosting the Win2K domain:

DNS servers' TCP/IP properties should point to themselves for preferred DNS and to the other name servers in the domain as "additional" DNS servers.
DNS servers at the Win2K root domain should forward to the name servers registered on the Internet for Internet access. This could be a company-owned or ISP-owned server.
Clients should point to the Win2K DNS servers authoritative for their domain. Order them with the "closest" servers highest in the list.
Watch the DNS event logs for errors, but note that DNS errors will occur in the Directory Services and System logs as well.

Cluster Troubleshooting

In troubleshooting cluster problems, a number of fundamental proactive and reactive tasks apply in almost all cases.

Proactive Tasks
Get to know your cluster. It's hard to zero in on a problem when you don't have a feel for how your cluster behaves when healthy. To do this, make sure cluster logging is enabled—you can't troubleshoot a cluster problem otherwise. It's enabled by default in Win2K; but if you're running NT 4.0 Enterprise Edition, refer to KB Q168801, "How to Enable Cluster Logging in Microsoft Cluster Server," to turn on cluster logging.

Next, get familiar with the content of the log file. Because the content of the log file is verbose and cryptic, it's often hard to determine if a message is benign or malignant. Therefore, it's good practice to periodically save a copy of the cluster log file on all cluster servers. This can be used as a reference to compare against, once you experience a problem. You should also save a copy after you make changes to your cluster configuration. The cluster log will look very different before and after you've clustered SQL Server 2000!
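Saving reference copies is easy to script. Here's a rough sketch in Python of one way to do it, assuming you pass in the log location yourself. The helper name `snapshot_log`, the label and the archive folder are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def snapshot_log(log_path, archive_dir, label="baseline"):
    """Copy the cluster log into archive_dir under a timestamped name."""
    src = Path(log_path)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Local date and time of the snapshot, e.g. 20010801-093000
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{label}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves the file's modification time
    return dest
```

On a typical Win2K node the call might look like `snapshot_log(r"C:\WINNT\Cluster\cluster.log", r"D:\ClusterLogArchive", label="pre-sql2000")`. Both paths are assumptions, so confirm where your cluster log actually lives before scheduling anything. Run it on every node, and again after each configuration change.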

Then remember the adage "When it rains, it pours." Since there's a good chance that next time you experience a cluster problem you'll also experience other problems, download and print out some good troubleshooting documentation, including:

Windows 2000 Server Resource Kit's chapter 20, "Interpreting the Cluster Log."
KB Q286052, "The Meaning of State Codes in the Cluster Log."
If you're running Windows NT 4.0 with the Option Pack, I also recommend Microsoft's white paper "Installing the Windows NT Option Pack on MS Cluster Server (MSCS)."
KB Q191138, "How to Install the NTOP on Cluster Server."
KB Q223258, "How to Install the NTOP on MSCS 1.0 with SQL Server 6.5 or 7.0."

Something else you can do is upgrade to Win2K. Clustering is a lot more finicky on NT 4.0 than on Win2K, mostly because Windows NT 4.0 has the Option Pack. Then do the same for your cluster-aware BackOffice products. You can cluster SQL 2000 more reliably than SQL 6.5 or SQL 7.0!

Finally, I can't overemphasize the need for a good backup. Make backups and once in a while test your recovery procedures.

Reactive Tasks
Isolate the errors in the cluster log by comparing what's normal from your saved cluster log with the events logged during the problem time. You might need to look at the cluster log on all servers in the cluster. Remember that the cluster log timestamp is GMT, so you need to calculate GMT based on your time zone setting. Once you've identified a problem area, cross-reference with the event log. Remember that those are in local time, not GMT! In Win2K you only need to look at the event log from one server since it's replicated among cluster servers.
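The GMT arithmetic is easy to get wrong at 3 a.m., so it may be worth scripting. The sketch below, in Python, converts a cluster log timestamp to local time for a fixed offset from GMT. The timestamp layout (YYYY/MM/DD-HH:MM:SS.mmm) is an assumption about how your cluster log writes its entries, and `cluster_time_to_local` is a name we made up; adjust the format string if your log differs.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout for a cluster log entry: YYYY/MM/DD-HH:MM:SS.mmm
CLUSTER_LOG_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def cluster_time_to_local(stamp, utc_offset_hours):
    """Convert a GMT cluster log timestamp to local time at a fixed offset."""
    gmt = datetime.strptime(stamp, CLUSTER_LOG_FORMAT).replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return gmt.astimezone(local_zone)
```

For a server four hours behind GMT, `cluster_time_to_local("2001/07/15-14:32:10.123", -4)` gives 10:32:10 local time on the same day, which is the time to look for in the event log. Remember that the offset changes with daylight saving time, so use the offset that was in effect when the problem occurred.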

If you don't understand an error code, use the "Net HelpMsg" command to try to get a better description of the error. Also use the Knowledge Base whenever possible.
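If you look up codes often, a small wrapper can save typing. This is only a sketch in Python: it assumes the code you pulled from the log is a Win32 error number, written either in decimal or in hex with a 0x prefix, and the helper names `helpmsg_command` and `describe_error` are hypothetical. Net HelpMsg itself expects the number in decimal.

```python
import subprocess


def helpmsg_command(code):
    """Build the `net helpmsg` command line for an error code.

    Accepts an int, a decimal string such as "5", or a hex string such
    as "0x5". The number is always passed to the command in decimal.
    """
    text = str(code).strip().lower()
    number = int(text, 16) if text.startswith("0x") else int(text, 10)
    return ["net", "helpmsg", str(number)]


def describe_error(code):
    """Run `net helpmsg` (Windows only) and return its text output."""
    result = subprocess.run(helpmsg_command(code), capture_output=True, text=True)
    return result.stdout.strip()
```

Calling `describe_error("0x5")` on a Windows machine runs `net helpmsg 5` and returns whatever text the command prints. Not every code in the cluster log is a Win32 error, so treat an unhelpful answer as a cue to search the Knowledge Base instead.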

This last bit of advice might come as a shock. Most likely, your No. 1 requirement is to get the cluster and application working as fast as possible. More important, though, is understanding the root cause of the problem so you can prevent it from occurring again. Once you know what caused the problem, consider all the options: you can attempt to fix the problem or you can re-install. I've found that very often it's faster to re-install a server or a cluster than to fix a complex problem. This option is often overlooked until many hours have been spent fighting a complex problem. The solution path you take will depend on the clustered application.

Troubleshooting Disk Problems
So, what to do if the problem isn't the cluster, but the disk? If your disk problem occurs right after you installed your cluster, it's probably a misconfiguration. Backtrack and verify the integrity of your shared I/O subsystem without clustering. In general, it's easier to troubleshoot standalone systems than clustered servers. Don't hesitate to stress-test your disks before you cluster—SCSI termination problems can hide when doing casual checks. The simplest stress test might be a full (not fast) format of the disk.

If, however, your disk problem occurs on a mature cluster, it's most likely caused by hardware failure. Since disk handling is very different in Win2K than NT 4.0, make sure you follow the procedure for the right version of Windows.

For NT 4.0 check out:

KB Q217224, "How to Replace a Clustered Disk in Windows NT 4.0 Enterprise."
KB Q243195, "Event ID 1034 for MSCS Shared Disk After Disk Replacement" (which explains how to fix the disk signature).

For Win2K check:

KB Q280425, "Recovering from an Event ID 1034 on a Server Cluster." (The DumpCFG utility can be found in the Resource Kit.)
KB Q217224, "How to Replace a Clustered Disk in Windows NT 4.0 Enterprise."
KB Q243195, "Event ID 1034 for MSCS Shared Disk After Disk Replacement." (This explains how to fix the disk signature.)

It's likely you'll need to disable clustering temporarily and access your disk directly. Remember this: Once you disable cluster service and the cluster disk driver, make sure that you never boot more than one server at a time or you will corrupt your shared disks!

To access disks without cluster software involvement temporarily:

Shut down and power off Server B.
Follow one of these routes:

If you're running NT 4.0:

On Server A, from Control Panel | Services change the startup of the Cluster Server service from Automatic to Disabled. To do this, highlight the Cluster Server service, and select Startup. Note: Don't stop the Cluster Server service.
From Control Panel | Devices, change the startup of the Cluster Disk device from System to Disabled. To do this, highlight the Cluster Disk device, and select Startup. Note: Don't stop the Cluster Disk device.

If you're running Win2K:

On Server A, right-click on My Computer, then select Manage. The Computer Management (Local) snap-in comes up.
At the bottom, expand "Services and Applications" and select Services. Right-click on Cluster Service and select Properties. In the Startup type box, click the dropdown arrow and select Disabled. Then select OK to go back to Computer Management. Note: Don't stop the Cluster Service.
At the top, select System Tools and highlight Device Manager. The visible devices appear in the results pane. On the toolbar, select View and click on Show Hidden Devices. A Non-Plug and Play Drivers option will appear in the results pane. Expand that. Right-click on Cluster Disk Driver and select Properties, then click on the Driver tab. The Startup box will be at the bottom. Click on the options dropdown arrow and select Disabled. Then select OK to return to Computer Management. Note: Don't attempt to stop the Cluster Disk device.
In the results pane, right-click on Cluster Network Driver, select Properties as before, and select the Driver tab. In the Startup box, select Disabled, then select OK. Note: Don't stop the Cluster Network device.

Finally, reboot Server A.
After the reboot, verify via the proper disk administration utility your access to the shared storage devices. The shared disks should show up as available and online. If you still have disk problems, it wasn't a cluster problem. If you need to format a disk, do it now. If you need to set the disk signature (in Win2K), do it now. If you want to perform some I/O test, do it now. If you need to restore some data, you can also do it now.
When you're finished working with the disks in non-clustered mode, on Server A, follow one of these paths:

For NT 4.0:

From Control Panel | Services, change the startup of the cluster service from Disabled to Automatic.
From Control Panel | Devices, change the startup of Cluster Disk device from Disabled to System.

For Win2K:

In Computer Management, under Services, reset the Cluster Service to Automatic.
In Computer Management, under Device Manager, display the hidden devices and reset the Cluster Disk device to System.
In Computer Management, under Device Manager, reset the Cluster Network to System.
Reboot Server A, then restart Server B.

How To Become an Expert Troubleshooter

Let's review some of the basic troubleshooting steps we use when dealing with a Windows 2000 problem at a client site.

Gather all the information about the problem from the person experiencing it. (This will be easier for you than for us because you know your environment.)
To start, ask some probing questions: "What was the exact error message?" Get the user to e-mail a screenshot, if necessary. "What were you doing when it happened?" Get the account and computer used, applications running, and so on. "Have you seen this before?" If so, get exact details of the previous incident. "Was it working before?" "Is there anything else that doesn't work?" "What changed prior to this problem in your environment?" Something had to change if it just "quit working."
Next, remember that logs are your troubleshooting friends. Get event logs, from both the client and server. You'd be surprised how many customers call us before ever looking at the event logs or getting exact error messages. Several Registry settings permit you to dump verbose output to the event logs for a variety of things such as replication, name resolution and Group Policy application. See Microsoft Knowledge Base articles Q220940, "How to enable diagnostic event logging for Active Directory services," and Q186454, "How to enable user environment event logging in Windows 2000."
On the same topic, don't get just any logs—get relevant logs such as dcpromo.log, userenv.log, startup.log and netlogon.log. Win2K has provided improved troubleshooting capabilities with these logs, so use them. If you haven't discovered the userenv.log, see Q221833, "How to enable user environment debug logging in retail builds of Windows 2000."
Check out network connectivity. Make sure everyone can talk to everyone else. If not, find out if others on different subnets or remote sites are experiencing the same problem or if it's isolated to a particular site. Determine if it can be reproduced elsewhere.
Next, check Group Policies in Win2K. They're complicated, to put it mildly, and can cause a host of problems. This touches network and domain security, desktop environments, account authentication and software installation.