File Replication Service: User-Friendly at Last
The File Replication Service (FRS) has a justifiably bad reputation for bugginess and indecipherable logs. But recent changes from Redmond make it worth another look.
The File Replication Service (FRS) is a mystery to most Windows administrators I’ve talked to. And it’s no surprise why: It’s not properly covered in training courses; it has debug logs that are nearly impossible to interpret; and in the early days of Windows 2000, FRS was unreliable and error-prone. As a workaround, some companies disabled the FRS service and used RoboCopy instead.
A couple of years ago, Microsoft provided a series of Perl scripts and
utilities (Topchk.cmd, Iologsum.cmd, Connstat.cmd and List.exe) that provided
much-needed parsing for the logs produced by the NTFRSUTL.exe command-line
tool. That helped: the logs could now be formatted. But you still needed a
Ph.D. in FRS to interpret them. If you’ve ever tried to fix a null server
reference object, figure out why a domain controller is stuck in a vvjoin,
or repair a file with an invalid parent GUID, you know what I mean.
How do you interpret information from those logs to solve Event Log errors? The answer, of course, was to call technical support and spend money and time getting help. Recently, though, Microsoft has empowered us to do a higher level of troubleshooting through the release of several powerful tools and an incredible help file.
In the next few pages, we’ll explore some FRS basics, review the top
FRS issues identified by Microsoft, and cover how they’re addressed in
Win2K Service Packs and Windows Server 2003. Finally, we’ll take a closer
look at powerful tools like Sonar, Ultrasound, FRSDiag, and the Ultrasound.chm
help file.
A Brief FRS Overview
FRS was implemented in Win2K to replicate the contents of GPOs and scripts.
It’s also used by the Distributed File System (DFS) for data synchronization
between assigned members in a replica set.
FRS communicates with replication partners to determine when changes
are made to the replica set (Sysvol or DFS) and then replicates that data
to all downstream partners. It’s a multi-threaded, multi-master replication
engine. FRS relies on Active Directory for its replication topology (NTDS
connection objects) and specific replica set information, such as partners.
FRS depends on AD objects and AD replication, which in turn depend
on network connectivity, DNS and Remote Procedure Calls (RPC). This is vital
to remember when troubleshooting.
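Because FRS sits at the top of that dependency chain, it can help to rule out the lower layers first. What follows is a minimal Python sketch, not a Microsoft tool: it checks that a partner's name resolves and that TCP port 135 (the RPC endpoint mapper) accepts a connection. The function names are hypothetical, and a passing check doesn't prove that the dynamically assigned RPC ports FRS actually uses are reachable.

```python
import socket

def resolves(hostname):
    """Return True if the local resolver (normally DNS) can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_partner(hostname, rpc_port=135):
    """Check the layers below FRS, lowest first, and report the first failure."""
    if not resolves(hostname):
        return "name lookup failed"
    if not tcp_port_open(hostname, rpc_port):
        return "RPC endpoint mapper unreachable"
    return "ok"
```

Calling check_partner with the name of a replication partner returns "ok" only when both layers respond; any other result points to a problem beneath FRS that should be fixed first.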
Common FRS Problems and Solutions
Before you start troubleshooting FRS, make sure you have the latest service
pack plus any FRS-specific hotfixes. Let’s look at some of the common
problems and how Microsoft has solved them, or at least made FRS more
tolerant.
Junction Points
Also referred to as reparse points, directory junctions and volume mount
points, a junction point is a physical location on a hard disk that points
to another location on a disk or storage device. Think of junction points
as links in the file system, sort of a tunnel that binds two ends into
one; it connects two locations on the disk to each other.
Removal of a junction point will cause FRS replication to fail. Likewise,
copying the junction point will create another Sysvol tree.
Morphed Directories
Morphed Directories and files have been replicated to a target that already
has an exact copy of them. FRS can’t tell which one is most recent, so
it creates a duplicate copy, referred to as a “morph.” These duplicate
directories or files are renamed by appending _NTFRS_xxxxxxxx to the name,
where “xxxxxxxx” is a random string of eight hexadecimal digits. This
usually occurs if an Authoritative Restore (discussed later) takes place,
forcing an entire Sysvol tree to multiple replica set members at the same
time. The administrator must decide which is the newest, most correct
version to keep. If it’s the morphed version, delete the original and
rename the morphed folder by removing the _NTFRS_xxxxxxxx suffix. If it’s
the original, delete the morphed version. Morphed directory contents aren’t
replicated, so if the morph holds the more recent data, you may lose changes
until the conflict is resolved. For more information, see Knowledge Base
328492, “Folder Name Is Changed to FolderName_NTFRS_.”
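Finding morphs by hand in a large Sysvol or DFS tree is tedious, so a small script can do the looking. The Python sketch below assumes the naming shown in the KB title above (the folder name, then _NTFRS_, then eight hexadecimal digits). The function name is hypothetical, and the script only reports directories; it deletes and renames nothing, since that decision belongs to the administrator.

```python
import os
import re

# Assumed naming: <original name>_NTFRS_<eight hexadecimal digits>
MORPH_RE = re.compile(r"^(?P<original>.+)_NTFRS_[0-9a-fA-F]{8}$")

def find_morphs(root):
    """Yield (morphed_path, original_path, original_exists) for each
    morphed directory found anywhere under root."""
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            match = MORPH_RE.match(name)
            if match:
                original = os.path.join(dirpath, match.group("original"))
                yield (os.path.join(dirpath, name), original,
                       os.path.isdir(original))
```

Each result pairs a morphed folder with the folder it collided with and says whether that original still exists, which is the information needed to decide which copy to keep.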
Parallel Version Vector Joins
When a new DC joins the domain, a “version vector” is created and distributed
from the new DC to each of the other DCs in the domain, to make sure each
of the replication partners has the right version of the Sysvol data.
In Win2K, this process caused a lot of grief because it pulled the entire
Sysvol tree from every DC in the domain at the same time, in parallel.
This caused problems not only in network performance but in DC performance,
since it has the potential for taking a DC offline during the process.
Windows 2003 and Win2K SP3 have corrected this by making it a serialized
process. The new DC will do a Version Vector Join (Vvjoin) during promotion;
then, after completion, it will contact other DCs in the domain, one at
a time, for changes. If the source DC is up to date, the Vvjoin is still
done to the others, but no replication takes place.
Staging Area Problems
This is an oldie but a goodie, yet many administrators still aren’t
aware of this important issue. Changes made to files in Sysvol are
copied to temporary files in two staging directories:
%systemroot%\sysvol\staging\domain and %systemroot%\sysvol\staging areas\.
The files stay there until all downstream partners have pulled them.
But some programs that scan the files, such as anti-virus and defragmenter programs, modify the security descriptors of the files. This forces a change order, causing all files in the Sysvol tree to be copied to the two staging directories. Setting File System Policy in a Group Policy to apply to the Sysvol tree does the same thing. Prior to SP3 and Windows 2003, this resulted in huge numbers of files being dumped into the staging directories, exceeding the 660MB limit and causing FRS replication to stop. There’s a Registry key to increase this limit, but that’s just to give you some breathing room until you can resolve the problem (see KB 264822, “File Replication Service Stops Responding When Staging Area is Full”).
Note: Most antivirus vendors now have FRS-friendly versions
of their products. If you ask and they don’t know whether it’s FRS compatible,
find another vendor: This is a well-known problem and they should have
a solution. For more information, see KB 815263, “Antivirus, Backup, and
Disk Optimization Programs That Are Compatible with the File Replication
Service”.
Microsoft’s made improvements on this issue in Win2K SP3 and Windows 2003 in two ways.
1. Reduction of excessive FRS replication (see KB 811370, “Issues That Are Fixed in the Post-Service Pack 3 Release of Ntfrs.exe”). FRS detects these unnecessary updates to the files (presumably based on frequency) and suppresses the updates. The administrator is notified with event ID 13567 in the NTFRS event log. This was available as a Win2K post-SP3 hotfix (811370) as well as Windows 2003. It’s described in KB 315045, “FRS Event 13567 Is Recorded in the File Replication Service Event Log After You Install Service Pack 3.”
2. Replication isn’t stopped if the staging directory is filled (see
KB 307319, “Changes to the File Replication Service”). In Win2K SP3 and
Windows 2003, when the staging directory reaches 90 percent capacity,
the oldest files are deleted until it’s reduced to 60 percent, thus preventing
replication from stopping and taking the DC offline. Note that this isn’t
a fix; the fix is to find out what’s causing the huge volume of files
to be dumped into the staging area.
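The 90/60 behavior is easier to reason about with a small model. The Python sketch below only simulates the policy as described above; it isn't how Ntfrs.exe is implemented, and the function name, the tuple layout and the use of a timestamp to decide which file is "oldest" are all assumptions made for illustration.

```python
def purge_staging(files, limit_bytes, high_pct=90, low_pct=60):
    """Simulate the staging cleanup described in the text.

    files: list of (name, size_bytes, timestamp) tuples.
    Returns (kept, removed). Nothing is removed until usage reaches
    high_pct of limit_bytes; then the oldest files are removed until
    usage is no more than low_pct of limit_bytes.
    """
    used = sum(size for _name, size, _ts in files)
    # Integer arithmetic avoids floating-point surprises at the thresholds.
    if used * 100 < high_pct * limit_bytes:
        return list(files), []
    kept = sorted(files, key=lambda f: f[2])  # oldest first
    removed = []
    while kept and used * 100 > low_pct * limit_bytes:
        victim = kept.pop(0)
        removed.append(victim)
        used -= victim[1]
    return kept, removed
```

Running the model against a list of real staging file sizes with limit_bytes set to the 660MB limit mentioned earlier gives a rough idea of how much staged data would be discarded in one cleanup pass, and why the underlying cause still needs to be found.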
Journal Wrap
A journal wrap occurs when the NTFS Change Journal, which FRS uses to
identify changes made to Sysvol data, fills up and overwrites entries
before FRS has processed them. The journal was increased to 128MB in Win2K
SP3 and 512MB in SP4, a dramatic increase over the Win2K RTM limit of just
32MB. This should significantly reduce the opportunity for experiencing
journal wrap errors and the resulting non-authoritative restore.
Authoritative and Non-Authoritative Restore
Authoritative and Non-Authoritative Restore in FRS aren’t related to authoritative
and non-authoritative restore in AD. In FRS-speak, these terms refer to
a restore of the Sysvol tree only. They use a Registry key—BurFlags (for
backup and restore flags)—to modify FRS behavior. Located at:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup
the BurFlags DWORD value is set to trigger FRS replication. Setting it to D2 on an out-of-date DC performs a non-authoritative restore from an up-to-date partner. Setting a source to D4 and all other satellite DCs to D2 forces the satellites to pull from the source, causing a full synchronization among DCs.
Warning: Be very careful with either type of restore, as improper action will be dangerous to your DCs’ health (and yours too, if you take down the domain). Always find the root cause before proceeding with this process.
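To make the two settings concrete, here is a small Python sketch that builds, but deliberately doesn't run, the reg.exe command for each restore type. The function and dictionary names are hypothetical. The registry path is the one given above, written with the HKLM abbreviation that reg.exe accepts. The value is passed in decimal (D2 hex is 210, D4 hex is 212).

```python
# Registry path from the text, using reg.exe's HKLM abbreviation.
BURFLAGS_KEY = (r"HKLM\System\CurrentControlSet\Services\NtFrs"
                r"\Parameters\Backup/Restore\Process at Startup")

# BurFlags values described in the text.
RESTORE_MODES = {
    "non-authoritative": 0xD2,
    "authoritative": 0xD4,
}

def burflags_command(mode):
    """Return the reg.exe argument list that would set BurFlags for the
    given restore mode. The command is only built, never executed."""
    value = RESTORE_MODES[mode]
    return ["reg", "add", BURFLAGS_KEY, "/v", "BurFlags",
            "/t", "REG_DWORD", "/d", str(value), "/f"]
```

Printing the list returned by burflags_command("non-authoritative") shows exactly what would be written, which makes it easy to review before anyone touches a DC. As the key name suggests, FRS reads the value when the service starts, so nothing happens until the service is restarted.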
Authoritative Restore. Authoritative restore, sometimes referred
to as “D4” because of the BurFlags setting used to enable it, uses a “big
hammer” approach to getting Sysvol on all DCs in sync with a single source.
Though Microsoft now says D4 was never intended to be a “silver bullet”
solution to FRS issues, it was used extensively during the days when anti-virus
products were first found to be filling the staging areas. Today there
probably aren’t a lot of valid reasons to do an authoritative restore.
Authoritative restore leaves the file structure intact and simply backs up and restores Sysvol data. It assumes that all DCs in the domain hold corrupt or incomplete copies of the Sysvol tree and that the NTFRS database is corrupt. The cause of that corruption needs to be investigated and resolved to prevent the situation from recurring.
Non-Authoritative Restore. Sometimes called “D2,” non-authoritative
restore is the “little hammer” approach. Unlike the authoritative restore
that syncs all DCs to a common source, non-authoritative restore syncs
one out-of-date DC with an up-to-date source. Thus, only one source and
one satellite are involved. This is less intrusive than the Authoritative
Restore because it can only mess up two DCs, rather than all of them.
Unlike Authoritative restore, there are good reasons for using this.
When a serious FRS error occurs such as a Journal Wrap error, Win2K behavior
is to automatically perform a non-authoritative restore on the DC that
experiences the error. Since this takes both DCs offline for a time, Windows
2003 doesn’t do this automatically. Instead, it flags the condition with
an event ID 13568 to allow the administrator to perform this at a convenient
time.
Diagnosis and Troubleshooting
There are a couple of ways to test the overall health of FRS. A good way
to see who’s replicating to whom is to create an empty text file, name
it after the DC it’s on (e.g., dc1.txt) and place it in the %systemroot%\sysvol\sysvol
directory. Do this on every DC in the domain, then wait for end-to-end
replication to occur. Every DC should have a text file from every other
DC. For instance, if there are four DCs in the domain, DC1, DC2, DC3,
and DC4, you would create dc1.txt on DC1, dc2.txt on DC2, and so on. After
replication, each DC should have dc1.txt, dc2.txt, dc3.txt, and dc4.txt.
If DC4 is missing dc1.txt, there’s an inbound replication problem from
DC1 to DC4.
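With more than a handful of DCs, comparing the file lists by eye gets error-prone. The Python sketch below automates the comparison under the naming convention just described (dc1.txt on DC1, and so on). The function name and the shape of the input are assumptions; how the per-DC file lists are gathered is left to you, for example by listing each DC's Sysvol share.

```python
def missing_inbound(dcs, files_seen):
    """Report replication gaps from the marker-file test.

    dcs: list of DC names, e.g. ["dc1", "dc2"].
    files_seen: dict mapping each DC name to the set of marker files
        found in its Sysvol folder.
    Returns a sorted list of (source, destination) pairs in which the
    destination is missing the source's marker file.
    """
    problems = []
    for dest in dcs:
        # Compare case-insensitively, as Windows file names are.
        seen = {name.lower() for name in files_seen.get(dest, set())}
        for source in dcs:
            if source != dest and f"{source.lower()}.txt" not in seen:
                problems.append((source, dest))
    return sorted(problems)
```

Each pair in the result names a source DC whose file never arrived at the destination, which is where to start looking for the inbound replication problem.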
There are a variety of ways to collect logs on suspect DCs: the NTFRS_xxxxxx.log files in %systemroot%\debug; those generated by NTFRSUTL.exe; and the Event Logs. The problem is interpreting them. It takes experience and in-depth knowledge to apply that information and resolve the problem. Microsoft now provides four powerful tools to help the average admin diagnose and troubleshoot FRS problems:
Sonar. This tool monitors FRS data such as file backlog, errors,
missing Sysvol shares, and so on for all DCs in the domain (see Figure
1). Findings are presented in a table format with options for refresh
frequency and categories such as replication status.
Figure 1. The Sonar troubleshooting tool monitors FRS data such as file
backlog, errors and missing Sysvol shares.
Ultrasound. This tool goes beyond Sonar. It hooks to a SQL database
(Microsoft SQL Server Desktop Engine will work) and provides historical
data. It also has a feature that can send e-mail in the event of a failure,
and other goodies.
FRSDiag.exe. As shown in Figure 2, it allows you to click check
boxes for the types of data you want, then runs the appropriate utility
to get it. It’s like customizable MPS Reports in that regard. It also
produces an FRSDiag.txt file, similar to the DCDiag.exe tool used for
AD diagnostics.
Figure 2. The FRSDiag.exe tool lets you customize the report data from
a variety of sources.
Ultrasound Help File. Simple, yet perhaps the most powerful of
all the tools, this file is powerful because Microsoft’s channeled its
experience and knowledge into providing descriptions, causes and solutions
to errors and problem conditions. It also contains FRS operation basics,
terminology and information about the previously discussed tools.
The Ultrasound Help File thus becomes a desktop reference for all FRS events, errors and problem conditions. It’s extremely powerful in helping resolve FRS issues without involving tech support. Figure 3 shows one of my favorites: the Event ID list. All FRS-related event IDs are in the left pane. In this example I selected Event 13568, the Journal Wrap error. The right pane shows the description and the resolution. No searching the Microsoft Knowledge Base or Google. It’s right there.
Figure 3. The Ultrasound Help File is one of the best new things about
FRS. It's comprehensive and easy to understand.
Another powerful feature of the Help File is the FRS Troubleshooting
section. Figure 4 shows a table that explains how to interpret the event IDs.
Note how it has key phrases like “Servers Missing Inbound Connections”
and provides details on how to troubleshoot this error. Thus you can take
information gleaned from FRSDiag.exe and look it up here. This file is
available as a separate download at www.microsoft.com/downloads.
Click on the FRS Monitoring Help File link.
Figure 4. The Ultrasound Help File at work. Here it not only shows the
problem ("Servers Missing Inbound Connections") but gives possible causes.
Another helpful document is the “FRS Technical Reference” found at www.microsoft.com/technet,
which contains much of the Help File contents.
Give FRS Another Chance
FRS is stable and reliable if you’re running at least Win2K SP3 or
Windows 2003. There are fairly sophisticated tools for monitoring and
diagnosis. There are also a lot of useful articles in Microsoft’s Knowledge
Base. If you’ve been bitten in the past by FRS problems, give it another
chance. If you’re using RoboCopy as a substitute, compare it to FRS; you
just might go back.
Special thanks to Chris Jaramillo of HP and Dan Boldo of Microsoft
for their contributions to this article.
This excerpt is from the forthcoming book, Windows 2003 and ProLiant
Servers, by Gary Olsen and Bruce Howard. All rights reserved. Published
with permission from Prentice Hall Professional Technical Reference.
About the Author
Gary is a Solution Architect in Hewlett-Packard's Technology Services organization and lives in Roswell, GA. Gary has worked in the IT industry since 1981 and holds an MS in Computer Aided Manufacturing from Brigham Young University. Gary has authored numerous technical articles for TechTarget (http://searchwindowsserver.techtarget.com), Redmond Magazine (www.redmondmag.com) and TechNet magazine, and has presented numerous times at the HP Technology Forum, TechMentors Conference and at Microsoft TechEd 2011. Gary is a Microsoft MVP for Directory Services and is the founder and President of the Atlanta Active Directory Users Group (http://aadug.org).