Debugging replication errors
Most Active Directory problems are caused, at least in part, by replication failures, and one of the most notorious replication errors is Event ID 1311.
The first step in resolving this replication error is to determine its scope. The easiest way to do this is with the Repadmin /Replsum command, which gives you a complete summary of all the DCs in the forest, including the relevant event ID for any DC in an error state. The general form of the command is this:
Repadmin /Replsum /bysrc /bydest /sort:delta
Here is a sample output of this command. Note that there are four domain controllers failing replication. While the 1311 may not show up in the output of this command, it is commonly paired with the 1722 event (RPC Server Unavailable), which essentially means there is no physical connectivity. If there is no physical connectivity, which implies a network failure, replication isn’t going to happen.
To check physical connectivity, start with the general health of the domain using the Repadmin /Replsum command just described. Then ping the broken DCs by both address and FQDN, and run the NetDiag and DCDiag commands from the command line (with the /v switch on each). This gives you more detail about the errors and may reveal related ones.
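The address and FQDN checks described above can be scripted when several DCs are failing. The following Python sketch is only an illustration under stated assumptions: the host name and address in the usage note are hypothetical placeholders, and instead of an ICMP ping it attempts a TCP connection to port 135 (the Windows RPC endpoint mapper), since the 1722 event concerns RPC. It is not a replacement for NetDiag or DCDiag.

```python
import socket
from typing import Callable, Optional

RPC_ENDPOINT_MAPPER_PORT = 135  # TCP port of the Windows RPC endpoint mapper


def tcp_reachable(host: str, port: int = RPC_ENDPOINT_MAPPER_PORT,
                  timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and lookup errors
        return False


def resolve(fqdn: str) -> Optional[str]:
    """Return the first IPv4 address for fqdn, or None if the lookup fails."""
    try:
        return socket.gethostbyname(fqdn)
    except OSError:
        return None


def check_dc(fqdn: str, address: str,
             resolver: Callable[[str], Optional[str]] = resolve,
             prober: Callable[[str], bool] = tcp_reachable) -> dict:
    """Test one DC by address and by FQDN, mirroring the manual checks.

    resolver and prober are injectable so the logic can be exercised
    without touching the network.
    """
    resolved = resolver(fqdn)
    return {
        "fqdn": fqdn,
        "resolved_address": resolved,
        "dns_ok": resolved is not None,
        "dns_matches": resolved == address,
        "reachable_by_address": prober(address),
        # Only probe by name when the name resolved at all.
        "reachable_by_fqdn": resolved is not None and prober(fqdn),
    }
```

For example, check_dc("dc1.example.test", "10.0.0.5") (both values hypothetical) returns a dictionary of results. A DC that is reachable by address but not by FQDN points toward name resolution rather than the network.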
Note: The network connecting all the sites should be fully routed. Don’t create a site link unless there is an underlying network link between the sites it contains.
Logical connectivity is a bit more difficult to diagnose. In short, it means that something in the AD site topology configuration is wrong, creating a hole in the topology. This can be solved by one of the following actions: configuring a preferred bridgehead server, making sure all sites are defined in site links, or making sure there is a complete mesh of sites in site links.
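A hole in the topology can be spotted on paper before touching the directory. The Python sketch below takes a list of sites and a mapping of site links to member sites, both typed in by hand (it does not query Active Directory), and reports sites that appear in no site link and groups of sites that cannot reach each other. It assumes site links are transitive, as they are when site link bridging is left enabled, so it tests reachability rather than a literal full mesh. The function name and the result keys are illustrative only.

```python
from collections import deque


def analyze_site_topology(sites, site_links):
    """Find unlinked sites and disconnected groups of sites.

    sites:      iterable of site names
    site_links: dict mapping a site link name to the sites it contains
    """
    sites = list(sites)
    known = set(sites)
    adjacency = {site: set() for site in sites}
    linked = set()

    # Every pair of sites in the same site link is treated as adjacent.
    for members in site_links.values():
        members = [m for m in members if m in known]
        linked.update(members)
        for a in members:
            adjacency[a].update(m for m in members if m != a)

    unlinked = sorted(known - linked)

    # Breadth-first search to collect connected components.
    seen = set()
    components = []
    for start in sites:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            site = queue.popleft()
            component.append(site)
            for neighbor in adjacency[site]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))

    return {
        "unlinked_sites": unlinked,
        "components": components,
        "connected": len(components) <= 1,
    }
```

With four sites where only three appear in site links, the result lists the fourth under "unlinked_sites" and shows two components, which is the kind of gap that leads to 1311 events.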
DNS also must be taken into account. Active Directory replication relies on DNS name resolution to find DCs to replicate with, so if DNS is broken, 1311 events can occur. The helpful thing here is that if DNS is the culprit, the 1311 event will have the phrase “DNS Lookup Failure” included in the description. If you see this phrase, then you absolutely, positively have a DNS problem that must be fixed.
When debugging 1311 events, you should get a scope of the entire forest to see which DCs are not replicating. You can do this easily using the Repadmin /Replsum command. These events are usually caused by the loss of physical connectivity, an incomplete AD site topology or a DNS failure, with an outside chance of an orphaned object (an object that cannot be found in the directory tree). Usually, other events will accompany them, such as the 1722 (RPC Server Unavailable), or the event will contain a descriptive statement such as “DNS Lookup Failure.” This is a critical event that must be resolved in order for Active Directory replication to function properly to all DCs.
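The triage rules summarized above are simple enough to write down as a small function. This Python sketch assumes you have already copied the 1311 event description and the IDs of any accompanying events out of the event log by hand; it does not read the log itself. The function name and the returned labels are hypothetical, and the result is only a hint about where to look first.

```python
def triage_1311(description, companion_event_ids=()):
    """Suggest where to start looking for the cause of a 1311 event.

    description:         text of the 1311 event description
    companion_event_ids: IDs of events logged alongside the 1311
    """
    hints = []

    # The article's rule: this phrase in the description means DNS.
    if "dns lookup failure" in description.lower():
        hints.append("dns")

    # 1722 (RPC Server Unavailable) points to physical connectivity.
    if 1722 in set(companion_event_ids):
        hints.append("physical connectivity")

    # With no other clue, review the site topology (and, less likely,
    # consider an orphaned object).
    if not hints:
        hints.append("site topology")

    return hints
```

For instance, a description containing “DNS Lookup Failure” together with a 1722 event yields both the DNS and physical connectivity hints, so both should be checked.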