Recently I was contacted by a colleague who was having issues with an Active Directory database. Whist there is nothing unusual in this colleague contacting me for help or vice-versa, this issue was beyond the norm.
What he had reported to me was that there was issues with the primary domain controller (PDC) and secondary domain controller (SDC) on this site having out of sync databases, which came to the fore as he was adding new devices (through WDSUtil) to be imaged, they appeared on the SDC but not on the PDC, with this causing issues predominantly with the fact they would image the machine, and get the correct name from the SDC which was also acting as the (Windows Deployment Services) WDS server but it would not bind to the domain, as there was no account for it on the PDC.
Upon further investigation (over the phone at this point) we discovered the the two domain controllers were out of sync and the tombstone had exipred, fixing this problem allowed for a partial sync as outlined below;
PDC==>SDC – Success
SDC==>PDC – Fail
PDC ==>SDC – Success
SDC==>PDC – Success
These tests were run from the “Active Directory Sites and Services” tool on the domain controllers as shown above.
Looking at the error logs it showed AD Domain Services errors of 1988 and an error stating
Active Directory Domain Services Replication encountered the existence of objects in the following partition that have been deleted from the local domain controllers (DCs) Active Directory Domain Services database. Not all direct or transitive replication partners replicated in the deletion before the tombstone lifetime number of days passed. Objects that have been deleted and garbage collected from an Active Directory Domain Services partition but still exist in the writable partitions of other DCs in the same domain, or read-only partitions of global catalog servers in other domains in the forest are known as “lingering objects”
It did also give a whole bunch of sensitive information (hence I will not publish it) stating the object that was causing it. Looking for the cause of the error I came across the repadmin (AD Replication Admin command line tool)- repadmin /removelingeringobjects ServerWithLingeringObjects CleanServerGUID NamespaceContainingLingeringObject which I ran, and I ran the replication tests again and got the same results.
So figuring I had nothing to loose I deleted the object that was referenced in the error, which in my case was a user, so I do this and try the replication again. This time I got an error stating that “An internal error occurred”, great what next. Looking at the error logs again (on the PDC, as by this time I was pretty sure it was the PDC that was causing the issues) I found an error of 467 meaning a corrupt database…. Oh SHIT… ok not that bad really but still.
I decided that I would try to repaid the database directly rather than using ADRM on the server (as I only had remote access). I stopped the Active Directory Domain Services – service in the Services Manager (services.msc) and knowing that the AD database is a JET database and that it is stored in C:\Windows\NTDS (NTDS Stands for NT Directory Services) I copied the file ntds.dit (the AD Database itself) to the desktop twice (two different file names, one to work on one to back up)
So once I had the two files I ran a verify on the database through the command esentutl /g C:\Users\<USER>\Desktop\ntds.dit the results coming back that the database is in fact corrupt so I ran the fix esentutl /p C:\Users\<USER>\Desktop\ntds.dit I then moved the fixed file back to C:\Windows\NTDS, restarted the Active Directory Domain Services – service in the Services Manager (services.msc) ran the replication tests again, and they all passed
Crisis averted, and I am now owed a good bottle of Scotch Whisky
This was all done over a remote session so it is possible