Hello folks, this is Herbert from the Directory Services support team in Europe!
Kerberos is becoming increasingly mandatory for really cool features such as Protocol Transition. Moreover, as you might be painfully aware, managing Service Principal Names (SPN’s) for the use of Kerberos by applications can be daunting at times.
In this blog, we will not be going into the gory details of SPNs and how applications are using them. In fact, I’m assuming you already have some basic knowledge about SPN’s and how they are used.
Instead, we’re going to talk about an interesting behavior that can occur when an administrator is doing their due diligence managing SPN’s. This behavior can arise when you are checking the status of the account the SPN is planned for, or when you are checking to see if the SPN that must be registered is already registered in the domain or forest.
As we all know, the KDC’s cannot issue tickets for a particular service if there are duplicate SPN’s, and authentication does not work if the SPN is on the wrong account.
Experienced administrators learn to use the SETSPN utility to validate SPNs when authentication problems occur. In the Windows Server 2008 version of SETSPN, we provide several options useful to identifying duplicate SPNs:
- If you want to look for a duplicate of a particular SPN: SETSPN /q <SPN>
- If you want to search for any duplicate in the domain: SETSPN /x
You can also use the “/f” option to extend the duplicate search to the whole Forest. Many Active Directory Admins use this as a proactive check of the forest for duplicate SPNs.
So far, so good…
The Problem
Sometimes, you’ll get an error running SETSPN -x -f:
c:\>SETSPN -X -F -P
Checking forest DC=contoso,DC=com
Operation will be performed forestwide, it might take a while.
Ldap Error(0x55 -- Timeout): ldap_get_next_page_s
“-P” just tells the tool not to clutter the output with progress indications, but you can see from that error message that we are not talking only about Kerberos anymore. There is a new problem.
What are we seeing in the diagnostic data?
In a network trace of the above you will see a query against the GC (port 3268) with no base DN and the filter “(servicePrincipalName=*)”. SETSPN uses paged queries with a page size of 100 objects. In a large Active Directory environment this yields quite a number of pages.
If you look closely at network capture data, you’ll often find that Domain Controller response times slowly increase towards the end of the query. If the command completes, you’ll sometimes see that the delay is longest on the last page returned. For example, when we reviewed data for a recent customer case, we noted:
”Customer also noticed that it usually hangs on record 84.”
Troubleshooting LDAP performance and building custom queries calls for the use of the STATS Control. Here is how you use it in LDP.exe:
Once connected to port 3268 and logged on as an admin, you can build the query in the same manner as SETSPN does.
1. Launch LDP as an administrator.
2. Open the Search Window using Browse\Search or Ctrl-S.
3. Enter the empty base DN and the filter, specify “Subtree” as the scope. The list of attributes does not matter here.
4. Go to Options:
5. Specify an “Extended” query as we want to use controls. Note I have specified a page size of 100 elements, but that is not important, as we will see later. Let’s move on to “Controls”:
5. From the List of Controls select “Search Stats“. When you select it, it automatically checks it in.
6. Now “OK” your way out of the “Controls” and “Options” dialogs.
7. Hit “Run” on the “Search” dialog.
You should get a large list of results, but also the STATS a bit like this one:
Call Time: 62198 (ms)
Entries Returned: 8508
Entries Visited: 43076
Used Filter: (servicePrincipalName=*)
Used Indices: idx_servicePrincipalName:13561:N
Pages Referenced : 801521
Pages Read From Disk : 259
Pages Pre-read From Disk : 1578
Pages Dirtied : 0
Pages Re-Dirtied : 0
Log Records Generated : 0
Log Record Bytes Generated: 0
What are these stats telling us?
We have a total of 8508 objects in the “Entries Returned” result set, but we have visited 43076 objects. That sounds odd, because we used an Index “idx_servicePrincipalName”. This does not really look as if the query is using the index.
So what is happening here?
At this point, we experience the special behavior of multi-valued non-linked attributes and how they are represented in the index. To illustrate this, let me explain a few data points:
1. A typical workstation or member server has these SPNs:
servicePrincipalName:
WSMAN/herbertm5
servicePrincipalName:
WSMAN/herbertm5.europe.contoso.com
servicePrincipalName:
TERMSRV/herbertm5.europe.contoso.com
servicePrincipalName:
TERMSRV/HERBERTM5
servicePrincipalName:
RestrictedKrbHost/HERBERTM5
servicePrincipalName:
HOST/HERBERTM5
servicePrincipalName:
RestrictedKrbHost/HERBERTM5.europe.contoso.com
servicePrincipalName:
HOST/HERBERTM5.europe.contoso.com
2. When you look at the result set from running setspn, you notice that you’re not getting all of the SPNs you’d expect:
dn:CN=HQSCCM2K3TEST,OU=SCCM,OU=Test Infrastructure,OU=Domain Management,DC=contoso,DC=com
servicePrincipalName: WSMAN/sccm2k3test
servicePrincipalName: WSMAN/sccm2k3test.contoso.com
If you look at it closely, you notice all the SPN’s start with characters very much at the end of the alphabet, which also happens to be the end of the index. These entries do not have a prefix like “HOST”.
So how does this happen?
In the resultant set of LDAP queries, an object may only appear once, but it is possible for an object to be in the index multiple times, because of the way the index is built. Each time the object is found in the index, the LDAP Server has to check the other values of the indexed attribute of the object to see whether it also matches the filter and thus was already added to the result set. The LDAP server is doing its diligence to avoid returning duplicates.
For example, the first hit in the index for the above workstation example is “HOST/HERBERTM5“.
The second hit“HOST/HERBERTM5.europe.contoso.com“ kicks off the algorithm.
The object is read already and the IO and CPU hit has happened.
Now the query keeps walking the index, and once it arrives at the prefix “WSMAN”, the rate of objects it needs to skip approaches 100%. Therefore, it looks at many objects and little additional objects in the result set.
On the last page of the query, things get even worse. There is an almost 100% rate of duplicates, so the clock of 60 seconds SETSPN allows for the query is ticking, and there are only 8 objects to be found. If the Domain Controller has a slow CPU or the objects need to be read from the disk because of memory pressure, the SETSPN query will probably not finish within a minute for a large forest. This results in the error Ldap Error(0x55 -- Timeout): ldap_get_next_page_s. The larger the index (meaning, the more computers and users you have in your forest), the greater the likelihood that this can occur.
If you run the query with LDIFDE, LDP or ADFIND you will have a better chance the query will be successful. This is because by default these tools do not specify a time-out and thus use the values of the Domain Controller LDAP Policy. The Domain Controller LDAP policy is 120 seconds (by default) instead of 60 seconds.
The problem with the results generated by these tools is that you have to correlate the results from the different outputs yourself – the tools won’t do it for you.
So what can you do about it?
Typically you’ll have to do further troubleshooting, but here are some common causes/resolutions that I’ve seen:
- If a domain controller is short on memory and encounters many cache misses and thus substantial IO. You can diagnose this using the NTDS performance counters in Performance Monitor. You can add memory to reduce the IO rate and speed things up.
- If you are not experiencing memory pressure, the limiting factor could be the “Single-Thread-Performance” of the server. This is important as every LDAP query gets a worker thread and runs no faster than one logical CPU core can manage. If you have a low number of logical cores in a system with a high amount of CPU activity, this can cause the threads to delay long enough for us to see an nconsistent query return. In this situation your best bet is to look for ways to reduce overall processor load on the domain controller – for example, moving other services off of the machine.
- There is an update for Windows Server 2012 which helps to avoid the problem:
2799960 Time-out error when you run SETSPN.exe in Windows 8 or Windows Server 2012
http://support.microsoft.com/kb/2799960/EN-US
The last customer I helped had a combination of issues 1 and 2 and once he chose a beefier DC with more current hardware, the command always succeeded. Another customer had a much bigger environment and ended up using the update I listed above to overcome the issue.
I hope you have enjoyed this journey explaining what is happening on such a SETSPN query.
Cheers,
Herbert "The Thread Master" Mauerer