Once again, this one comes from a case that was escalated to me, and it was a very interesting one. I am including my probing questions because they show how we narrowed down the issue: the problem was somewhat misunderstood in the beginning, and troubleshooting had been going around name resolution not working with Teredo clients.
Probing discussion with the UAG admin to narrow down the problem he was facing.
Questions and respective answers
1. What is the issue with DA (DirectAccess)?
Answer: Client DA connectivity does not work at the time of the issue, when we are using Teredo.
2. Does the problem happen when all the users use IP-HTTPS?
Answer: No.
3. Does the problem happen only with Teredo?
Answer: Yes, BUT only under extreme or heavy load, not during usual load.
After this discussion, it was clear that Teredo itself works, but for some reason it breaks under heavy load.
The benefit of putting these questions in my post is to point out that effective probing can help you narrow down the issue. Before I was engaged, they had been troubleshooting name resolution and the DNS proxy on UAG, and that direction of troubleshooting was incorrect.
So, to dig deeper, I collected scenario tracing from the client as below.
Run the following two commands in the command prompt:
- Netsh trace start scenario=directaccess capture=yes report=yes tracefile=C:\client.etl
- Netsh wfp capture start
Then restart the IP Helper service to initiate the DA connectivity again:
- net stop iphlpsvc (to stop the IP Helper service)
- net start iphlpsvc (to start the IP Helper service)
Then stop the traces by running the following two commands in the command prompt:
- Netsh wfp capture stop
- Netsh trace stop
I took the same steps (minus restarting the IP Helper service) on the server side to collect DA scenario tracing.
Then I checked the server-side captures. Since we knew that the problem happens under high load, I checked the number of Teredo neighbors in \CabFolder\config\neighbors.txt.
***************************
Internet Address Physical Address Type
-------------------------------------------- ----------------- -----------
x x x x x x xx x Reachable
x x x x x x xx x Unreachable
x x x x x x xx x Probe
I found the number of entries to be greater than 3000, while the default neighbor cache limit is 256, as per http://technet.microsoft.com/en-us/library/ee844188(v=WS.10).aspx:
"Note the number in the Neighbor Cache Limit field, which by default is 256."
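Counting thousands of rows by eye is error prone, so here is a minimal Python sketch of how the entries in neighbors.txt could be tallied by state. It assumes the file follows the three-column layout shown above (address, physical address, type); the function name, the sample text and the list of state names are my own illustration, not something produced by the trace itself.

```python
from collections import Counter

# Neighbor states as they typically appear in the Type column (assumed list).
STATES = {"Reachable", "Unreachable", "Probe", "Stale",
          "Delay", "Incomplete", "Permanent"}

def count_neighbor_states(text):
    """Count neighbor cache rows per state in neighbors.txt-style text."""
    counts = Counter()
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # blank lines and separators
        # Skip the first token (the address) and look for a state keyword.
        for token in tokens[1:]:
            if token in STATES:
                counts[token] += 1
                break
    return counts

# Hypothetical sample using the same redacted layout as the capture above.
SAMPLE = """\
Internet Address Physical Address Type
-------------------------------------------- ----------------- -----------
x x x x x x xx x Reachable
x x x x x x xx x Unreachable
x x x x x x xx x Probe
"""

counts = count_neighbor_states(SAMPLE)
print("total entries:", sum(counts.values()))
print(dict(counts))
```

Header and separator lines are ignored because they contain no state keyword, so the total reflects only real neighbor rows. Comparing that total against the configured limit is what pointed us to the cause here.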
So we checked this value on the server (on all the nodes) using the command
netsh interface ipv6 show global
As expected, it was 256, i.e. the default.
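When there are several nodes to check, the value can be pulled out of the command output with a small script. The sketch below is only an illustration: it assumes the output contains a line of the form "Neighbor Cache Limit : 256 entries per interface", and the function name and sample text are hypothetical.

```python
import re

def parse_neighbor_cache_limit(output):
    """Return the Neighbor Cache Limit as an int, or None if not found."""
    match = re.search(r"Neighbor Cache Limit\s*:\s*(\d+)", output)
    return int(match.group(1)) if match else None

# Assumed excerpt of "netsh interface ipv6 show global" output.
SAMPLE_OUTPUT = """\
General Global Parameters
---------------------------------------------
Default Hop Limit                     : 128 hops
Neighbor Cache Limit                  : 256 entries per interface
Route Cache Limit                     : 128 entries per compartment
"""

limit = parse_neighbor_cache_limit(SAMPLE_OUTPUT)
print("neighbor cache limit:", limit)
```

On a real server, the text passed to the function would be the captured output of the netsh command above. Returning None when the line is missing makes it obvious if the output format differs from what is assumed here.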
We then raised the limit using the following command
netsh interface ipv6 set global neighborcachelimit=Maximum
where Maximum is a value chosen as per the requirement, e.g. 6000. After we increased this to a higher value, the issue never recurred.