We had our first significant outage with ADFS this weekend. During a Sunday morning change control we updated the communication certificates on all our STS and Proxy servers and promoted a newer signing certificate from secondary to primary, following the directions at AD FS 2.0: How to Replace the SSL, Service Communications, Token-Signing, and Token-Decrypting Certificates. Because our PKI was recently changed, the new signing certificate chained up to a new root, but all of our Dev and QA tests were successful on the new chain.
All changes tested out successfully; our relying parties that only trust one certificate had switched to trusting the new signing certificate and users could still access the relying parties. So the change control was closed.
Monday morning we received notification that users connecting externally were receiving an error message rather than getting to the Forms-Based Logon page. What was odd about this outage was that all our internal access to ADFS was fine; only external access through the proxy servers was having issues.
The proxy servers' ADFS logs were filling with Event ID 364 errors:
Encountered error during federation passive request.
System.ServiceModel.Security.MessageSecurityException: An unsecured or incorrectly secured fault was received from the other party. See the inner FaultException for the fault code and detail. ---> System.ServiceModel.FaultException: An error occurred when verifying security for the message.
--- End of inner exception stack trace ---
Server stack trace:
at System.ServiceModel.Channels.SecurityChannelFactory`1.SecurityRequestChannel.ProcessReply(Message reply, SecurityProtocolCorrelationState correlationState, TimeSpan timeout)
at System.ServiceModel.Channels.SecurityChannelFactory`1.SecurityRequestChannel.Request(Message message, TimeSpan timeout)
at System.ServiceModel.Dispatcher.RequestChannelBinder.Request(Message message, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object ins, Object outs, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)
Exception rethrown at :
at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
at Microsoft.IdentityServer.Protocols.PolicyStore.IPolicyStoreReadOnlyTransfer.GetState(String serviceObjectType, String mask, FilterData filter, Int32 clientVersionNumber)
at Microsoft.IdentityServer.PolicyModel.Client.PolicyStoreReadOnlyTransferClient.GetState(String serviceObjectType, String mask, FilterData filter, Int32 clientVersionNumber)
System.ServiceModel.FaultException: An error occurred when verifying security for the message.
Our first troubleshooting step was to restart the ADFS service on the proxy server. When we did that, it logged an Event ID 248 error:
The federation server proxy was not able to retrieve the list of endpoints from the Federation Service at corp.sts.WIDGETS.com. The error message is 'An unsecured or incorrectly secured fault was received from the other party. See the inner FaultException for the fault code and detail.'.
Make sure that the Federation Service is running. Troubleshoot network connectivity. If the trust between the federation server proxy and the Federation Service is lost, run the Federation Server Proxy Configuration Wizard again.
Frustratingly, no inner FaultException was present.
We re-ran the Federation Server Proxy Configuration Wizard and it completed successfully, but the same 248 error occurred at service start. We also verified that the new signing cert did chain up to a root that the proxy server trusted. Turning up full debug on the proxy server did not provide any additional useful data.
On a functional proxy server one expects service start to result in Event ID 245:
The federation server proxy retrieved the following list of endpoints from the Federation Service at 'https://corp.sts.WIDGETS.com:443/adfs/services/proxytrustpolicystoretransfer':
To help isolate the problem we configured a local hosts entry on the proxy server to bypass the load balancers and hit a single STS. A NetMon trace of the service start showed the SSL handshake and traffic going only to the expected internal STS. As we saw no traffic trying to go anywhere except to the STS, we were fairly certain there wasn’t an issue with validating the new chain. Given the error message about receiving an incorrectly secured response, and that we had just changed all the certificates, we were also fairly certain the switchover was the problem, but we had yet to figure out the solution.
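As a complement to the hosts-file trick, here is a minimal Python sketch that connects to one specific STS by IP address while presenting the federation service name, then prints the thumbprint of the certificate that server hands back. It only inspects the SSL (service communications) certificate seen in the handshake, not the token-signing certificate, and the IP address in the usage comment is a made-up placeholder. Older servers may also only offer TLS versions that a current Python build refuses by default, so treat this as an illustration rather than a ready-made tool.

```python
import hashlib
import socket
import ssl


def thumbprint(der_bytes: bytes) -> str:
    """Return a SHA-1 thumbprint as uppercase hex, the form Windows displays."""
    return hashlib.sha1(der_bytes).hexdigest().upper()


def presented_cert_thumbprint(ip: str, hostname: str, port: int = 443) -> str:
    """Connect to one server by IP and return the thumbprint of its SSL cert."""
    context = ssl.create_default_context()
    # Verification is switched off on purpose: the goal is to look at the
    # certificate a single STS presents, not to decide whether to trust it.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((ip, port), timeout=10) as raw:
        # server_hostname sends the federation service name, much as a
        # hosts-file override would, while the socket targets one server.
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            return thumbprint(tls.getpeercert(binary_form=True))


# Example call (10.0.0.10 is a hypothetical internal STS address):
# print(presented_cert_thumbprint("10.0.0.10", "corp.sts.WIDGETS.com"))
```

Comparing the printed value with the thumbprint of the newly installed communication certificate on each STS in turn would confirm that every pool member is serving the certificate you expect.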
Finally we decided to try restarting the ADFS service on the STS the proxy server was using, even though that STS was not exhibiting any errors. So we restarted the STS and restarted the proxy, and the proxy service started without error. SUCCESS!!
We restarted the service on the other STSs in the pool and restarted our other proxy server and it started working as well.
My guess as to what happened is that the proxy servers reached their 4-hour trust renewal cycle after the change control verification had completed. At that time, I am guessing the SOAP responses to the proxytrustpolicystoretransfer endpoint requests were still being signed with the old signing certificate when the proxy was expecting them to be signed with the new one, hence the “incorrectly secured” error. I’m guessing the service restart forced the STS to pick up the new certificate for signing its SOAP responses for the proxytrustpolicystoretransfer endpoint. I’m also guessing we missed this in Dev and QA because proxy usage is a secondary use case and was likely tested after a service restart or server reboot on the STS.
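To make the timing argument concrete, here is a small Python sketch that works out when the first trust renewal would fall after a given moment, assuming the 4-hour cycle guessed at above. The function name and the timestamps are hypothetical and chosen only for illustration; the real renewal times were not recorded.

```python
from datetime import datetime, timedelta

# Cycle length taken from the guess in the text, not from documentation.
RENEWAL_INTERVAL = timedelta(hours=4)


def first_renewal_after(last_renewal, moment, interval=RENEWAL_INTERVAL):
    """Return the first scheduled renewal strictly after `moment`."""
    if moment < last_renewal:
        return last_renewal
    # Whole cycles already elapsed, plus one to land after `moment`.
    cycles = (moment - last_renewal) // interval + 1
    return last_renewal + cycles * interval


# Hypothetical timeline: the proxy last renewed at 06:00 and the change
# control was verified and closed at 09:00 the same day.
last = datetime(2000, 1, 2, 6, 0)
closed = datetime(2000, 1, 2, 9, 0)
print(first_renewal_after(last, closed))
```

With those made-up times the next renewal lands an hour after the change was closed, which is the kind of gap that would let every post-change test pass and still leave the proxies to fail later.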
We’re still waiting on our Microsoft PFE to return with root cause analysis to see if Microsoft acknowledges a bug in the certificate handling. But for now, the short story is to cycle the STS services when rolling to new certificates.
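The restart order that worked for us (every STS first, then the proxies) can be sketched as a small Python helper that builds the matching sc.exe commands. The host names are hypothetical, and the service name adfssrv is my assumption for the AD FS 2.0 Windows Service on both roles, so check it on your own servers before relying on this.

```python
SERVICE = "adfssrv"  # assumed service name; verify on your servers


def restart_plan(sts_hosts, proxy_hosts, service=SERVICE):
    """Return sc.exe stop/start commands, cycling every STS before any proxy."""
    plan = []
    for host in list(sts_hosts) + list(proxy_hosts):
        target = rf"\\{host}"  # sc.exe expects a \\server prefix for remote hosts
        plan.append(["sc.exe", target, "stop", service])
        plan.append(["sc.exe", target, "start", service])
    return plan


# Hypothetical host names, printed rather than executed.
for command in restart_plan(["sts01", "sts02"], ["proxy01", "proxy02"]):
    print(" ".join(command))
```

The sketch only prints the commands. Since sc.exe returns before a stop has finished, anything that actually runs them would need to wait for the service to reach the stopped state before issuing the start.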
I have updated the instructions in the AD FS 2.0: How to Replace the SSL, Service Communications, Token-Signing, and Token-Decrypting Certificates wiki article to include the STS service restart step.