Follow-up of PIAB and WCF Article

Posted by Hugh Ang at 9/23/2008 11:09:00 PM

Since publishing the MSDN article on an approach to integrate PIAB into WCF, I have received quite a few feedback from folks using this approach in production. And I am glad to say it's working out successfully for those folks.

As I mentioned earlier in the comment, Tom Hollander of p&p group has found an issue with our approach when two PIAB-enabled WCF services are hosted in the same IIS worker process. He graciously shared his code with us and I was able to reproduce the problem when the two WCF services were hosted in two separate AppDomains of the same IIS worker process. Although in production deployment scenarios, different WCF services would likely be hosted in separate processes or on different machines, I have still been wanting to figure out the root cause of this particular issue. Unfortunately with work and other things in my life, I just haven't had time until last week when I was finally able to sit down and focused on this problem.

I have to admit this has been the most tedious debugging I have ever done as I was deep in the guts of WCF and CLR, inspecting dynamic types, JIT compiler, call stubs, etc. without a whole lot of background in this area - I only wish I were working for the CLR team :-) After a few days, I finally found out the cause of the problem, which is pretty close to my initial hunch. A sense of elation at last!

Here it goes.

1. The setup to repro the problem
The setup is fairly straightforward. There are two WCF services, each implementing IService and IAnotherService respectively (the code is pretty much verbatim from Tom):


[ServiceContract]

public interface IService

{

    [OperationContract]

    Foo GetFoo(int id);

 

    [OperationContract]

    void AddFoo(int i, [NotNullValidator] Foo foo);

}




[ServiceContract]

public interface IAnotherService

{

    [OperationContract]

    void LogFoo(Foo foo);

}



Both services would be enabled with PIAB of course:


[PolicyInjectionBehaviors.PolicyInjectionBehavior]

[ValidationCallHandler]

[LogCallHandler]

public class Service : IService

{

    private static Dictionary<int, Foo> store = new Dictionary<int, Foo>();

    IAnotherService anotherService;

 

    public Service()

    {

        BasicHttpBinding binding = new BasicHttpBinding();

        binding.SendTimeout = new TimeSpan(4, 0, 0);

 

        anotherService = new AnotherServiceClient(binding, new EndpointAddress("http://localhost/AnotherTestService/AnotherTestService.svc"));

    }

 

    public void AddFoo(int id, Foo foo)

    {

        store[id] = foo;

        anotherService.LogFoo(foo);

    }

 

    public Foo GetFoo(int id)

    {

        if (store.ContainsKey(id))

        {

            anotherService.LogFoo(store[id]);

            return store[id];

        }

        else

            return null;

    }

}




[PolicyInjectionBehaviors.PolicyInjectionBehavior]

[ValidationCallHandler]

[LogCallHandler]

public class AnotherService : IAnotherService

{

    public void LogFoo(Foo foo)

    {

        Logger.Write("LogFoo() called.");

    }

}



Foo is simply a DataContract object that holds both an int and a string properties.

Now the services would be hosted two AppDomains in the same IIS worker process. This is how it looks like on my Vista machine:



As you can see both services are in the DefaultAppPool. With both services set up, we can run the test harness, which first calls Service.AddFoo() and then Service.GetFoo(). The Service.Foo() is completed fine but Service.GetFoo() call fails with a System.Reflection.TargetException: Object does not match target. The stack trace is as follows:


1017e344 79644832 System.Reflection.RuntimeMethodInfo.CheckConsistency(System.Object)
1017e350 793a4124 System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo, Boolean)
1017e39c 793a40a2 System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
1017e3bc 0f96e699 Microsoft.Practices.EnterpriseLibrary.PolicyInjection.RemotingInterception.InterceptingRealProxy+<>c__DisplayClass1.b__0(Microsoft.Practices.EnterpriseLibrary.PolicyInjection.IMethodInvocation, Microsoft.Practices.EnterpriseLibrary.PolicyInjection.GetNextHandlerDelegate)
1017e3ec 0f968ed1 Microsoft.Practices.EnterpriseLibrary.PolicyInjection.HandlerPipeline.Invoke(Microsoft.Practices.EnterpriseLibrary.PolicyInjection.IMethodInvocation, Microsoft.Practices.EnterpriseLibrary.PolicyInjection.InvokeHandlerDelegate)
1017e404 0f968a2e Microsoft.Practices.EnterpriseLibrary.PolicyInjection.RemotingInterception.InterceptingRealProxy.Invoke(System.Runtime.Remoting.Messaging.IMessage)
1017e418 79374dc3 System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(System.Runtime.Remoting.Proxies.MessageData ByRef, Int32)
1017e6b4 79f98b43 [TPMethodFrame: 1017e6b4] WcfPiabInstability.Services.IAnotherService.LogFoo(WcfPiabInstability.Services.Foo)
1017e6c4 0efd08dc DynamicClass.SyncInvokeGetFoo(System.Object, System.Object[], System.Object[])
1017e6d4 50b8d90b System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke(System.Object, System.Object[], System.Object[] ByRef)
1017e74c 50b6d245 System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin(System.ServiceModel.Dispatcher.MessageRpc ByRef)
1017e7a0 509137ad System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage5(System.ServiceModel.Dispatcher.MessageRpc ByRef)
1017e7e0 509136a6 System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage4(System.ServiceModel.Dispatcher.MessageRpc ByRef)
1017e80c 50913613 System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage3(System.ServiceModel.Dispatcher.MessageRpc ByRef)
1017e81c 50913459 System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage2(System.ServiceModel.Dispatcher.MessageRpc ByRef)
1017e82c 50912257 System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage1(System.ServiceModel.Dispatcher.MessageRpc ByRef)
1017e844 50911f8f System.ServiceModel.Dispatcher.MessageRpc.Process(Boolean)
1017e888 509115ff System.ServiceModel.Dispatcher.ChannelHandler.DispatchAndReleasePump(System.ServiceModel.Channels.RequestContext, Boolean, System.ServiceModel.OperationContext)
1017ea34 5090f8c9 System.ServiceModel.Dispatcher.ChannelHandler.HandleRequest(System.ServiceModel.Channels.RequestContext, System.ServiceModel.OperationContext)
1017ea78 5090f35e System.ServiceModel.Dispatcher.ChannelHandler.AsyncMessagePump(System.IAsyncResult)
1017ea8c 5090f2f1 System.ServiceModel.Dispatcher.ChannelHandler.OnAsyncReceiveComplete(System.IAsyncResult)
1017ea98 50232d68 System.ServiceModel.Diagnostics.Utility+AsyncThunk.UnhandledExceptionFrame(System.IAsyncResult)
1017eac4 50904501 System.ServiceModel.AsyncResult.Complete(Boolean)
1017eb00 50992b36 System.ServiceModel.Channels.InputQueue`1+AsyncQueueReader[[System.__Canon, mscorlib]].Set(Item)
1017eb14 50992215 System.ServiceModel.Channels.InputQueue`1[[System.__Canon, mscorlib]].EnqueueAndDispatch(Item, Boolean)
1017eb7c 50991ffb System.ServiceModel.Channels.InputQueue`1[[System.__Canon, mscorlib]].EnqueueAndDispatch(System.__Canon, System.ServiceModel.Channels.ItemDequeuedCallback, Boolean)
1017eba4 5091d7e5 System.ServiceModel.Channels.SingletonChannelAcceptor`3[[System.__Canon, mscorlib],[System.__Canon, mscorlib],[System.__Canon, mscorlib]].Enqueue(System.__Canon, System.ServiceModel.Channels.ItemDequeuedCallback, Boolean)
1017ebc8 50977b7e System.ServiceModel.Channels.HttpChannelListener.HttpContextReceived(System.ServiceModel.Channels.HttpRequestContext, System.ServiceModel.Channels.ItemDequeuedCallback)
1017ec0c 5094f396 System.ServiceModel.Activation.HostedHttpTransportManager.HttpContextReceived(System.ServiceModel.Activation.HostedHttpRequestAsyncResult)
1017ec50 5094e4cf System.ServiceModel.Activation.HostedHttpRequestAsyncResult.HandleRequest()
1017ec68 5094defd System.ServiceModel.Activation.HostedHttpRequestAsyncResult.BeginRequest()
1017eca4 5094dea5 System.ServiceModel.Activation.HostedHttpRequestAsyncResult.OnBeginRequest(System.Object)
1017ecd0 50903c3c System.ServiceModel.Channels.IOThreadScheduler+CriticalHelper+WorkItem.Invoke2()
1017ed0c 50903b26 System.ServiceModel.Channels.IOThreadScheduler+CriticalHelper+WorkItem.Invoke()
1017ed20 50903ab5 System.ServiceModel.Channels.IOThreadScheduler+CriticalHelper.ProcessCallbacks()
1017ed54 5090390f System.ServiceModel.Channels.IOThreadScheduler+CriticalHelper.CompletionCallback(System.Object)
1017ed80 5090388b System.ServiceModel.Channels.IOThreadScheduler+CriticalHelper+ScheduledOverlapped.IOCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
1017ed8c 50232e1f System.ServiceModel.Diagnostics.Utility+IOCompletionThunk.UnhandledExceptionFrame(UInt32, UInt32, System.Threading.NativeOverlapped*)
1017edc0 79405534 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
1017ef60 79e7c74b [GCFrame: 1017ef60]
1017f0b8 79e7c74b [ContextTransitionFrame: 1017f0b8]


2. Culprit - a subtle bug in the CLR?
What is going on here? The exception is being thrown by System.Reflection.RuntimeMethodInfo.CheckConsistency() but the problem happened earlier. Notice the red in those two lines of the stack trace. DynamicClass.SyncInvokeGetFoo is the function generated by WCF using Lightweight Code Gen (LCG) to facilitate the GetFoo() call, which is in IService definition. But the next one on the call stack somehow becomes IAnotherService.LogFoo() - please be reminded that IAnotherService.LogFoo() call on the stack here is not to be confused with the one inside the Service.GetFoo() as the call hasn't reached to the actual object yet when exception happens. Using SOS command !clrstack -p and !dumpobject reveals that the object reference in the call context here is the PIAB proxy to the Service object (tp->rp->real object), which implements IService but not IAnotherService. The mismatch simply didn't manifest into an exception until later in System.Reflection.RuntimeMethodInfo.CheckConsistency(). So where the exception is thrown is not that important. We need to understand why IService.GetFoo() suddenly becomes IAnotherService.LogFoo().

Let's review the high level picture of how the calls are being dispatched from the client site (we will only consider synchronous calls for now):


  1. Client makes a call by sending an XML message to the WCF service

  2. WCF processes the message in the pipeline before it finally dispatches the call through System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke() as shown on the stack trace. System.ServiceModel.Dispatcher.InvokerUtil uses LCG to generate a delegate SyncInvokeXXXX, where XXXX is the target method name and SyncInvoke stands for synchronous call. SyncMethodInvoker.Invoke() simply passes control to SyncInvokeXXXX(). This part is not that difficult to figure out by using Reflector.

  3. CLR takes the buck from here. Starting with 2.0, .NET uses dispatch stub to handle interface calls. The runtime figures out the method disptach token and sends it along with the target object reference to mscorwks!ResolveWorkerAsmStub, which calls mscorwks!VirtualCallStubManager::ResolveWorkerStatic and then the heavy lifting mscorwks!VirtualCallStubManager::ResolveWorker to figure out the stub that contains the assembly code to make the actual call.

  4. PIAB proxy gets called. This is where the injection magic happens and service method finally gets called.



The dispatch token is a 32 bit integer with hi as the type id of the interface and lo as the slot number of the method as shown in the Rotor(SSCLI) source code:


 

static const UINT_PTR MASK_TYPE_ID       = 0x0000FFFF;

static const UINT_PTR MASK_SLOT_NUMBER   = 0x0000FFFF;

 

static const UINT_PTR SHIFT_TYPE_ID      = 0x10;

static const UINT_PTR SHIFT_SLOT_NUMBER  = 0x0;

 

//------------------------------------------------------------------------

// Combines the two values into a single 32-bit number.

static UINT_PTR CreateToken(UINT32 typeID, UINT32 slotNumber)

{

    LEAF_CONTRACT;

    CONSISTENCY_CHECK(((UINT_PTR)typeID & MASK_TYPE_ID) == (UINT_PTR)typeID);

    CONSISTENCY_CHECK(((UINT_PTR)slotNumber & MASK_SLOT_NUMBER) == (UINT_PTR)slotNumber);

    return ((((UINT_PTR)typeID & MASK_TYPE_ID) << SHIFT_TYPE_ID) |

            (((UINT_PTR)slotNumber & MASK_SLOT_NUMBER) << SHIFT_SLOT_NUMBER));

}



Type ids are integer identifiers to represent types in an AppDomain. Slot numbers are integer values representing entries of interface methods in the method table. In the example I am using, IService has a type id of 0x0003 and AddFoo has a slot number of 0x0001, therefore yielding a token of 0x00030001. IService.GetFoo has a dispatch token of 0x00030000. And IAnotherService.LogFoo also has a token of 0x0003000. You see that both IService and IAnotherService, living in two AppDomains, happen to have the same type id: 0x0003. It is a coincidence, but not to be ignored.

The reason why the dispatch token is critical here is because of the following:


  • all interface disptach stubs for our PIAB enabled services are handled by the VirtualCallStubManager in the shared domain of the process.

  • the stub manager keeps a hash table to cache the stub code using those two keys: token and object type. In our example, the object type is always a transparent proxy for PIAB-enabled services. So effectively token becomes the only key that matters.

  • the heavy lifting mscorwks!VirtualCallStubManager::ResolveWorker() is responsible for generating and caching the stub. Obviously it always first checks if there is a cached entry. If one is found using the keys, that entry will be returned.



When our unit test harness calls Service.AddFoo, which internally calls AnotherService.LogFoo, two dispatch stubs are created and cached by the shared stub manager, with tokens 0x00030001 and 0x00030000 as the effective key respectively. Now the unit test makes a different call Service.GetFoo. Note that IService.GetFoo also has the dispatch token as 0x00030000, same as IAnotherSevice.LogFoo, despite the fact they are two types in different domains. The stub manager of the shared domain hands out the previously cached dispatch stub for IAnotherService.LogFoo. This is why we saw the strange call stack above and the call eventually fails.

To further prove this, I changed the definition of IAnotherService to the following to include 5 additional functions as fillers to the method table slots:


[ServiceContract]

public interface IAnotherService

{

    // fillers

    void Filler1();

    void Filler2();

    void Filler3();

    void Filler4();

    void Filler5();

 

    [OperationContract]

    void LogFoo(Foo foo);

}



The purpose is to alter the slot number of IAnotherService.LogFoo to avoid the token clash. As shown in the following Windbg snippet, I did indeed get a token of 0x00030005 as opposed to the 0x0003000 that I had earlier:


eax=00000003 ebx=0e3f6228 ecx=79e89e87 edx=00000003 esi=00000005 edi=01c45b18
eip=79eb45cf esp=1057dd4c ebp=1057dd80 iopl=0 nv up ei pl nz ac po cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000213
mscorwks!VirtualCallStubManager::GetCallStub+0x34:
79eb45cf c1e010 shl eax,10h
0:025> p
eax=00030000 ebx=0e3f6228 ecx=79e89e87 edx=00000003 esi=00000005 edi=01c45b18
eip=79eb45d2 esp=1057dd4c ebp=1057dd80 iopl=0 nv up ei pl nz ac pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000216
mscorwks!VirtualCallStubManager::GetCallStub+0x37:
79eb45d2 0bf0 or esi,eax
0:025> p
eax=00030000 ebx=0e3f6228 ecx=79e89e87 edx=00000003 esi=00030005 edi=01c45b18
eip=79eb45d4 esp=1057dd4c ebp=1057dd80 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206


And the CLR happily makes all the calls without the dreadful exception!

3. Summary
It is interesting to see how the CLR team may have tried to safeguard the clashing of the keys by using both token and the object type. In our case, the object type, being the transparent proxy for PIAB-enabled services, takes that key out of the equation. The token, although having HI corresponding to the interface type and LO to the slot number, is scoped within the AppDomain where the type is loaded. The fewer the number of types and the fewer the methods of the operation contracts are defined for the WCF service in each AppDomain, the bigger the odds key clashing like this will happen.

So what is the solution? Could other interface method related values such as MethodDesc, which seems to be unique across app domains, be a better candidate? That is the question for the CLR team.

As for us who want to integrate PIAB into WCF while minimizing development team efforts, we should be fine by hosting WCF services in separate processes. If you really, really want to host everything in one IIS worker process like our contrived example, you can get around this issue by not using the default PIAB interception mechanism. Suppose you can come up with a LCG mechanism to generate one dynamic proxy (instead of System.Runtime.Remoting.Proxies.__TransparentProxy) for each target. Having a different object type as each service's PIAB intercepting proxy will therefore avoid the key clashing.

4. Tools etc.
Debugging is always an enlightening experience. And I can't imagine what life is going to be without Windbg and Reflector. Visual Studio 2008 is cool since you can configure it to debug into .NET framework code. But if the symbols are not available for the module you want to investigate or you need to dig deeper below the FCL layer, Reflector and Windbg are just indispensable.

Also, having SSCLI source code is absolutely wonderful. Without it, debugging through machine code in windbg would be a lot harder.

Most of the SSCLI code for this exercise is located inside virtualcallstub.cpp file.

3 comments:

Twan said...

Hi Hugh,
Excellent post. I was just reading the original article and Tom's blog. I'm just wondering... does this clash still occurr in the latest bits (EntLib 4.1)? I've updated the original InstanceProvider following David's pointers in http://www.mtelligent.com/journal/2008/12/27/integrating-wcf-with-the-piab-in-entlib-41.html.

All things aside... Would you still do PIAB/WCF integration this way, or is there a better alternative now that WCF 3.5 and EntLib 4.1 are available.

Thanks in advance for your time,
Twan

Hugh Ang said...

Twan, the issue was due to the implementation of the CLR runtime as you can see in my post. Although I haven't tested it with EntLib 4.1, I suspect it will still happen. I haven't yet heard back from the Microsoft product team person I was in contact with.

You can still do the integration this way as long as you don't share two appdomains in one process.

Thanks
Hugh

Matthew said...

I just bumped into this today and was stumped until I found this. Excellent post. Please post an update if you hear anything back from the CLR team...