Friday, August 23, 2013

Monitoring the Monitor

No, not this Monitor, this Monitor. Ever since it’s introduction, it’s been the subject of both scorn and praise… ok, ok, scorn. Recently, during our current field test for the next version of RAD Studio*, one of our long-time field-testers posted a reference to this excellent blog post by Eric Grange relating to the performance of the TMonitor vs. the Windows Critical Section. In that post, Eric makes a strong case against TMonitor. Since I’m somewhat personally invested in the TMonitor (read: I wrote it), I took that post as a challenge. I would have responded directly were it not for the fact that he’s closed comments for that post. No biggie. For the record, while Eric and I don’t always agree, he always has well reasoned and thorough arguments.

Eric published a simple test case to demonstrate what he was talking about. I took that test case and began to dig into it. First of all I looked into what Windows was doing in the critical section code down in the kernel. A quick spelunk with the CPU view was rather revealing. In this specific case, the critical section wasn’t ever actually blocking on a kernel-level object. In other words it was doing a busy-wait. Under the high-contention test case that Eric presented, the busy-wait was the magic that served to blow TMonitor out of the water! As a baseline here’s my result with Eric’s app unchanged:

Once I determined that a spin-lock was in play, it all became much clearer. TMonitor can (and always has) optionally do a busy-wait before blocking. You must manually set the spin count on a specific TMonitor instance by calling TMonitor.SetSpinCount(Obj, nnnn). So, the first thing I did was to add the following to the FormCreate() event handler:
  System.TMonitor.SetSpinCount(Self, 100000);

With that single change, suddenly things got a whole lot better:

We’re on the right track here… but not perfect. I experimented with various spin counts, but above 1000 there were diminishing returns. So, something else must be in play here. I decided to experiment with a raw spin-lock. Luckily Delphi already has one, System.SyncObjs.TSpinLock, to be exact. Instead of using a TMonitor, I changed the code to use a TSpinLock. Whoa! Look at these results:

Now we’re gettin' somewhere! Hmmm… so what is the difference here? Setting the spin count on TMonitor should be roughly equivalent to the raw TSpinLock… so looked at the code for each and then it hit me! Exponential backoff! Duh! The TSpinLock implements an exponential backoff that will yield or sleep every X iterations.** TMonitor doesn’t implement an exponential backoff for it’s busy-wait implementation. What is interesting is that the TMonitor does have an internal TSpinWait record that does implement exponential backoff. So I made a very small modification to TMonitor to now use the TSpinWait record within the TMonitor.Enter(Timeout: Cardinal); method within the System unit.

Added the following local var declaration:

  SpinWait: TSpinWait;

Added/Removed the following lines

  if SpinCount > 0 then
    StartCount := GetTickCount;
    SpinWait.Reset; //  Added
    while SpinCount > 0  do
    YieldProcessor; // Removed
    SpinWait.SpinCycle; // Added
    // Keep trying until the spin count expires

Now here’s the final result:

Hmm… I’d say that’s darn near the same. While I cannot say for certain, but I imagine the difference in the graphic patterns could be attributable to a difference in the exponential backoff algorithm and parameters. I also reduced the TMonitor spin count to 500 and got the same results. On my machine anything over 500 didn’t make any difference. Below about 250, it started doing kernel-level blocks and performance dropped.

With just that one simple change, TMonitor is matching and even sometimes now exceeding the performance of the Windows critical section. On the whole, I’d declare them equivalent. This does, however, mean you will need to explicitly set the spin count on the TMonitor to some value. Normally the “right” value is a guess based on empirical testing for your specific use case. I suspect the Windows critical section code selects some value based on the number of cores and CPU speed. At some point, I'll need to research an algorithm that can automatically select a reasonable spin count.

There is one thing I can say that TMonitor has over the Windows critical section… it works on all platforms (Windows, MacOS, iOS, and soon Android). Feel free to use whichever one you feel more comfortable using for your specific situation. Me? I will fully admit I’m biased…

*Yes, there is a field test going on and if you already have XE4 or purchase XE4, you become eligible to get access. You can go here to request access.

**This is not the same as blocking, it only forces the current thread to relinquish it’s current quanta (time slice) to allow another thread to run. This only adds thread-switching overhead without the added kernel object blocking overhead.