Tuesday, January 22, 2008

Breaking the rules

Sometimes you just have to break the rules in order to learn something new. It is inherent in everyone, from the first time you ignored your parents about getting burned by the hot stove, to doing 10+ over the speed limit because "you can." Even though I mentioned in this post that you should never go mucking with the fields of a critical section structure, I just could not keep my grimy little hands off :-). One reason is that in Windows Vista, Microsoft introduced new condition variable APIs. My previous posts were about creating a full monitor object, which is the combination of a mutual-exclusion lock (critical section) and a condition variable (wait/notify/notifyall). However, there are cases where you would want to use more than one condition variable with the same lock. That is what these new Windows APIs allow you to do. After reading as much as I could about the underlying implementation of these new bits of synchronization goodness, I found that they use a relatively new internal set of undocumented APIs for "keyed events." You can read this post by Joe Duffy for a high-level explanation.

What if I could "backfill" these new APIs for older versions of Windows within the DPL? I would need to break some rules in order to do that. I'd also have to forego any use of keyed events, since those are only available on Windows XP and up (and not documented either), so Windows 2000 would be out of luck. Let's get started with this hack-fest. For now, let's ignore the "slim" reader/writer locks.

There are four key APIs to duplicate and one structure.

type
  PRTLConditionVariable = ^TRTLConditionVariable;
  _RTL_CONDITION_VARIABLE = record
    Ptr: Pointer;
  end;
  TRTLConditionVariable = _RTL_CONDITION_VARIABLE;

procedure InitializeConditionVariable(out ConditionVariable: TRTLConditionVariable);
procedure WakeConditionVariable(var ConditionVariable: TRTLConditionVariable);
procedure WakeAllConditionVariable(var ConditionVariable: TRTLConditionVariable);
function SleepConditionVariableCS(var ConditionVariable: TRTLConditionVariable;
  var CriticalSection: TRTLCriticalSection; dwMilliseconds: DWORD): BOOL;


The condition variable structure has one field, which is not much space to store state information. My first thought was to simply allocate a structure that would be pointed to by the single field of the condition variable. The problem is there is no Delete/Destroy/FreeConditionVariable API. How would this structure be freed? Rats! That's a bust. This is where it is time to get all geeky and put on the propeller cap. If you recall from my discussion of the TMonitor, the Wait method would insert a little structure into a linked list that represented all the current waiters. We need to do something similar. There is a little twist here as well, because WakeConditionVariable and WakeAllConditionVariable are not required to be called from within the lock. In fact, all they take is the TRTLConditionVariable structure as the lone parameter. This makes managing that internal list of waiters a lot trickier, since we're not guaranteed to be in a lock while we need to update that list. This is compounded by the fact that the condition variable structure has only one pointer-sized field. My next thought was to use a lock-free queue. The problem here is that we need two pointers, one to the head of the queue and one to the tail of the queue. I only have one pointer. Before we get into SleepConditionVariableCS, where we're going to do some unholy things to the poor critical section structure, let's get the wait queue out of the way.


To set the stage, we need a little review of field alignment. Alignment of the fields of a structure, the local variables on the stack, and the data on the heap is very important to making sure reads and writes are fast and atomic. Packing a structure in a way that knocks the fields out of their "natural" alignment can sometimes have unwanted side-effects. Yes, there are times where you will get a few "dead bytes" between fields. A double-word-sized entity on 32-bit x86 is 4 bytes in size. This means that the "natural" alignment of this entity is on even 4-byte boundaries. As long as the entity falls on those boundaries, you know that the lowest 2 bits of any pointer to that entity will always be 0. We can exploit that fact with a bit of (OMG! is he going to do what I think he's going to do!!???) propeller-head bit twiddling. Now before you scroll down and hit the comment button and fire up the flame thrower, please bear with me.


Just like the Wait method of the TMonitor class, SleepConditionVariableCS will simply place a structure allocated on the stack into a queue right before releasing the crit-sec, and block on an event. Because interacting with the queue does not use any inherent locking, we need a thread-safe queue. We also only have that one field. How do we "lock" the list while it is being updated? The key is using one of those extra bits that we know will always be 0. If one of those bits is not 0, the list is locked. Here are the LockQueue and UnlockQueue functions:


type
  PWaitingThread = ^TWaitingThread;
  TWaitingThread = record
    Next: PWaitingThread;
    Thread: THandle; // owning thread, assigned in SleepConditionVariableCS below
    WaitEvent: THandle;
  end;

// UPDATE: Slight modification based on comment
function LockQueue(var ConditionVariable: TRTLConditionVariable): PWaitingThread;
var
  ReleaseSlice: Boolean;
begin
  ReleaseSlice := GetProcessorCount < 2; // This local could be eliminated by using a global
  while True do
  begin
    Result := PWaitingThread(Integer(ConditionVariable.Ptr) and not 1);
    if TInterlocked.CompareExchange(Integer(ConditionVariable.Ptr),
      Integer(Result) or 1, Integer(Result)) = Integer(Result) then
      Break;
    if ReleaseSlice then
      SwitchToThread
    else
      asm
        PAUSE
      end;
  end;
end;

procedure UnlockQueue(var ConditionVariable: TRTLConditionVariable; WaitQueue: PWaitingThread);
begin
  ConditionVariable.Ptr := Pointer(Integer(WaitQueue) and not 1);
end;

It should be immediately obvious that the LockQueue function will never return until it can successfully lock the queue. It should also be noted that it will spin-lock indefinitely, which can burn lots of cycles. On a single-processor system, doing a busy-wait is not cool, so we should just release this thread's timeslice and let the scheduler get another thread to run. (Sleep(0) is not appropriate since that only allows threads with equal or greater priority to run; SwitchToThread is far more fair and much more like a traditional "yield.") Now that we have the tail of the linked list returned from LockQueue, we can manipulate the queue all we want. We just cannot "touch" the ConditionVariable.Ptr field and can only operate with the value returned from LockQueue. Once we're done, we unlock the queue by passing the potentially updated tail pointer to UnlockQueue, which will simply update the Ptr field of the ConditionVariable. As for the performance implications: in theory, the queue should never grow very deep (one entry per waiting thread per condition variable), so the queue operations should be fast enough to not adversely affect performance (although I know it will a little bit). A final note about this technique is that it does not allow any type of recursion, so you must make sure there are no calls that could call LockQueue on the same condition variable while the lock is held.


If you haven't lost your lunch or broken out in hives by now, let's move on to the SleepConditionVariableCS function. It is commented similarly to the TMonitor.Wait function.


function SleepConditionVariableCS(var ConditionVariable: TRTLConditionVariable;
  var CriticalSection: TRTLCriticalSection; dwMilliseconds: DWORD): BOOL;
var
  WaitingThread: TWaitingThread;
  RecursionCount: Integer;
begin
  if CriticalSection.OwningThread = GetCurrentThreadId then
  begin
    WaitingThread.Next := nil;
    WaitingThread.Thread := CriticalSection.OwningThread;
    WaitingThread.WaitEvent := CreateEvent(nil, False, False, nil);
    try
      // Save the current recursion count
      RecursionCount := CriticalSection.RecursionCount;
      // Add the current thread to the waiting queue
      QueueWaiter(ConditionVariable, WaitingThread);
      // Set it back to almost released
      CriticalSection.RecursionCount := 1;
      TInterlocked.Add(CriticalSection.LockCount, -(RecursionCount - 1));
      // Release and get in line for someone to do a Wake or WakeAll
      LeaveCriticalSection(CriticalSection);
      // This is, admittedly, a potential race condition... what do you think?
      Result := WaitForSingleObject(WaitingThread.WaitEvent, dwMilliseconds) = WAIT_OBJECT_0;
      // Got to get the lock back and block waiting for it.
      EnterCriticalSection(CriticalSection);
      // Remove any dangling waiters from the list
      RemoveWaiter(ConditionVariable, WaitingThread);
      // Let's restore all the recursion and lock counts
      TInterlocked.Add(CriticalSection.LockCount, RecursionCount - 1);
      CriticalSection.RecursionCount := RecursionCount;
    finally
      CloseHandle(WaitingThread.WaitEvent);
    end;
  end else
    Result := False; // should call SetLastError with the appropriate error code
end;

Upon entry, we make sure that the calling thread owns the lock and, if not, return False (and as the comment suggests, we should call SetLastError with the appropriate error code). Once we determine that the calling thread owns the lock, we need to start the process of tearing down the critical section and blocking on the wait event. The main difference between TMonitor.Wait and this function is that we still want to make sure that any threads waiting for the critical section are properly woken up, so we first tear down the critical section to within 1 count of being released, then call LeaveCriticalSection(), which will do the final deed and unblock any waiters. We then immediately block on the wait event. Once we return from the wait, either by being signaled or by timing out, we have to re-acquire the critical section. This will only be at a recursion level of 1, so we now need to restore any recursions to it. The QueueWaiter and RemoveWaiter functions use the LockQueue and UnlockQueue functions above to lock and manage the condition variable's waiting-thread queue. Like the TMonitor.Pulse and TMonitor.PulseAll functions, the Wake and WakeAll functions are very simple.


procedure WakeConditionVariable(var ConditionVariable: TRTLConditionVariable);
var
  WaitingThread: PWaitingThread;
begin
  WaitingThread := DequeueWaiter(ConditionVariable);
  if WaitingThread <> nil then
    SetEvent(WaitingThread.WaitEvent);
end;

procedure WakeAllConditionVariable(var ConditionVariable: TRTLConditionVariable);
var
  WaitQueue, WaitingThread: PWaitingThread;
begin
  WaitQueue := LockQueue(ConditionVariable);
  try
    WaitingThread := DequeueWaiterNoLock(WaitQueue);
    while WaitingThread <> nil do
    begin
      SetEvent(WaitingThread.WaitEvent);
      WaitingThread := DequeueWaiterNoLock(WaitQueue);
    end;
  finally
    UnlockQueue(ConditionVariable, WaitQueue);
  end;
end;

For the WakeAllConditionVariable function, we want to atomically wake up all current waiters and don't want any intervening threads to be injected into the queue. So we have a special no-lock version of DequeueWaiter which takes only the tail pointer passed by reference (a "var" parameter). It will update the tail pointer appropriately, so when UnlockQueue is called, it should be nil. In fact, DequeueWaiter simply locks the queue, calls DequeueWaiterNoLock, then unlocks the queue.


I'll totally understand if you have the sudden urge to take a shower, or to stick red-hot needles into your eyeballs... but sometimes it is just a little fun to break the rules (evil grin :-).  I will not be surprised if this shows up on Coding Horror or The Daily WTF for public ridicule :-)  As for an alternative implementation, all I want is an efficient user-level implementation of keyed events.  That would eliminate the queue and make this function work more like the Vista version.