Thursday, January 17, 2008

Simmering Unicode, bring DPL to a boil

Now that most of the ingredients are in the Unicode pot, I'm going to move it to the back burner and turn it down to a low simmer.  We'll come back to it later and give it a few stirs and make sure the seasoning is right.  Meanwhile I want to start work in the next course, DPL, or the Delphi Parallel Library.  Remember this is merely a working name for all the work and research I've been doing around adding more support for parallel/concurrent processing.  DPL, IS NOT and official feature of Tiburón.

As a part of the my research into parallel computing I've also been reading up whole bunch on the notion of lock-free data structures.  These little bits of computer science wonders really help take some of the most common and mundane data structures and algorithms and make them much more friendly to the world of multi-core systems.  However, they're also not a panacea and not all environments lend themselves to making the most out of them.  Some of the best write-ups on lock-free algorithms are the ones from Julian Bucknall of Developer Express.  He has some excellent write-ups on lock-free Queues and Stacks.  Although these examples are written in C# and do depend on the nicely garbage collected universe in which it runs (.NET), the concepts mostly transfer.  In one of the articles Julian even threatened that he'd translated some of these items into Delphi/Win32 code...  Hmm... I wonder if he solved the one somewhat intractable problem about node deletion in the underlying linked lists?  I'm still trying to work on that one myself, so Julian, if you've got something clever, let me know.

While trying to avoid locks is a laudable goal, at some point you're just going to have to use some form of locking.  So I've been working on an implementation of a TMonitor class.  .NET has another nice advantage here that the notion of a Monitor is intrinsic to any object instance in .NET.  Eventually this is how I want the TMonitor to work as well.  So what is a Monitor? It is basically the combination of a lightweight mutual exclusion lock and a condition variable.  In fact the wikipedia article about a condition variable mentions the monitor as being that combination.  Up until Windows Vista, Windows has not had any notion of a condition variable.   Even Microsoft has admitted that the event based technique of using an auto-reset event with the PulseEvent API was fraught with problems.  Even the PulseEvent documentation says you should no longer use it and instead use the new Condition variable APIs.  Cool!  Great!  We now have intrinsic OS support for creating a full TMonitor class, right?  Not so fast.  The problem is that these APIs only appeared in Windows Vista and Server 2008! Grrrrr.  I want my code to work at least with Windows 2000.

So, now I'm stuck with figuring out how to create a monitor without the benefit of the OS.  So, off I went to research the bits an pieces of what I need to do this.  I knew that POSIX threads  (pthreads) have the notion of a condition variable so I wondered how those POSIX Win32 layers implement them.  That was when I stumbled upon this cool article about implementing POSIX condition variables on Win32.  It is a few years old but was also very informative.  So off I went and was able to add a new TConditionVariable class to the SyncObjs.pas unit.  The problem is that the underlying implementation, how shall I say it?, less than modern and is very academic.  The other problem is that it depends on a kernel-level mutex handle as the surrounding lock.  Mutexes are great if you want cross-process mutual exclusions or you need a handle on which to pass to a WaitForMultipleObjectEx call.  Other than that, the real mutual exclusion work-horse is the critical section.  Critical sections are process local, lightweight and fast.  The nice thing about them is that if there is no contention (only one thread tries to acquire the crit-sec at a time), no locking happens.  This means the code executes purely in user-space and no expensive kernel-mode switch is needed.  Another problem is that critical sections are (supposed to be) opaque structures.  Even though the Windows API headers clearly show all the fields and give them nice names like LockCount, RecursionCount, etc... one should not muck with them directly.  If for whatever reason, Microsoft decided to change the structure or change the underlying implementation, any code that directly touched those fields is asking to to be spanked, and spanked hard.

What to do... what to do... While the specific critical section structure and related APIs are opaque, the underlying implementation is fairly straight forward.  The tricky part will be in adding the condition variable-esque aspects, notably the Wait/Notify(Pulse)/NotifyAll(PulseAll) functionality.  Lets start with the mutex (critical section) bits.  We need only three APIs for this, Enter, Leave(Exit), and TryEnter.  Just like the OS' critical section we need a few fields to track the state of the critical section.  Here's the declaration of what is needed so far:

TMonitor = class sealed
strict private
FLockCount: Integer; // Initialized to -1, no lock holders
FRecursionCount: Integer; // Keep track of how many times we've reentered
FOwningThread: Cardinal; // Who currently owns the lock
FSpinCount: Integer; // Special busy-wait spin-lock for multi-proc systems
FLockEvent: THandle; // The auto-reset event on which to wait when in contention
function GetEvent: Cardinal;
function CheckOwningThread: Cardinal;
class var FProcessorCount: Integer;
class function GetProcessorCount: Integer; static;
public
constructor Create; overload;
constructor Create(ASpinCount: Integer); overload;
destructor Destroy; override;

procedure Enter; overload; inline;
function Enter(Timeout: Cardinal): Boolean; overload;
procedure Exit;
function TryEnter: Boolean;
end;

Let's start with the TryEnter function since is the simplest and most straight forward since it never blocks the calling process.  Here it is including the comments! (wow, he commented the code :-).


function TMonitor.TryEnter: Boolean;
begin
// First things first - initially check to see if we can gain ownership
if TInterlocked.CompareExchange(FLockCount, 0, -1) = -1 then
begin
// Yep, got it. Now claim ownership
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
Result := True;
end else if FOwningThread = GetCurrentThreadId then // check for recursion
begin
// We're recursing now, but FLockCount still needs to be guarded
TInterlocked.Increment(FLockCount);
// Only the owning thread can increment this value so no need to guard it
Inc(FRecursionCount);
Result := True;
end else
Result := False;
end;

The first thing we want to do is just try to get the lock.  That is what the CompareExchange does. (oh, BTW the TInterlocked class is just a collection of thin class static functions that wrap the OS interlockedXXXX functions).  CompareExchange is more commonly referred to as a Compare and Swap or CAS for short.  This atomic function is the basis by which nearly all other atomics can be implemented.  Learn it. Understand it. It is that important.  Since the FLockCount is initialized to -1, we try to swap in a 0 if and only if FLockCount is a -1.  This is done as an atomic operation which means while this operation proceeds, no thread switching, or other memory modifications can occur from anywhere (other CPUs, DMA operations, etc.).  The CPU Bus is locked down for the duration of that one operation which is typically a single CPU instruction, CMPXCH on the the x86 platform.  This is extremely important for this all to work.  If CompareExchange returns a -1, then we know that the original value of FLockCount was -1 and the 0 was successfully "swapped" in.  If it returns anything other than -1, someone else has already gotten here first or we already hold the lock.  We check the latter by looking at the FOwningThread field and comparing it to the current thread's Id.  If we already own the lock, then this is a recursion case and we just need to do some housekeeping and return.  Notice that we don't need to use the Interlocked inc on the FRecursionCount field since only the owner of the lock can touch that field.   If some other thread owns the lock, then we simply return indicating that we cannot get the lock.


If TryEnter was the vegatables for this course, then Enter  is the meat.  This is where things can get interesting.  Here's the Enter function in all it's glory.


function TMonitor.Enter(Timeout: Cardinal): Boolean;
var
SpinCount: Integer;
begin
// Get the spin count
if FSpinCount > 0 then
begin
Result := TryEnter;
if Result then
System.Exit;
SpinCount := FSpinCount;
while SpinCount > 0 do
begin
// if there are already waiters, don't bother spinning
if FLockCount > 0 then
Break;
// Try to get the lock
if FLockCount = -1 then
if TInterlocked.CompareExchange(FLockCount, 0, -1) = -1 then
begin
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
Result := True;
System.Exit;
end;
asm
REP NOP // Just do nothing here...
end;
Dec(SpinCount);
// Keep trying until the spin count expires
end;
end;
Result := True;
// Add our count to the lock
if TInterlocked.Increment(FLockCount) > 0 then
begin
// Now check if we're the owner
if FOwningThread = GetCurrentThreadId then
// Yep, we're the owner so just recurse
Inc(FRecursionCount)
else
begin
// We're not the owner, so blocking is needed
// GetEvent does a "safe" allocation of the Event
Result := WaitForSingleObject(GetEvent, Timeout) = WAIT_OBJECT_0;
if Result then
begin
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
end else
// We timed out, remove our presence from the lock count
TInterlocked.Decrement(FLockCount);
end;
end else
begin
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
end;
end;

Let's ignore the spin-lock at the beginning of this function for now.  That section deserves to be covered separately.  Starting right after the spin-lock loop, we first try to add our count and/or acquire the lock with a simple interlocked inc of the FLockCount field.  If this lock was not already owned, the result will be 0 (remember it starts at -1) so the else section is taken and we claim ownership of the lock.  If the result is greater than 0, then that means someone else has to the lock or we're in a recursion.  We do the same FOwningThread check as was done in TryEnter and if we already own the lock simply add to the  recursion count.  If we don't own the lock, then we're in contention and have to block.  Unlike TryEnter, Enter either returns and the lock is now owned by the calling thread, or it timed out and never was able to get the lock.  This timeout is something that the OS' critical section does not support.  We'll cover it in a moment, but another neat concurrency trick is in the GetEvent method.   We call WaitForSingleObject on the event handle, which is an auto-reset event, with the timeout the caller specified.  The overloaded/inlined Enter method simply calls this version of Enter with INFINITE as the timeout.  If the wait returns that the object was signaled that means we now own the lock (FLockCount should be >= 0 at this point) and we can claim it by setting the FOwningThread and the initial recursion count.  However, if we timed out and could not get the lock, we simply need to decrement the FLockCount field to remove the fact that we were waiting in line for the lock.


The simplest and easiest to understand function is Exit.  Here's what it looks like:


procedure TMonitor.Exit;
begin
CheckOwningThread;
Dec(FRecursionCount);
if FRecursionCount > 0 then
TInterlocked.Decrement(FLockCount)
else
begin

FOwningThread := 0;
if TInterlocked.Decrement(FLockCount) >= 0 then
SetEvent(GetEvent);
end;
end;

The CheckOwningThread thread is simply a function that ensures that the calling thread does in fact own the lock.  It will raise an exception if that is not the case because Exit is only valid to be called if the lock is owned by the calling thread.  If we do own the lock (an exception wasn't raised), we just decrement the FRecursionCount field.  Remember, this is safe because only the lock owner can touch that field.  Now we check if we still have recursion counts because Enter was called more than once.  If so we just need to remove our presence from the FLockCount field.  If FRecursionCount drops to 0 (we still "own" the lock) we know we need to release the lock and make sure one of the waiters are woken up and given the lock.  We clear the FOwningThread field to relinquish the claim of ownership and then remove the final count on the FLockCount field.  If there are still counts on it the result of the Interlocked decrement will be greater than or equal to 0 which means there is at least one thread blocked waiting for the lock.  In that case we just set the event and then return.


What about the GetEvent method I mentioned above?  This is where our good friend the CAS function really helps us out.  Here's the GetEvent method:


function TMonitor.GetEvent: THandle;
var
Event: THandle;
begin
Result := FLockEvent;
if Result = 0 then
begin
Event := CreateEvent(nil, False, False, nil);
Result := TInterlocked.CompareExchange(Integer(FLockEvent), Event, 0);
if Result = 0 then
// We won! Nobody else was trying to allocate the Event.
Result := Event
else
// Oh Well. We tried. Close the handle if someone got to it first.
CloseHandle(Event);
end;
end;

The first thing we do is simply check if it has already been allocated.  If so, just return it.  If not, we now enter a race to see which thread gets to create the event.  This is an intentional race and if you look closely at the code the result of the CAS operation tells us whether or not we won the race.  So we create the new event object, then try and swap it into the FLockEvent field.  The CAS will make sure FLockEvent is 0 and of so assign the Event value to it.  It will then return the original value of FLockEvent. If by the time the CAS operation is performed, another thread was able to assign the FLockEvent field, then the CAS operation won't change the FLockEvent field.  It will still return the original value of FLockEvent which is assigned to the result.  This leave the dangling Event local variable, which we simply close it to release it back to the OS.  Theoretically, there could be as many events as there are threads running the GetEvent method created, but only one will ever be assigned to the FLockEvent.  Remember, the code within the if block will only execute if at the point of the test, the FLockEvent field isn't assigned.  Once some thread is able to allocate the event, the next time this function simply grabs the value of the field and returns it.


There you have it, we just recreated a lightweight critical section with the same advantages that the OS critical section possess.  It tries to remain in user-space for most things and only uses a kernel object when there is contention for the resource.  Next time I'll cover the spin-lock in the Enter method and explain how it can serve to boost performance on multiple processor (or multi-core) systems.  I'll also present the Wait/Pulse/PulseAll methods.


NOTE: Yes, I know that there is a chance that the CreateEvent call in GetEvent can fail if the system is low on resources. There is an excellent explanation on Joe Duffy's blog about this along with some interesting tidbits about the new condition variables in Vista.  If you're interested in threading, concurrency, and parallelism, Joe's blog is a must-read.  Just today he's posted some random thoughts on handling blocked threads and work schedulers.  Which is rather interesting because all this stuff about a TMonitor class will eventually come back around to my latest foray into creating a thread pool... which is actually along the lines of his #2 commonplace solution.  Until Delphi gets some kind of continuation passing style (CPS) that will probably continue to be the solution.