Thursday, January 17, 2008

Simmering Unicode, bring DPL to a boil

Now that most of the ingredients are in the Unicode pot, I'm going to move it to the back burner and turn it down to a low simmer.  We'll come back to it later and give it a few stirs and make sure the seasoning is right.  Meanwhile I want to start work in the next course, DPL, or the Delphi Parallel Library.  Remember this is merely a working name for all the work and research I've been doing around adding more support for parallel/concurrent processing.  DPL, IS NOT and official feature of Tiburón.

As a part of the my research into parallel computing I've also been reading up whole bunch on the notion of lock-free data structures.  These little bits of computer science wonders really help take some of the most common and mundane data structures and algorithms and make them much more friendly to the world of multi-core systems.  However, they're also not a panacea and not all environments lend themselves to making the most out of them.  Some of the best write-ups on lock-free algorithms are the ones from Julian Bucknall of Developer Express.  He has some excellent write-ups on lock-free Queues and Stacks.  Although these examples are written in C# and do depend on the nicely garbage collected universe in which it runs (.NET), the concepts mostly transfer.  In one of the articles Julian even threatened that he'd translated some of these items into Delphi/Win32 code...  Hmm... I wonder if he solved the one somewhat intractable problem about node deletion in the underlying linked lists?  I'm still trying to work on that one myself, so Julian, if you've got something clever, let me know.

While trying to avoid locks is a laudable goal, at some point you're just going to have to use some form of locking.  So I've been working on an implementation of a TMonitor class.  .NET has another nice advantage here that the notion of a Monitor is intrinsic to any object instance in .NET.  Eventually this is how I want the TMonitor to work as well.  So what is a Monitor? It is basically the combination of a lightweight mutual exclusion lock and a condition variable.  In fact the wikipedia article about a condition variable mentions the monitor as being that combination.  Up until Windows Vista, Windows has not had any notion of a condition variable.   Even Microsoft has admitted that the event based technique of using an auto-reset event with the PulseEvent API was fraught with problems.  Even the PulseEvent documentation says you should no longer use it and instead use the new Condition variable APIs.  Cool!  Great!  We now have intrinsic OS support for creating a full TMonitor class, right?  Not so fast.  The problem is that these APIs only appeared in Windows Vista and Server 2008! Grrrrr.  I want my code to work at least with Windows 2000.

So, now I'm stuck with figuring out how to create a monitor without the benefit of the OS.  So, off I went to research the bits an pieces of what I need to do this.  I knew that POSIX threads  (pthreads) have the notion of a condition variable so I wondered how those POSIX Win32 layers implement them.  That was when I stumbled upon this cool article about implementing POSIX condition variables on Win32.  It is a few years old but was also very informative.  So off I went and was able to add a new TConditionVariable class to the SyncObjs.pas unit.  The problem is that the underlying implementation, how shall I say it?, less than modern and is very academic.  The other problem is that it depends on a kernel-level mutex handle as the surrounding lock.  Mutexes are great if you want cross-process mutual exclusions or you need a handle on which to pass to a WaitForMultipleObjectEx call.  Other than that, the real mutual exclusion work-horse is the critical section.  Critical sections are process local, lightweight and fast.  The nice thing about them is that if there is no contention (only one thread tries to acquire the crit-sec at a time), no locking happens.  This means the code executes purely in user-space and no expensive kernel-mode switch is needed.  Another problem is that critical sections are (supposed to be) opaque structures.  Even though the Windows API headers clearly show all the fields and give them nice names like LockCount, RecursionCount, etc... one should not muck with them directly.  If for whatever reason, Microsoft decided to change the structure or change the underlying implementation, any code that directly touched those fields is asking to to be spanked, and spanked hard.

What to do... what to do... While the specific critical section structure and related APIs are opaque, the underlying implementation is fairly straight forward.  The tricky part will be in adding the condition variable-esque aspects, notably the Wait/Notify(Pulse)/NotifyAll(PulseAll) functionality.  Lets start with the mutex (critical section) bits.  We need only three APIs for this, Enter, Leave(Exit), and TryEnter.  Just like the OS' critical section we need a few fields to track the state of the critical section.  Here's the declaration of what is needed so far:

TMonitor = class sealed
strict private
FLockCount: Integer; // Initialized to -1, no lock holders
FRecursionCount: Integer; // Keep track of how many times we've reentered
FOwningThread: Cardinal; // Who currently owns the lock
FSpinCount: Integer; // Special busy-wait spin-lock for multi-proc systems
FLockEvent: THandle; // The auto-reset event on which to wait when in contention
function GetEvent: Cardinal;
function CheckOwningThread: Cardinal;
class var FProcessorCount: Integer;
class function GetProcessorCount: Integer; static;
public
constructor Create; overload;
constructor Create(ASpinCount: Integer); overload;
destructor Destroy; override;

procedure Enter; overload; inline;
function Enter(Timeout: Cardinal): Boolean; overload;
procedure Exit;
function TryEnter: Boolean;
end;

Let's start with the TryEnter function since is the simplest and most straight forward since it never blocks the calling process.  Here it is including the comments! (wow, he commented the code :-).


function TMonitor.TryEnter: Boolean;
begin
// First things first - initially check to see if we can gain ownership
if TInterlocked.CompareExchange(FLockCount, 0, -1) = -1 then
begin
// Yep, got it. Now claim ownership
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
Result := True;
end else if FOwningThread = GetCurrentThreadId then // check for recursion
begin
// We're recursing now, but FLockCount still needs to be guarded
TInterlocked.Increment(FLockCount);
// Only the owning thread can increment this value so no need to guard it
Inc(FRecursionCount);
Result := True;
end else
Result := False;
end;

The first thing we want to do is just try to get the lock.  That is what the CompareExchange does. (oh, BTW the TInterlocked class is just a collection of thin class static functions that wrap the OS interlockedXXXX functions).  CompareExchange is more commonly referred to as a Compare and Swap or CAS for short.  This atomic function is the basis by which nearly all other atomics can be implemented.  Learn it. Understand it. It is that important.  Since the FLockCount is initialized to -1, we try to swap in a 0 if and only if FLockCount is a -1.  This is done as an atomic operation which means while this operation proceeds, no thread switching, or other memory modifications can occur from anywhere (other CPUs, DMA operations, etc.).  The CPU Bus is locked down for the duration of that one operation which is typically a single CPU instruction, CMPXCH on the the x86 platform.  This is extremely important for this all to work.  If CompareExchange returns a -1, then we know that the original value of FLockCount was -1 and the 0 was successfully "swapped" in.  If it returns anything other than -1, someone else has already gotten here first or we already hold the lock.  We check the latter by looking at the FOwningThread field and comparing it to the current thread's Id.  If we already own the lock, then this is a recursion case and we just need to do some housekeeping and return.  Notice that we don't need to use the Interlocked inc on the FRecursionCount field since only the owner of the lock can touch that field.   If some other thread owns the lock, then we simply return indicating that we cannot get the lock.


If TryEnter was the vegatables for this course, then Enter  is the meat.  This is where things can get interesting.  Here's the Enter function in all it's glory.


function TMonitor.Enter(Timeout: Cardinal): Boolean;
var
SpinCount: Integer;
begin
// Get the spin count
if FSpinCount > 0 then
begin
Result := TryEnter;
if Result then
System.Exit;
SpinCount := FSpinCount;
while SpinCount > 0 do
begin
// if there are already waiters, don't bother spinning
if FLockCount > 0 then
Break;
// Try to get the lock
if FLockCount = -1 then
if TInterlocked.CompareExchange(FLockCount, 0, -1) = -1 then
begin
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
Result := True;
System.Exit;
end;
asm
REP NOP // Just do nothing here...
end;
Dec(SpinCount);
// Keep trying until the spin count expires
end;
end;
Result := True;
// Add our count to the lock
if TInterlocked.Increment(FLockCount) > 0 then
begin
// Now check if we're the owner
if FOwningThread = GetCurrentThreadId then
// Yep, we're the owner so just recurse
Inc(FRecursionCount)
else
begin
// We're not the owner, so blocking is needed
// GetEvent does a "safe" allocation of the Event
Result := WaitForSingleObject(GetEvent, Timeout) = WAIT_OBJECT_0;
if Result then
begin
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
end else
// We timed out, remove our presence from the lock count
TInterlocked.Decrement(FLockCount);
end;
end else
begin
FOwningThread := GetCurrentThreadId;
FRecursionCount := 1;
end;
end;

Let's ignore the spin-lock at the beginning of this function for now.  That section deserves to be covered separately.  Starting right after the spin-lock loop, we first try to add our count and/or acquire the lock with a simple interlocked inc of the FLockCount field.  If this lock was not already owned, the result will be 0 (remember it starts at -1) so the else section is taken and we claim ownership of the lock.  If the result is greater than 0, then that means someone else has to the lock or we're in a recursion.  We do the same FOwningThread check as was done in TryEnter and if we already own the lock simply add to the  recursion count.  If we don't own the lock, then we're in contention and have to block.  Unlike TryEnter, Enter either returns and the lock is now owned by the calling thread, or it timed out and never was able to get the lock.  This timeout is something that the OS' critical section does not support.  We'll cover it in a moment, but another neat concurrency trick is in the GetEvent method.   We call WaitForSingleObject on the event handle, which is an auto-reset event, with the timeout the caller specified.  The overloaded/inlined Enter method simply calls this version of Enter with INFINITE as the timeout.  If the wait returns that the object was signaled that means we now own the lock (FLockCount should be >= 0 at this point) and we can claim it by setting the FOwningThread and the initial recursion count.  However, if we timed out and could not get the lock, we simply need to decrement the FLockCount field to remove the fact that we were waiting in line for the lock.


The simplest and easiest to understand function is Exit.  Here's what it looks like:


procedure TMonitor.Exit;
begin
CheckOwningThread;
Dec(FRecursionCount);
if FRecursionCount > 0 then
TInterlocked.Decrement(FLockCount)
else
begin

FOwningThread := 0;
if TInterlocked.Decrement(FLockCount) >= 0 then
SetEvent(GetEvent);
end;
end;

The CheckOwningThread thread is simply a function that ensures that the calling thread does in fact own the lock.  It will raise an exception if that is not the case because Exit is only valid to be called if the lock is owned by the calling thread.  If we do own the lock (an exception wasn't raised), we just decrement the FRecursionCount field.  Remember, this is safe because only the lock owner can touch that field.  Now we check if we still have recursion counts because Enter was called more than once.  If so we just need to remove our presence from the FLockCount field.  If FRecursionCount drops to 0 (we still "own" the lock) we know we need to release the lock and make sure one of the waiters are woken up and given the lock.  We clear the FOwningThread field to relinquish the claim of ownership and then remove the final count on the FLockCount field.  If there are still counts on it the result of the Interlocked decrement will be greater than or equal to 0 which means there is at least one thread blocked waiting for the lock.  In that case we just set the event and then return.


What about the GetEvent method I mentioned above?  This is where our good friend the CAS function really helps us out.  Here's the GetEvent method:


function TMonitor.GetEvent: THandle;
var
Event: THandle;
begin
Result := FLockEvent;
if Result = 0 then
begin
Event := CreateEvent(nil, False, False, nil);
Result := TInterlocked.CompareExchange(Integer(FLockEvent), Event, 0);
if Result = 0 then
// We won! Nobody else was trying to allocate the Event.
Result := Event
else
// Oh Well. We tried. Close the handle if someone got to it first.
CloseHandle(Event);
end;
end;

The first thing we do is simply check if it has already been allocated.  If so, just return it.  If not, we now enter a race to see which thread gets to create the event.  This is an intentional race and if you look closely at the code the result of the CAS operation tells us whether or not we won the race.  So we create the new event object, then try and swap it into the FLockEvent field.  The CAS will make sure FLockEvent is 0 and of so assign the Event value to it.  It will then return the original value of FLockEvent. If by the time the CAS operation is performed, another thread was able to assign the FLockEvent field, then the CAS operation won't change the FLockEvent field.  It will still return the original value of FLockEvent which is assigned to the result.  This leave the dangling Event local variable, which we simply close it to release it back to the OS.  Theoretically, there could be as many events as there are threads running the GetEvent method created, but only one will ever be assigned to the FLockEvent.  Remember, the code within the if block will only execute if at the point of the test, the FLockEvent field isn't assigned.  Once some thread is able to allocate the event, the next time this function simply grabs the value of the field and returns it.


There you have it, we just recreated a lightweight critical section with the same advantages that the OS critical section possess.  It tries to remain in user-space for most things and only uses a kernel object when there is contention for the resource.  Next time I'll cover the spin-lock in the Enter method and explain how it can serve to boost performance on multiple processor (or multi-core) systems.  I'll also present the Wait/Pulse/PulseAll methods.


NOTE: Yes, I know that there is a chance that the CreateEvent call in GetEvent can fail if the system is low on resources. There is an excellent explanation on Joe Duffy's blog about this along with some interesting tidbits about the new condition variables in Vista.  If you're interested in threading, concurrency, and parallelism, Joe's blog is a must-read.  Just today he's posted some random thoughts on handling blocked threads and work schedulers.  Which is rather interesting because all this stuff about a TMonitor class will eventually come back around to my latest foray into creating a thread pool... which is actually along the lines of his #2 commonplace solution.  Until Delphi gets some kind of continuation passing style (CPS) that will probably continue to be the solution.

27 comments:

  1. Just in case it's of any help: SpinLock implementation from one of Slovenian Delphi developers and my test suite for it. At the moment the first link is defunct (the author is travelling around the world and it seems that his home server is down) but the second includes the freshest SpinLock.pas anyway. Also included is a special SpinLock version that supports multiprocess spinlocks.

    What I found out using this implementation:
    - it has a bad worst-case behaviour
    - it is very fast on average
    - it is hard to beat TCriticalSection as it is implemented as a spinlock on newer Windows; you can even set spin count during the initialization (InitializeCriticalSectionAndSpinCount, found on Windows 2000 and newer; there is also a TryEnterCriticalSection on Win2000+)
    - there are big problems with starvation if one thread is constantly locking and unlocking spinlock; TCricialSection is handled better as it increases the priority of the starved thread after some time (or at least that's how it looks from the outside)
    - your comment box is too small ;)

    So maybe - just maybe - using critical section object on Windows 2000 and better is preferrable.

    Feel free to use and abuse my test suite in any way you want.

    ReplyDelete
  2. It seems that somehow I mangled the link to my spinlock test suite in my previous comment. The correct URL is http://www.gabrijelcic.org/testSpinLock.zip.

    ReplyDelete
  3. You have got the Dev Express web site wrong :)

    Its www.devexpress.com and not www.developerexpress.com

    ReplyDelete
  4. I have a question. Is it safe to read an object in one thread while another thread is writing to that same object?

    In your TryEnter code, you do an "if FOwningThread = GetCurrentThreadId then". There is a chance that another thread could be writing to FOwningThread while this comparison is taking place. Would that not cause a problem? This occurs a couple of times with FOwningThread.

    ReplyDelete
  5. gabr,

    Lee Nover is in Brazil, enjoying the beachs here... until jan 28

    []s

    ReplyDelete
  6. @Cesar Romero: Yes, I know. Tell him his server is down if you see him :)

    ReplyDelete
  7. To comment on Dean : Yes, this _will_ be a problem. I've witnessed it loads of times in our timing-critical code.

    The only solution for this is, to atomically claim the lock in one go.
    Special care needs to be taken however, to differentiate a recursive lock and a 'new' lock :


    function TMonitor.TryEnter: Boolean;
    var
    StartingOwner, Me, NewOwner: TThreadID;
    begin
    // Remember current owner :
    StartingOwner := FOwningThread;
    // Only call GetCurrentThreadId once :
    Me := GetCurrentThreadId;
    // Try to take ownership :
    NewOwner := TInterlocked.CompareExchange(FOwningThread, StartingOwner, Me);

    // Now determine if i'm the owner :
    Result := (NewOwner = Me);
    if not Result then
    Exit;

    // I'm the owner, but was that already the case?
    if StartingOwner = Me then
    // recursive case :
    Inc(FRecursiveCount)
    else
    FRecursiveCount := 1;

    TInterlocked.Increment(FLockCount);
    end;

    Note that NewOwner is just here to make things more clear, this will equally work :


    // Try to take ownership :
    Result := (TInterlocked.CompareExchange(FOwningThread, StartingOwner, Me) = Me);
    if not Result then
    Exit;

    ReplyDelete
  8. Patrick,

    Until the lock is obtained, nobody else can write to the FLockOwner. There is no need to atomically try to read the FLockOwner field. Also, as long as the field is aligned (it will be), the read will be atomic anyway.

    Allen.

    ReplyDelete
  9. Gabr,

    I suppose you didn't read the post completely. It was not about simply recreating a critical section. If that is all you need, then you should use the OS' implementation. I'm trying to implement a full monitor class with condition variable semantics. That is not available on any Windows platform prior to Vista and Win2008. Unless the monitor is created with a spincount, it won't spin-lock. It also ensures that if the monitor is on a single CPU system it doesn't ever initialize the spincount.

    Allen.

    ReplyDelete
  10. No, no. Is the smae old thing.

    Please take a look at

    http://www.defmacro.org/ramblings/concurrency.html

    This is truly innovating (that its, if is implemented in a imperative language!).

    I try to advance it, but lack the magic skills for it:

    http://groups.google.com.co/group/borland.public.delphi.internet.winsock/browse_thread/thread/203d631334c7b03f/9e6d58de23daae87?lnk=st&q=delphi+erlang&rnum=8#9e6d58de23daae87

    I thin that the erlang style is:

    - Far more scalable. We are in the era of internet, not single workstations

    - Far more simple. Not need to mess with all the troubles of locks.

    - Conceptually simple

    I have more than 10+ years of experience. I can do pointers and a lot of tricks but not multi-threading. Is too hard!.

    So, not go the .net way. Is not intuitive and anyway is a wrapper around the same old history. In the multi-core era, is more clever trun thread in cheap process and do serialization.

    ReplyDelete
  11. Alan,

    Is there a chance you can post the code, like you did with your other multithreading posts?

    Thanks.

    ReplyDelete
  12. mamcx,

    I'm fully aware of the benefits and also the challenges of taking some of the ideas from languages such as Erlang. However, those are long-lead items and require significant internal reworking of both the libraries and the compiler.

    If you read Joe Duffy's latest blog post, he mentions some of the problems with trying to add true CPS to the .NET platform. The same holds true for native code with respect to how the stack is managed and the how whole state is saved off.

    That being said, there is also some ongoing internal research about doing something very close to what you're talking about. Most of this work has to be done in stages as we work to evolve toward some of these same goals.

    Allen.

    ReplyDelete
  13. Excellent. Hope this way get the attention that deserve it. Could be a boost similar in how the VCL provide a nicer view of the Win32 api.

    ReplyDelete
  14. Allen,

    I was referring to the FOwningThread while a lock was being established. If a second thread tried to obtain a lock immediately after the first, the first could be busy setting the FOwningThread while the second was comparing it to it's own thread ID.

    Is there something special you are doing about ensuring that the reads are atomic (field aligned) as I do a lot of multithreaded stuff and have ensured that writes and reads don't conflict. This sort of optimisation could make a big difference.

    BTW: This whole article has been very enlightning. So much so that I started searching the VCL for TInterlocked so I could start updating some of my code :)

    Thanks

    Dean

    ReplyDelete
  15. Dean,

    Yes, the compiler and the memory manager do try to make sure alignments are maintained. This is critical to ensuring that reads don't happen across natural alignment boundaries. If you can prove that there is an issue with reads being corrupted because of this, I'd be very interested. However I suspect that if that were the case, then Window's own critical section would be headed for failure as well :-).

    Allen.

    ReplyDelete
  16. I have never seen it happen, I just always protect against it. I remember reading an article when I first started doing multithreaded development that indicated that all accesses that potentially involve a write should be protected. Perhaps Thorsten (from Nexus) has seen something.

    ReplyDelete
  17. Dean,

    For the x86 family of CPUs, aligned reads should always be atomic (not counting some of the newer SSE instructions). This is not necessarily the case for all CPUs. Many RISC based systems have special "memory barrier" instructions to guard against these problems and other issues with instruction re-ordering. Itanium, ARM, and others do require extra instructions to ensure atomicity.

    Allen.

    ReplyDelete
  18. That makes sense. It often helps to understand the underlying OS and even hardware when developing software as these sorts of things can have a big impact on designing/developing a solution.

    Thanks for the info.

    Dean

    ReplyDelete
  19. @Allen re "Unless the monitor is created with a spincount, it won’t spin-lock. It also ensures that if the monitor is on a single CPU system it doesn’t ever initialize the spincount."

    You're right - I missed that. Thanks for pointing it out.

    ReplyDelete
  20. Allen

    Deletion is easy. The biggest issue is not the deletion, it's getting rid of the node afterwards. Now that's hard. Which of course is why I rely on the good old GC in my C# code.

    I wrote an article for The Delphi Magazine on hazard pointers, which is one technique for disposing of the nodes. You can still get the code there. Issue 137. Or send me an email for the article as well.

    Cheers, Julian

    ReplyDelete
  21. Allen

    Now I reread the whole thing (two times) - and I think that the problem was not in 'reading to the end' but 'missing the point of the exercise'. What I did not understand at the first time is that this is not the complete solution yet - you still have to plug Notify(etc) parts in. I still don't understand why you could not use ordinary critical section for protection access and why you have to recreate its behaviour, but that will supposedly become clearer when you write more code.

    I still want to point out something I mentioned before but what may get lost in the other noise I produced - when using such spinlocks in practical applications, I often found that there is bigger risk of thread starvation than when using critical sections when one thread is executing very small code fragment between locks (i.e. when spin lock was replaced with critical section, the problem went away). Yes, the locking in this case should be removed completely and it will be as soon as I find a single producer-single consumer lock-free queue that really works :(

    Anyway, it seems that Windows fight against starvation in such cases. I don't know how, really, but it probably happens inside the kernel in Wait code anyway.

    Still, when I look at your Enter, it seems that in such cases all waiting threads will eventually land inside the Wait anyway and Windows will sort them out so that starvation should not occur. Seemingly, Lee's spinlock implementation does this in a less optimal way which causes a problem in such scenario.

    ReplyDelete
  22. All this prove the point that the best way is simply get ride of the problem: No locks, not shared acces= No bugs.

    ReplyDelete
  23. @mamcx: Give me a lock-free framework and I'll be happy to use it ...

    ReplyDelete
  24. Julian,

    I was abiguous. Sorry about that. By "deletion" I meant deallocating the node memory, ie. "freeing." I'll look up that article. Thanks for the reference.

    Allen.

    ReplyDelete
  25. Allen,

    "For the x86 family of CPUs, aligned reads should always be atomic"...

    I checked with Intel's instruction reference and software development manuals. You're right, reading/writing *) a byte, *) a word aligned to a 16-bit boundary or *) a dword on a 32-bit boundary etc. are guaranteed to be atomic.

    Also the XCHG instruction is atomic, but CMPXCHG isn't -- unless you explicitly prefix it with LOCK. Same applies to bit-based operations like BTC.

    ReplyDelete
  26. Philipp,

    Yes, all of the InterlockedXXXX functions do prefix the instructions used to implement them with LOCK. The TInterlocked.Add() function would be something like LOCK ADD [Target],EAX.

    What LOCK means is to tell the CPU to assert a BusLock signal which tells any other CPU, DMA (Direct Memory Access) chip, Video, PCI, and any other subsystem that the memory bus is not available for any other operation. LOCK can be expensive if overused.

    Allen.

    ReplyDelete
  27. [...] reading the article of “The Oracle at Delphi” (Allen Bauer), Oracle is all I understand The article [...]

    ReplyDelete

Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.