The Oracle at Delphi

Tuesday, March 18, 2008

When being "objective" is being "biased"

I did the most unconscionable thing a few weeks ago. Something I've resisted for a very long time. I bought a Mac. For the first time. Ever. I'm not against Macs or Apple in general (I own an iPod and a Zune), in fact I find them very quaint and highly capable. Ever since they succumbed to the Intel juggernaut, my interest was piqued. They are no longer tied to what had become a "boutique" CPU, the PowerPC, but rather to a tried-and-true, mainstream, continuously developed and advancing architecture. Say what you will about all the flaws (and there are many) with whole x86 architecture, it is here to stay, is getting some serious industry investment from both Intel, AMD, and even VIA. Remember the whole IA64 (Itanium) phase? This just proves the point that even Intel cannot stop the force they created. AMD, recognized this early on and created the AMD64 extensions to the existing x86 architecture. Eventually, Intel realized that they're only reasonable course of action was to jump onto that bandwagon. It is the biggest example of "if you can't beat 'em, join 'em." Apple finally realized this when the powerPC just wasn't able to keep up in terms of raw CPU power and the onslaught of the multi-core CPU. Apple had to cut costs, and the easiest way to do it was to join the commoditized-to-the-hilt x86 industry. Yes, this is old news.

I'll get to the real motivation for swallowing my pride and buying the Mac in a moment. This article popped up on my RSS feed and it really resonated with me. Regardless of the comments, which all too painfully only served to prove the author's point (and which I'm sure there will be comments to this post which will do the same thing), it was interesting to see something I've long suspected, actually put into words. It was also so apropos since I've, sort of, joined the ranks of Mac owners. The gist of this article and the book from which it is excerpted (from what I can gather), is that we're all very passionate, like to "belong," and want to feel validated and sometimes even lauded for our choice in car, home, clothes, spouse, occupation, religion, computer, etc... Yes, I admit that I sometimes feel much the same way as the stereotypical Apple-fanboy when confronted with valid, well-reasoned, unbiased analysis.

This hit home all too well lately because, well, this new Mac is not really mine. Yes, I was involved with its purchase, but I don't actually use it. See, my wife has decided to return to school and formally pursue something she's come to really enjoy over the past few years, graphic design and layout. It started with doing monthly newsletters for the local Mothers of Twins club (my daughters are twins), then on to doing other various projects, and finally culminated in doing the entire Scotts Valley High School annual Football program. She's already committed to do it again for the Fall-2008 season. She has used PCs (natch) for many years and has a very nice 17" notebook. Now that she's back in school, which is where she's at while I write this post, it became painfully obvious that there is a very clear bias in this particular field toward the mac. Sure, the same applications are available for the PC, but to be truly "accepted" shouldering a mac really helps. Most printers happily work with raw files from these specific applications and will actually charge hefty fee if you submit work using certain PC-only applications. It also makes the professor actually want to talk with you.

Here is where this whole Mac-fanboy, "Apple can do no harm," attitude really hit me. As relayed by my wife, this professor unequivocally stated on the first day of class that you must use the Mac version of the applications. The pure shock and horror on the professor's face when asked if one could use the PC versions had to be priceless. A similar attitude was expressed when my wife submitted some homework which had a few things marked off. The professor actually told her that she must not have used a laser printer because a laser printer would have printed correctly. Well, that chapped my hide to no end... I wanted to load up my nice COLOR laser printer into the truck, drag it up to her class and plop it into the middle of the professor's classroom and insist that they prove to me that this is not a stinkin' laser printer! (no, it is not one of those dye-sublimation or "melted crayon" printers) That lasted for only a few moments as I stood there speechless and dumbfounded. Upon further examination and obtaining more information, it turns out that it wasn't printed on the same brand and model laser printer that the professor used. See this is a design and layout class for printed medium, and to grade the work, it is not about the content (which is usually that funky fake Latin text), but about its layout and presentation on the page. So where things are is more important that whether or not you spelled some word correctly. Ok, that makes sense, she'll have to use the school's laser printers instead of ours.

The whole purchasing experience was just too surreal. You walk into an Apple store, and it is literally like being a voyeur to this whole cult-like Apple... thing... I just cannot explain it. We'd already researched exactly which Mac we're going to get and we'll use my wife's student ID for a discount. It was simply a matter of walking up to the machine we wanted, point to it and grunt a few syllables about how we're going to pay for it and instruct the resident drone to walk back through the stainless steel door at the back of the store to obtain said box of Jobs' magic. The top-of-the-line 17" MacBook Pro is certainly a sexy bit of hardware. You just cannot deny that. That was nearly four weeks ago. At about two weeks into this, things started going a little south on us.

Once some of the applications needed for the class arrived (after getting some nice deep student discounts), it was time to install them. I get this frantic message from my wife saying that the machine won't recognize the disk, and now it won't even boot! Hmmm, ok... not. good. Oh, yeah, the whole "see I knew it!" flags started popping up in my head all over the place. They do have flaws! Eventually, she was able to eject the disk, after making some really horrendous noises and still never recognizing the disk. Less than two weeks old, and the Super-drive is dead.

You gotta love this, you can make an appointment at the "Genius bar" online. So we did. Drove back to the store and met with the "Genius" (yes, I'm snickering here...). We explained that we've had the machine for barely two weeks and only now had an occasion to use the drive and this is what happened. They tested it and, sure enough, no workee. Just nasty noises. I was ready for what happened next. I asked if they're going to "send it in" for repair. I was all ready to explain that if they went down that road, I'd be a little irritated because, to me this was a failure that left the factory in that condition and why should we be without the machine for however long it takes to fix it. Also, I'd just be inclined to return it for a refund. Luckily, they saw the logic agreed to simply replace the machine.

Yeah! They're going to replace it... not so fast, there buckaroo. It turns out that we'd bought the machine only a few days before Apple decided to announce some minor changes to the whole MacBook line. New things like LED backlit LCD, multi-touch track pad, newer Penryn-based CPU. That all meant they were out of stock. Rats. They said they'd call once the new machines were in. (I've heard that song and dance all too often) For the next week, we called the store several times and still no shipments. We contemplated simply getting a refund and ordering online, but that would mean not having the machine for school. Finally, yesterday, they actually called. A new machine had been set aside and to come in anytime. After arriving, we explained why we're there and they walked back through those sterile stainless steel doors, and returned with a box to which was affixed a post-it note with my wife's name on it. Nice.

After about another hour as they transferred the documents, settings and applications, we were on our way. And, yes, I did have them test the drive to make sure it worked :-). All-in-all, Apple's service was pleasant and understanding. I'd have to say that the hardware is no higher or lower quality than a solid ThinkPad or some of the newer Dells (older Dell notebooks had, ahem, some real issues). They're certainly more "artsy" looking than a pure utilitarian PC notebook. And OS X? It's just different. Some things I like about it, and others just make me want to scream. I can say the same things about XP or Vista.

I'm sure many will see through my "unbiased" view straight to the obvious bias in this post. And for those Apple-haters, I hope this redeems me; sitting next to me here in my home office, is a new, home-built, overclocked quad-core, 4GB, Vista x64, 512MB nVidia 8800, 1TB raid, PC rig that I had to build to seek forgiveness and cleansing from committing Apple-adultery ;-).

Friday, February 22, 2008

Thread pools Task handling.

The Delphi Parallel Library (DPL) tends to occupy a lot of my otherwise idle "thought" time. DPL has been redesigned in my head more times than I can count. The good news is that it is conceptually beginning to actually take some shape and is finding a direction. As I've stated before, a lot of the concepts behind the DPL are taken from Intel's TBB and Microsoft's TPL.

When I started down this road, it all began as an investigation into various ways of implementing a thread pool. It turns out that this simplistic mechanism really doesn't truly address parallelism and concurrent processing. It is merely a way of shoveling off a task into another thread, which may or may not execute concurrently. A thread pool doesn't do anything to "load-balance" the work in a constructive manner, nor does it purport to. That is left as an exercise for the user.

While reading and researching all the various approaches to handling parallelism, a few light bulbs started coming on. I've been approaching this whole thing from the wrong angle. I'd been trying to shoehorn task handling (a task being the smallest unit of work that can be made parallel) concepts into the more general notion of a thread pool. I admit that sometimes I'm not the sharpest knife in the drawer, but I can recognize when one approach is superior to another.

Rather than starting with the concept of thread pools, a parallel library should really begin at the most basic level, the task or just a unit of work. A task is merely some chunk of code that may or may not execute concurrently. Another thing to remember is that a parallel library is about getting these tasks done as fast as absolutely possible. It is also about keeping as many of the CPU cores as busy as possible with as little idle time and overhead. It is not about fairness. It is about speed.

To best do this, the internal scheduling and management of tasks should control how they are scheduled and processed on the CPU cores. Remember, we're talking about the actual real, hardware CPU core (or virtual core, like Hyperthreading). Not a thread. A thread is merely a way to easily segregate tasks among the cores. We'll rely on the operating system to make sure that of the available cores, then there are equally as many "threads" actually burning CPU cycles for real work. Any other threads are stuck waiting for their chance at a core. Because of this, a task manager should endeavor to keep the number of running threads as close to equal to the number of available cores. It should also equally handle 1 or 100 cores by making sure as many cores are a busy as possible.

Using a task based approach also requires that the threading and scheduling is completely (or nearly completely) hidden. The user should never think of threads. The user only deals with Tasks and the extra abstractions built on top of them. So it looks like DPL is headed for another rewrite, which is good because each time I do this it is only getting better. I've not spent a whole lot of time on the current iteration as it stands, so this isn't much of a setback.

Microsoft's TPL is one of the few things to come out in recent years that looks great and performs well. The concepts behind it are very well suited for Delphi. To get an idea of some of the things you can do with it (and eventually the DPL ;-) you really should watch this video. The really interesting bits come on toward the last 1/2 to 1/3 of the video where they go into the idea behind work stealing and work load balancing. Really, really interesting stuff. Joe Duffy also addresses how you need to think about parallelism and how it applies to your situation.

As part of this, I just ordered Joe's soon to be released book on Concurrent Programming on Windows Vista. You can actually buy the book online ahead of the actual print release and download a PDF of the work in progress until then. From reading only a few of the online excerpts, it looks to be one of those "must have" tomes for anyone doing any kind of parallel programming in native Windows and in .NET. Even though a lot of the information I know, it is good to have it codified in a consistent organized manner. As I stated in my first post about thread pools, as I dig deeper into this whole subject, I want to make sure that Delphi developers at all levels will find some value in such a library. The best thing you can do at this point is to learn as much about these concepts as you can. Hopefully this blog and others I've been referencing and will reference in the future will be an excellent source of information. I'm finding that as I write about this, it give me a chance to review my assumptions and design ideas with a critical eye. Hopefully, we can all learn something along the way as well. I know I have so far.

Tuesday, February 19, 2008

Lock my Object... Please!

Here's a quick recap of all the DPL (Delphi Parallel Library) related posts over the last few months:

A Critical[Section] Difference: Windows XP vs. Windows Vista
Breaking the rules
Simmering Unicode, bring DPL to a boil (Part 2)
Simmering Unicode, bring DPL to a boil
Placing your code in the forge - Refining a technique
When code lies - A better solution
Stupid Enumerator Tricks - And now for something completely different
Magical Assembler Incantations - Nested functions and anonymous methods
The Life and Times of a Thread Pool
I cut and I cut and it was still too short!
Spot the deadlock
Wading in the shallow end of the pool - Thread Pools

As another follow on to my discussion of the TMonitor (see the above "Bring DPL to a boil" articles) things have changed a little. The TMonitor class is no longer a separate class that you can create and use. It is now tied directly to any TObject instance or derivative (which means any instance of a Delphi class). There is still the notion of a TMonitor but only to serve as a way to tie together the "monitor" functionality. Rather than creating one and setting about to locking/unlocking it, you now only need an object instance. For Tiburón, you merely need to call System.TMonitor.Enter(<obj>); or System.TMonitor.Exit(<obj>); among the other related methods.

I've contemplated putting these methods directly on TObject itself, but there is the issue with calling these methods when a descendant class has a method of the same name. The code will compile and function normally due to Delphi's scoping rules, but it could make it harder for users to call those methods in these cases. Thread synchronization should never be done lightly and should require some forethought and planning. A simple rule of thumb is that you should never call unknown code (or code that you're not entirely sure where it goes and what it does) while holding a lock.

This now paves the way to add some interesting intrinsic functionality such as using this as the basis for "synchronized" methods or even synchronized code blocks similar to the C# lock keyword. While you've always been able to create an OS critical section, or use some of the helper classes in SyncObjs.pas, this new mechanism will be available on any object.

I merely design it then write about it... you get to kvetch :-)

Wednesday, February 6, 2008

A Critical[Section] Difference: Windows XP vs. Windows Vista

No, this isn't one of those comparisons! This is just something somewhat interesting about the difference in the implementation of a critical section in Windows XP vs. Windows Vista. It seems that Windows Vista is much more resilient in how it handle's the misuse of a critical section. One such degenerate, blatantly obvious case of misuse is doing the following:

LeaveCriticalSection(CSec);
EnterCriticalSection(CSec);

Yes this is wrong on so many levels, but the interesting thing to note is that under Windows XP, the above code hangs at the EnterCriticalSection call because the LeaveCriticalSection call did some very bad damage to the critical section structure. The problem is that the RecursionCount field of the critical section is decremented and left as non-zero (-1) which causes the LockCount field to also be decremented in order to keep the two field counts in sync. When the EnterCriticalSection call is made, the LockCount field is incremented, but it is still non-zero which makes it think there is another owner. The OwningThread Field is compared with the calling thread's ID which being 0, is not equal to the calling thread. But no other thread owns the lock! It blithely forges ahead and allocates the wait event and proceeds to block. Oops!

Windows Vista, in contrast, has changed how it manages the LockCount field. Rather than only accumulating the waiters (and recursions) in terms of simply incrementing (or decremented) the LockCount field, it uses the low-bit as the actual indication of holding the lock and accumulates the waiter and recursion counts in the remaining bits. This means that the critical section can now only have 2^31 ~~recursions and~~ waiters. (UPDATE: Upon further examination, only the waiters are indicated in the LockCount field) Still plenty of space. The upshot of this is that the above degenerate case no longer will cause a complete hang in many cases where it would have. Why they made the change is anybody's guess; Maybe Raymond Chen can pipe in and explain the reasons... Seems to me that anyone with doing the above deserves to be put into the penalty box. Now all that is going to happen is that when a program that once used to hang, will now appear to work only to probably show up some other flaw downstream. Seems like a hang/crash now or hang/crash later kind of thing. The end result is still the same.

Friday, February 1, 2008

ODPC - One Delphi Per Child

Ok, I just made that up. But that is the crux of the 1 million seat licensing deal just closed with the Russian Federal Agency of Education. Infoworld also has an article about the deal right here. A whole new generation of Delphi developers are now in the making!

Monday, January 28, 2008

Meanwhile, back at the (Unicode) ranch

I use the TStrings.ReadFromFile/WriteToFile methods all the time to read/write text files and manipulate them. Does this mean all these files are now Unicode and I can't read/write the files in the old format?

I've gotten this specific question a few times recently, so I figured I'd address it directly. The bottom line here is that we've got you covered. The defaults for those TStrings methods will now write the files ANSI encoded based on the active code page and will read the files based on whether or not the file contains a Byte-Order-Mark (BOM). If a BOM is found it will read the data encoded as the BOM indicates. If no BOM is found it will read it as ANSI and up-convert based on the current active codepage. All your files written with pre-Tiburón versions of Delphi will still be read in just fine, with the caveat that as long as you read with the active codepage the same as with what was written. Likewise, any file written with Tiburón (unless you override things which I'll describe in a moment) should be readable with a the pre-Tiburón version. At this point, only the most common BOM formats are detected (UTF16 Little-Endian, UTF16 Big-Endian and UTF8).

That's great news! But what if I want these files to be written as Unicode?

We realize that many Delphi applications will need to continue to interact with other applications or datasources, many of which can only handle data in ANSI or even ASCII. This is why TStrings exhibit the default behavior I described above. However, many of you will want to read/write text data using the TStrings class a loss-less Unicode format, be that Little-Endian UTF16, Big-Endian UTF16, UTF8, UTF7, etc... This led us to introduce a new TEncoding class to SysUtils.pas. This class is very similar in methods and functionality that you can find in the System.Text.Encoding class in the .NET Framework. That class contains several static functions that return standard instances of TEncoding class descendants that handle many of the common encoding formats. You can also create your own TEncoding class descendant should you feel the need to handle encodings that may be unique to your situation (say, for Base64 or Quoted-Printable).

The TEncoding class is key to how the TStrings class now handles being able to read/write its contents in the various formats. There are now overloaded versions of the TStrings.ReadFrom/WriteToXXXX functions that now take an instance of a TEncoding class. In most cases you'll just want to pass in one of the singleton versions that you can obtain by calling one of the class static methods on the TEncoding class. For instance, suppose you want to write the data out as UTF8, you'd call it like this:

var
  S: TStringList;
begin
  S := TStringList.Create;
  ...
  S.WriteToFile('config.txt', TEncoding.GetUTF8);
  ...
end;

Without the extra parameter, 'config.txt' would simply be converted and written out as ANSI encoded based on the current active codepage. The interesting thing to note here is that you don't need to change the read code since TStrings will automatically detect the encoding based on the BOM and do the right thing. If you wanted to force the file to read and written using a specific codepage, you can create an instance of TMBCSEncoding and pass in the code page you want to use into the constructor. Then you use that instance to read and write the file since the specific codepage may not match the user's active codepage.

I use INI files for configuration and extensively use TIniFile and TMemIniFile. What about those?

The same thing holds for these classes in that the data will be read and written as ANSI data. Since INI files have always been traditionally ANSI (ASCII) encoded, it may not make sense to convert these. It will depend on the needs of your application. If, you do wish to change to use a Unicode format, we'll offer ways to use the TEncoding classes to accomplish that as well.

In all the above cases, the internal storage will be Unicode and any data manipulation you do with string will continue to function as expected. (Unless you're doing some of those odd data-payload tricks I've mentioned previously). Conversions will automatically happen when reading and writing the data. Those of you that are familiar with VCL for .NET, you already know how all of the above works since the new overloads added to TStrings were introduced with VCL for .NET and use the System.Text.Encoding class for all these operations.

Thursday, January 24, 2008

Taking a big PByte of pointer math

Many of you have done it. We here have done it. It is just so danged convenient and satisfying. Sure, it is not regarded as a "safe" thing to do. Without the proper precautions, you can easily get into some serious trouble. But, hey, the C/C++ programmers do it all the time! What is "it"? Pointer math. Pointer math is simply treating any given typed pointer in some narrow instances as a scaled ordinal where you can perform simple arithmetic operations directly on the pointer variable. It also allows you to treat such a pointer variable as an unbounded array using the array [] operator. In Delphi there are only a few cases where this is possible. Let's set the WABAC machine for a little history lesson.

Around 1990 when Borland was moving Turbo Pascal to the new up-and-coming GUI operating system, Windows 3.0/3.1, the programming interface was a pure C-based API (with a "Pascal" calling convention, ironically). That meant that character strings were handled as null-terminated arrays of chars. These were mostly passed around by "reference" as a char* type. Given this, it was very common to scan through a string by treating the typed pointer as an array by applying the "array operator []." Likewise, simple arithmetic operations such as addition or subtraction are also common and convenient. In order to better match the Windows programming model, Turbo Pascal for Windows introduced a special PChar type. This type was unique in that the compiler had intrinsic knowledge of this type which allowed those same behaviors described previously.

Return to today. Because of this special "PChar" type behavior, many developers (ourselves included) would do crazy things such as casting a pointer of one type to a PChar and then do some pointer arithmetic. This was done because all other user-defined typed pointers would not allow these types of operations. What this has done is created cases where some code is littered either with a lot of pointers cast to PChars or the direct use of PChar pointers even when the data being manipulated isn't specifically byte-sized characters. In other words, PChars were used to manipulate byte-buffers of arbitrary data.

During the development of Tiburón, we discovered some of our own code was doing a lot of the above things. (I told you we've all done it!) Using the PChar trick was simply the only thing available and made a lot of the code simpler and a little easier to read. Now that PChar is an alias to PWideChar, things started falling over because the element type is now 2 bytes. As I mentioned above, a typed pointers are scaled based on the size of the element. Anyplace there is a PChar(Ptr) + 1, Inc(PChar(Ptr)), or similar expression, the result is an increment by 2 instead of 1. As you can imagine, this has caused a few headaches. In looking at the code, it was clear that the intent was to access this data buffer as an array of bytes, and was merely using a PChar as a convenience for accessing as an array or performing simple arithmetic.

What about using a PByte? PByte has been a type declared in the Delphi System unit for many releases. It is declared as simply a pointer to a byte or PByte = ^Byte. One solution would be to teach the compiler special intrinsic knowledge of the PByte type just like the current PAnsiChar and PWideChar types, but I didn't like where that was heading (what about the next type we'd want to do this to, and the next?). What if you wanted to have a typed pointer to some other type such as an Integer or a record? What if you wanted to do the same kinds of scaled pointer arithmetic and array indexing? So for Tiburón, we're introducing a new compiler directive {$POINTERMATH <ON|OFF>}. If you declare a typed pointer while this directive is on, any variable of that type will allow all the scaled pointer arithmetic and array indexing you like. Likewise, any block of code that is surrounded by that directive, will allow the same syntax for any typed pointers within that block regardless of whether or not it was originally declared that way. PByte is declared with that directive ON. This means that all the places that are using a the PChar type simply for the byte-pointer and byte-array access, can now use the PByte type instead and none of the existing code statements and logic needs to change. A simple search and replace over the source is what is needed. Sure, we could have changed the PChar references to PAnsiChar, but that simply serves to perpetuate the lack of clarity over what the intent of the code actually is. This directive is not on by default, and the only existing type that is declared with it on is PByte. It also only affects typed pointers. Variables of type "Pointer" do not allow the pointer math features since it is effectively pointing to a "void" element which is 0 size. Untyped "var" or "const" parameters are not affected because they're not really pointers (even though internally a "reference" in the form of an address is passed in).

This code will now compile with the new Tiburón compiler:

{$POINTERMATH ON}
procedure MoveRects(Rects: PRect; Count: Integer);
begin
  while Count > 0 do
  begin
    MoveRect(Rects[Count - 1]); // This line will not compile today.
    MoveRect((Rects + Count - 1)^); // Neither will this line.
    Dec(Count);
  end;
end;

Little known factoid: typed pointers have always been able to manipulated using the Inc() and Dec() standard procedures†. The above code is very contrived and could be replaced the following that will compile today:

procedure MoveRects(Rects: PRect; Count: Integer);
begin
  while Count > 0 do
  begin
    MoveRect(Rects^);
    Inc(Rects);  // This will scale the increment by the size of the element. 16 in this case.
    {Rects := Rects + 1;}  // The above is the same as this, which won't compile today.
    Dec(Count);
  end;
end;

Yes, this functionality can be considered very dangerous, especially where you can easily overrun of the end of a buffer. This should be used judiciously and prudently in some very narrow instances. If you don't want to use it, simply never turn on that directive and any code that tries to do this will not compile.

† There is an interesting bit of history with the Inc() and Dec() standard procedures. When the PChar was introduced, Inc() and Dec() were also modified to operate on a PChar variable. It turned out that it worked for all pointer types. Additionally, it also properly scaled the increment or decrement operation based on the size of the pointed-to element. Since a PChar pointed to a byte-sized element all the testing focused on that type. The product was released and it was then discovered that Inc() and Dec() had this additional behavior. The following release we decided to "fix" the problem. You would have thought we were the angel of death and had just killed the first-born of every household! The hew and cry from the field testers was overwhelming. We finally relented and put the "bug" back in and it has remained to this day as a "feature." I wonder what the reaction will be to this change? I predict there will be hails of praise from one camp, and sneering and guffaws from another. The largest camp will be the ones ambivalent to this change. To which camp to you belong?

Tuesday, January 22, 2008

Breaking the rules

Sometimes you just have to break the rules in order to learn something new. It is inherent in everyone, from the first time you ignored your parents about getting burned by the hot stove, to doing 10+ over the speed limit because "you can." Even though I mentioned in this post that you should never go mucking with the fields of a critical section structure, I just could not keep my grimy little hands off :-). One reason is that in Windows Vista, Microsoft introduced new condition variable APIs. My previous posts were about creating a full monitor object which is the combination of a mutual-exclusion lock (critical section) and a condition variable (wait/notify/notifyall). However there are cases where you would want to use more than one condition variable with the same lock. That is what these new Windows APIs allow you to do. After reading as much about the underlying implementation of these new bits of synchronization goodness, I found that they use a relatively new internal set of undocumented APIs for "Keyed Events." You can read this post by Joe Duffy for a high-level explanation. What if I could "backfill" these new APIs for older version of Windows within the DPL? I would need to break some rules in order to do that. I'd also have to forego any use of keyed events since those are only Windows XP and up (and not documented either). Windows 2000 would be out of luck. Let's get started with this hack-fest. For now let's ignore the "slim" reader/writer locks.

There are four key APIs to duplicate and one structure.

type
  PRTLConditionVariable = ^TRTLConditionVariable;
  _RTL_CONDITION_VARIABLE = record
    Ptr: Pointer;
  end;
  TRTLConditionVariable = _RTL_CONDITION_VARIABLE;

procedure InitializeConditionVariable(out ConditionVariable: TRTLConditionVariable);
procedure WakeConditionVariable(var ConditionVariable: TRTLConditionVariable);
procedure WakeAllConditionVariable(var ConditionVariable: TRTLConditionVariable);
function SleepConditionVariableCS(var ConditionVariable: TRTLConditionVariable;
  var CriticalSection: TRTLCriticalSection; dwMilliseconds: DWORD): BOOL;

The condition variable structure has one field which is not much space to store state information. My first thought was to simply allocate a structure that would be pointed to by the single field of the condition variable. The problem is there is no Delete/Destroy/FreeConditionVariable API. How would this structure be freed? Rats! That's a bust. This is where it is time to get all geeky and put on the propeller cap. If you recall from my discussion of the TMonitor, the Wait method would insert a little structure into a linked list that represented all the current waiters. We need to do something similar. There is a little twist here as well because the WakeCondtionVariable and the WakeAllConditionVariable are not required to be called from within the lock. In fact, all they take is the TRTLConditionVariable structure as the lone parameter. This makes managing that internal list of waiters a lot more tricky since we're not guaranteed to be in a lock while we need to update that list. This is compounded by the fact that the condition variable structure has only one pointer sized field. My first thought was to use a lock-free queue. The problem here is that we need two pointers, one to the head of the queue and one to the tail of the queue. I only have one pointer. Before we get into SleepConditionVariableCS where we're going to do some unholy things to the poor critical section structure, let's get the wait queue out of the way.

To set the stage we need a little review of field alignment. Alignment of fields of a structure, the local variables on the stack and the data on the heap is very important to making sure reads and writes are fast and atomic. Packing a structure that knocks the fields out of their "natural" alignment can sometimes have unwanted side-effects. Yes, there are times where you will get a few "dead bytes" between fields. A double word sized entity on the 32bit x86 is 4 bytes in size. This means that the "natural" alignment of this entity is on even 4 byte boundaries. As long as the entity falls on those boundaries, you know that the lowest 2 bits of any pointer to that entity will always be 0. We can exploit that fact with a bit of (OMG! is he going to do what I think he's going to do!!???) propeller-head bit twiddling. Now before you scroll down right now and hit the comment button and fire up the flame thrower, please bear with me.

Just like the Wait method of the TMonitor class, the SleepConditionVariableCS will simply place a structure allocated on the stack into a queue right before releasing the crit-sec, and block on an event. Because interacting with the queue does not use any inherent locking we need a thread-safe queue. We also only have that one field. How do we "lock" the list while it is being updated? The key is using one of those extra bits that we know will always be 0. If one of those bits is not 0, the list is locked. Here is the LockQueue and UnlockQueue functions:

type
  PWaitingThread = ^TWaitingThread;
  TWaitingThread = record
    Next: PWaitingThread;
    WaitEvent: THandle;
  end;

//UPDATE:  Slight modification based on comment
function LockQueue(var ConditionVariable: TRTLConditionVariable): PWaitingThread;
var
  ReleaseSlice: Boolean;
begin
  ReleaseSlice := GetProcessorCount < 2;  // This local could be eliminated by using a global
  while True do
  begin
    Result := PWaitingThread(Integer(ConditionVariable.Ptr) and not 1);
    if TInterlocked.CompareExchange(Integer(ConditionVariable.Ptr), 
      Integer(Result) or 1, Integer(Result)) = Integer(Result) then
      Break;
    if ReleaseSlice then
      SwitchToThread
    else
    asm
      PAUSE
    end;
  end;
end;

procedure UnlockQueue(var ConditionVariable: TRTLConditionVariable; WaitQueue: PWaitingThread);
begin
  ConditionVariable.Ptr := Pointer(Integer(WaitQueue) and not 1);
end;

It should be immediately obvious that the LockQueue function will never return until it can successfully lock the queue. It should also be noted that it will also spin-lock indefinitely, which can burn lots of cycles. Under a single proc system, doing a busy-wait is not cool so we should just release this thread's timeslice and let the scheduler get another thread to run. (Sleep(0); is not appropriate since that only allows threads with equal or greater priority to run, SwitchToThread is far more fair and much more like a traditional "yield"). Now that we have the tail of the linked list returned from LockQueue, we can manipulate the queue all we want. We just cannot "touch" the ConditionVariable.Ptr field and can only operate with the value returned from LockQueue. Once we're done we unlock the Queue by passing the potentially updated tail pointer to the UnlockQueue which will simply update the Ptr field of the ConditionVariable. About the performance implications; In theory, the queue should never grow very (one entry per waiting thread per condition variable) deep so the queue operations should be fast enough to not adversely affect performance (although I know it will a little bit). A final note about this technique is that it does not allow any type of recursion, so you much make sure there are no calls that could call LockQueue on the same condition variable while the lock is held.

If you haven't lost your lunch or broken out in hives by now, let's move on to the SleepConditionVariableCS function. It is commented similar to that of the TMonitor.Wait function.

function SleepCriticalSectionCS(var ConditionVariable: TRTLConditionVariable;
  var CriticalSection: TRTLCriticalSection; dwMilliseconds: DWORD): BOOL;
var
  WaitingThread: TWaitingThread;
  RecursionCount: Integer;
begin
  if CriticalSection.OwningThread = GetCurrentThreadId then
  begin
    WaitingThread.Next := nil;
    WaitingThread.Thread := CriticalSection.OwningThread;
    WaitingThread.WaitEvent := CreateEvent(nil, False, False, nil);
    try
      // Save the current recursion count
      RecursionCount := CriticalSection.RecursionCount;
      // Add the current thread to the waiting queue
      QueueWaiter(ConditionVariable, WaitingThread);
      // Set it back to almost released
      CriticalSection.RecursionCount := 1;
      TInterlocked.Add(CriticalSection.LockCount, -(RecursionCount - 1));
      // Release and get in line for someone to do a Wake or WakeAll
      LeaveCriticalSection(CriticalSection);
      // This is, admitedly, a potential race condition... what do you think?
      Result := WaitForSingleObject(WaitingThread.WaitEvent, Timeout) = WAIT_OBJECT_0;
      // Got to get the lock back and block waiting for it.
      EnterCriticalSection(CriticalSection);
      // Remove any dangling waiters from the list
      RemoveWaiter(ConditionVariable, WaitingThread);
      // Lets restore all the recursion and lock counts
      TInterlocked.Add(CriticalSection.LockCount, RecursionCount - 1);
      CriticalSection.RecursionCount := RecursionCount;
    finally
      CloseHandle(WaitingThread.WaitEvent);
    end;
  end else
    Result := False;  // should call SetLastError with the appropriate error code.
end;

Upon entry, we make sure that the calling thread owns the lock and if not return false (and as the comment suggests we should call SetLastError with the appropriate error code). Once we determine that the calling thread owns the lock, we need to start the process of tearing down the critical section and blocking on the wait event. The main difference between TMonitor.Wait and this function is that we still want to make sure that any threads waiting for the critical section are properly woken up, so we want to first tear down the critical section to within 1 count of being released, then call LeaveCriticalSection() which will do the final deed and unblock an waiters. We then immediately block on the wait event. Once we return from the wait either by being signaled or timing out, we have to re-acquire the critical section. This will only be a recursion level of 1, so we now need to restore any recursions to it. The QueueWaiter and RemoveWaiter functions use the LockQueue and UnlockQueue functions above to lock and manage the condition variable's waiting thread queue. Like the TMonitor.Pulse and TMonitor.PulseAll functions, the Wake and WakeAll functions are very simple.

procedure WakeAllConditionVariable(var ConditionVariable: TRTLConditionVariable);
var
  WaitingThread: PWaitingThread;
begin
  WaitingThread := DequeueWaiter(ConditionVariable);
  if WaitingThread <> nil then
    SetEvent(WaitingThread.WaitEvent);
end;

procedure WakeAllConditionVariable(var ConditionVariable: TRTLConditionVariable);
var
  WaitQueue, WaitingThread: PWaitingThread;
begin
  WaitQueue := LockQueue(ConditionVariable);
  try
    WaitingThread := DequeueWaiterNoLock(WaitQueue);
    while WaitingThread <> nil do
    begin
      SetEvent(WaitingThread.WaitEvent);
      WaitingThread := DequeueWaiterNoLock(WaitQueue);
    end;
  finally
    UnlockQueue(ConditionVariable, WaitQueue);
  end;
end;

For the WakeAllConditionVariable function, we want to atomically wake up all current waiters and don't want any intervening threads to be injected into the queue. So we have a special no lock version of DequeueWaiter which only takes the tail pointer passed by reference ("var" parameter). It will update the tail pointer appropriately, so when UnlockQueue is called, it should be nil. In fact DequeueWaiter simply locks the queue and then calls DequeueWaiterNoLock then unlocks the queue.

I'll totally understand if you have the sudden urge to take a shower, or to stick red-hot needles into your eyeballs... but sometimes it is just a little fun to break the rules (evil grin :-). I will not be surprised if this shows up on Coding Horror or The Daily WTF for public ridicule :-) As for an alternative implementation, all I want is an efficient user-level implementation of keyed events. That would eliminate the queue and make this function work more like the Vista version.

Friday, January 18, 2008

Simmering Unicode, bring DPL to a boil (Part 2)

As promised, I'll be covering the next group of methods on the TMonitor class. Wait, Pulse, and PulseAll. Before I do that, I want to quickly cover the spin-lock section of the Enter method from this post. (UPDATE: For the record, the TInterlocked class does not exist yet in any shipping VCL. You can simply replace these calls with the direct call to the Windows InterlockedXXXX functions).

It may seem counter-intuitive when you look at the loop in the first section of the Enter method. Why "waste" time in a loop when you could just block and release this thread's time-slice for some other thread? It has to do with the overhead and expense that goes along with blocking on a kernel object, returning to the OS thread scheduler and finally passing control to another thread. This could take literally thousands of CPU cycles and by the time this has all happened, the lock may have been released and the "waiting" thread can have the lock. Spin-locks are not for every case and if used improperly can actually do exactly what it looks like they do, waste cycles. The spin-lock will work best in situations where you have more than one CPU core, and more than one thread vying for the same resource in a relatively tight loop.

If a given core can just wait for a few cycles before doing the expensive and dreaded full-on kernel-level block, you can increase performance by drastically reducing user-level caused kernel-mode switches. Yes, the OS pre-emptive scheduler is still going to pre-empt the thread at various points, but those are things you cannot control. You can, however, control voluntarily giving up your time-slice. Remember, having a non-zero spin-count only makes sense for a system with more than one CPU core. In the case of systems with one CPU core, you must relinquish your time-slice in order to give the other "thread" a chance to actually complete its task. For this reason, the TMonitor makes sure there is more than one CPU core before allowing the internal FSpinCount field to be set to anything other than 0. Since the lock count is not a fixed value, that also means it is a tunable item. It allows you to "tune" the value that works best given the specific code in which it is used. The Windows critical section allows this value to be "tuned" in the same way using the InitializeCriticalSectionAndSpinCount API.

Some of the comments I've gotten from the previous post seem to me to indicate that some folks didn't read the whole post or simply missed a few key points. The whole point about that post was not that I'm trying to make a better critical section than what is provided by the OS. It was that until Windows Vista(and Win2008) you cannot use a critical section object very easily with other synchronization primitives. There was also some comments about how we shouldn't be spending any time on traditional locking since it is not "innovative." Not every little problem out there can (and should) be solved using things like continuation passing style, or other functional language-like constructs. Ultimately, some form of locking is going to be needed.

Let's start with the Wait function. This function's job is to atomically release the monitor lock and then block until some other thread calls Pulse or PulseAll. So you can actually be deeply nested in Enter calls on the monitor and then call Wait and the lock is completely released and the thread will block. This will allow another thread that may be blocked on the Enter call to gain the lock. Before Wait returns it will have re-acquired the lock and have properly restored the recursion state. Calling Wait implies a contract that some other thread must call Pulse or PulseAll before this thread can continue. If that never happens and the Timeout is INFINITE, it will block forever. Here's the Enter function:

function TMonitor.Wait(Timeout: Cardinal): Boolean;
var
  RecursionCount: Integer;
  WaitingThread: TWaitingThread;
begin
  WaitingThread.Next := nil;
  WaitingThread.Thread := CheckOwningThread;
  // This event should probably be cached someplace.
  // Probably not on the instance since this is a per-thread-per-instance resource
  WaitingThread.WaitEvent := CreateEvent(nil, False, False, nil);
  try
    // Save the current recursion count
    RecursionCount := FRecursionCount;
    // Add the current thread to the waiting queue
    QueueWaiter(WaitingThread);
    // Set it back to almost released
    FRecursionCount := 0;
    // This thread is no longer the owner either
    FOwningThread := 0;
    // Release and get in line for someone to do a Pulse or PulseAll
    if TInterlocked.Add(Integer(FLockCount), -RecursionCount) >= 0 then
      Result := SignalObjectAndWait(GetEvent, WaitingThread.WaitEvent, Timeout, False) = WAIT_OBJECT_0
    else
      Result := WaitForSingleObject(WaitingThread.WaitEvent, Timeout) = WAIT_OBJECT_0;
    // Got to get the lock back and block waiting for it.
    Enter;
    // Remove any dangling waiters from the list
    RemoveWaiter(WaitingThread);
    // Lets restore all the recursion and lock counts
    TInterlocked.Add(Integer(FLockCount), RecursionCount - 1);
    FRecursionCount := RecursionCount;
  finally
    CloseHandle(WaitingThread.WaitEvent);
  end;
end;

(Note: Before someone chimes in and complains about the heavy handed CreateEvent/CloseHandle calls... please read the comments).

Since Wait can only be called while holding the lock, the CheckOwningThread call will raise an exception if the calling thread doesn't own the lock. Now that we know we own the lock, we start the process of "unwinding" the lock. We need to save off the recursion count since this has to be restored exactly. The TWaitingThread type is merely a record and is strung together as a linked list among all the potential waiters by the call to QueueWaiter. In fact, QueueWaiter doesn't need to do any "lock-free" style linked list machinations since it can only be called while the lock is held so no other threads can get there. The queue is simply a FIFO (First-In-First-Out) in order to keep some level of fairness. Yes, I know, it is not totally fair if the waiting threads all have differing priorities. In general, this Wait locking style is used among several equal priority worker threads, so that priority-inversions are less common.

Once we've added this thread to wait queue, we need to set about relinquishing the actual lock and prepare to block on the event created above. This is done by explicitly setting the FRecursionCount and FOwningThread fields back to 0. The next tricky part is to now remove all the recursion counts from the FLockCount field. Since this field is touched from other concurrent threads, we need to interlock it. There is no Interlocked subtract function so we simply add the twos-compliment negation of the the RecursionCount. Just like in the Leave function we need to see if there are other threads currently waiting for the lock. If the result of the subtraction of the recursion count from the lock still yields a value greater than or equal to 0, there are waiters. In this case we need to use the nice atomic function SignalObjectAndWait. This function will atomically signal the lock's event handle and then turn around and immediately block on the event handle we created within this function. If there are no other threads waiting for the lock, there is no need to signal the lock event and a simple WaitForSingleObject is all that is needed.

Once the wait event is signaled by the Pulse or PulseAll method, or the wait times out, the first thing we need to do is reacquire the lock. Yes, Virginia, this may also mean that this method will block, again! That is the contract to the caller. When Wait returns the calling thread has the lock. Once we have the lock again, we need to do some house keeping and state restoration. The first thing is to remove this thread from the wait queue. This is really *only* necessary in the case where the wait above timed out. Nobody dequeued the WatingThread record and set the event so it is still in the queue someplace. RemoveWaiter does nothing if we were signaled. Once I've proven to myself that this is the only case where the WaitingThread record could ever still be in the wait queue, I can add some code to check the result and only call RemoveWaiter if the result is false.

After we've cleaned up the potentially "dirty" queue, we now need to restore the nesting or recursion state. Returning from the Enter call we know that the FLockCount field has our lock added to it and the FRecursionCount field will be 1. We need to add back in our recursive locks, less the one Enter already gave us. Because we hold the lock we can simply blast over the FRecursionCount field.

Like the Exit method, the Pulse and PulseAll methods are very simple:

procedure TMonitor.Pulse;
var
  WaitingThread: PWaitingThread;
begin
  CheckOwningThread;
  WaitingThread := DequeueWaiter;
  if WaitingThread <> nil then
    SetEvent(WaitingThread.WaitEvent);
end;

procedure TMonitor.PulseAll;
var
  WaitingThread: PWaitingThread;
begin
  CheckOwningThread;
  WaitingThread := DequeueWaiter;
  while WaitingThread <> nil do
  begin
    SetEvent(WaitingThread.WaitEvent);
    WaitingThread := DequeueWaiter;
  end;
end;

There is the CheckOwningThread call again. Pulse and PulseAll can only be called while holding the lock. Since we are holding the lock, the DequeueWaiter function doesn't need any complicated locking code on the linked-list queue. It simply removes an item from the queue and returns it. Pulse will simply release the first thread in the queue if one exists. An added check could be made to make sure there really are waiting threads. PulseAll will loop and dequeue all waiting thread records and release them all.

So there we go. A Win32 implementation of a monitor. The TMonitor class is, right now, a proof-of-concept and will probably look much different should it make it into a future version of Delphi. The current thinking is that the field variables should actually be somehow "attached" to any arbitrary object instance, thus enabling a whole object to be locked. We're also looking at integrating this concept into the language itself through the use of some sort of LockObject standard function. Lots of possibilities. Stay tuned.

Thursday, January 17, 2008

Simmering Unicode, bring DPL to a boil

Now that most of the ingredients are in the Unicode pot, I'm going to move it to the back burner and turn it down to a low simmer. We'll come back to it later and give it a few stirs and make sure the seasoning is right. Meanwhile I want to start work in the next course, DPL, or the Delphi Parallel Library. Remember this is merely a working name for all the work and research I've been doing around adding more support for parallel/concurrent processing. DPL, IS NOT and official feature of Tiburón.

As a part of the my research into parallel computing I've also been reading up whole bunch on the notion of lock-free data structures. These little bits of computer science wonders really help take some of the most common and mundane data structures and algorithms and make them much more friendly to the world of multi-core systems. However, they're also not a panacea and not all environments lend themselves to making the most out of them. Some of the best write-ups on lock-free algorithms are the ones from Julian Bucknall of Developer Express. He has some excellent write-ups on lock-free Queues and Stacks. Although these examples are written in C# and do depend on the nicely garbage collected universe in which it runs (.NET), the concepts mostly transfer. In one of the articles Julian even threatened that he'd translated some of these items into Delphi/Win32 code... Hmm... I wonder if he solved the one somewhat intractable problem about node deletion in the underlying linked lists? I'm still trying to work on that one myself, so Julian, if you've got something clever, let me know.

While trying to avoid locks is a laudable goal, at some point you're just going to have to use some form of locking. So I've been working on an implementation of a TMonitor class. .NET has another nice advantage here that the notion of a Monitor is intrinsic to any object instance in .NET. Eventually this is how I want the TMonitor to work as well. So what is a Monitor? It is basically the combination of a lightweight mutual exclusion lock and a condition variable. In fact the wikipedia article about a condition variable mentions the monitor as being that combination. Up until Windows Vista, Windows has not had any notion of a condition variable. Even Microsoft has admitted that the event based technique of using an auto-reset event with the PulseEvent API was fraught with problems. Even the PulseEvent documentation says you should no longer use it and instead use the new Condition variable APIs. Cool! Great! We now have intrinsic OS support for creating a full TMonitor class, right? Not so fast. The problem is that these APIs only appeared in Windows Vista and Server 2008! Grrrrr. I want my code to work at least with Windows 2000.

So, now I'm stuck with figuring out how to create a monitor without the benefit of the OS. So, off I went to research the bits an pieces of what I need to do this. I knew that POSIX threads (pthreads) have the notion of a condition variable so I wondered how those POSIX Win32 layers implement them. That was when I stumbled upon this cool article about implementing POSIX condition variables on Win32. It is a few years old but was also very informative. So off I went and was able to add a new TConditionVariable class to the SyncObjs.pas unit. The problem is that the underlying implementation, how shall I say it?, less than modern and is very academic. The other problem is that it depends on a kernel-level mutex handle as the surrounding lock. Mutexes are great if you want cross-process mutual exclusions or you need a handle on which to pass to a WaitForMultipleObjectEx call. Other than that, the real mutual exclusion work-horse is the critical section. Critical sections are process local, lightweight and fast. The nice thing about them is that if there is no contention (only one thread tries to acquire the crit-sec at a time), no locking happens. This means the code executes purely in user-space and no expensive kernel-mode switch is needed. Another problem is that critical sections are (supposed to be) opaque structures. Even though the Windows API headers clearly show all the fields and give them nice names like LockCount, RecursionCount, etc... one should not muck with them directly. If for whatever reason, Microsoft decided to change the structure or change the underlying implementation, any code that directly touched those fields is asking to to be spanked, and spanked hard.

What to do... what to do... While the specific critical section structure and related APIs are opaque, the underlying implementation is fairly straight forward. The tricky part will be in adding the condition variable-esque aspects, notably the Wait/Notify(Pulse)/NotifyAll(PulseAll) functionality. Lets start with the mutex (critical section) bits. We need only three APIs for this, Enter, Leave(Exit), and TryEnter. Just like the OS' critical section we need a few fields to track the state of the critical section. Here's the declaration of what is needed so far:

TMonitor = class sealed
strict private
  FLockCount: Integer;  // Initialized to -1, no lock holders
  FRecursionCount: Integer; // Keep track of how many times we've reentered
  FOwningThread: Cardinal; // Who currently owns the lock
  FSpinCount: Integer; // Special busy-wait spin-lock for multi-proc systems
  FLockEvent: THandle; // The auto-reset event on which to wait when in contention
  function GetEvent: Cardinal;
  function CheckOwningThread: Cardinal;
  class var FProcessorCount: Integer;
  class function GetProcessorCount: Integer; static;
public
  constructor Create; overload;
  constructor Create(ASpinCount: Integer); overload;
  destructor Destroy; override;

  procedure Enter; overload; inline;
  function Enter(Timeout: Cardinal): Boolean; overload;
  procedure Exit;
  function TryEnter: Boolean;
end;

Let's start with the TryEnter function since is the simplest and most straight forward since it never blocks the calling process. Here it is including the comments! (wow, he commented the code :-).

function TMonitor.TryEnter: Boolean;
begin
  // First things first - initially check to see if we can gain ownership
  if TInterlocked.CompareExchange(FLockCount, 0, -1) = -1 then
  begin
    //  Yep, got it.  Now claim ownership
    FOwningThread := GetCurrentThreadId;
    FRecursionCount := 1;
    Result := True;
  end else if FOwningThread = GetCurrentThreadId then  // check for recursion
  begin
    // We're recursing now, but FLockCount still needs to be guarded
    TInterlocked.Increment(FLockCount);
    // Only the owning thread can increment this value so no need to guard it
    Inc(FRecursionCount);
    Result := True;
  end else
    Result := False;
end;

The first thing we want to do is just try to get the lock. That is what the CompareExchange does. (oh, BTW the TInterlocked class is just a collection of thin class static functions that wrap the OS interlockedXXXX functions). CompareExchange is more commonly referred to as a Compare and Swap or CAS for short. This atomic function is the basis by which nearly all other atomics can be implemented. Learn it. Understand it. It is that important. Since the FLockCount is initialized to -1, we try to swap in a 0 if and only if FLockCount is a -1. This is done as an atomic operation which means while this operation proceeds, no thread switching, or other memory modifications can occur from anywhere (other CPUs, DMA operations, etc.). The CPU Bus is locked down for the duration of that one operation which is typically a single CPU instruction, CMPXCH on the the x86 platform. This is extremely important for this all to work. If CompareExchange returns a -1, then we know that the original value of FLockCount was -1 and the 0 was successfully "swapped" in. If it returns anything other than -1, someone else has already gotten here first or we already hold the lock. We check the latter by looking at the FOwningThread field and comparing it to the current thread's Id. If we already own the lock, then this is a recursion case and we just need to do some housekeeping and return. Notice that we don't need to use the Interlocked inc on the FRecursionCount field since only the owner of the lock can touch that field. If some other thread owns the lock, then we simply return indicating that we cannot get the lock.

If TryEnter was the vegatables for this course, then Enter is the meat. This is where things can get interesting. Here's the Enter function in all it's glory.

function TMonitor.Enter(Timeout: Cardinal): Boolean;
var
  SpinCount: Integer;
begin
  // Get the spin count
  if FSpinCount > 0 then
  begin
    Result := TryEnter;
    if Result then
      System.Exit;
    SpinCount := FSpinCount;
    while SpinCount > 0 do
    begin
      // if there are already waiters, don't bother spinning
      if FLockCount > 0 then
        Break;
      // Try to get the lock
      if FLockCount = -1 then
        if TInterlocked.CompareExchange(FLockCount, 0, -1) = -1 then
        begin
          FOwningThread := GetCurrentThreadId;
          FRecursionCount := 1;
          Result := True;
          System.Exit;
        end;
      asm
        REP NOP // Just do nothing here...
      end;
      Dec(SpinCount);
      // Keep trying until the spin count expires
    end;
  end;
  Result := True;
  // Add our count to the lock
  if TInterlocked.Increment(FLockCount) > 0 then
  begin
    // Now check if we're the owner
    if FOwningThread = GetCurrentThreadId then
      // Yep, we're the owner so just recurse
      Inc(FRecursionCount)
    else
    begin
      // We're not the owner, so blocking is needed
      // GetEvent does a "safe" allocation of the Event
      Result := WaitForSingleObject(GetEvent, Timeout) = WAIT_OBJECT_0;
      if Result then
      begin
        FOwningThread := GetCurrentThreadId;
        FRecursionCount := 1;
      end else
        // We timed out, remove our presence from the lock count
        TInterlocked.Decrement(FLockCount);
    end;
  end else
  begin
    FOwningThread := GetCurrentThreadId;
    FRecursionCount := 1;
  end;
end;

Let's ignore the spin-lock at the beginning of this function for now. That section deserves to be covered separately. Starting right after the spin-lock loop, we first try to add our count and/or acquire the lock with a simple interlocked inc of the FLockCount field. If this lock was not already owned, the result will be 0 (remember it starts at -1) so the else section is taken and we claim ownership of the lock. If the result is greater than 0, then that means someone else has to the lock or we're in a recursion. We do the same FOwningThread check as was done in TryEnter and if we already own the lock simply add to the recursion count. If we don't own the lock, then we're in contention and have to block. Unlike TryEnter, Enter either returns and the lock is now owned by the calling thread, or it timed out and never was able to get the lock. This timeout is something that the OS' critical section does not support. We'll cover it in a moment, but another neat concurrency trick is in the GetEvent method. We call WaitForSingleObject on the event handle, which is an auto-reset event, with the timeout the caller specified. The overloaded/inlined Enter method simply calls this version of Enter with INFINITE as the timeout. If the wait returns that the object was signaled that means we now own the lock (FLockCount should be >= 0 at this point) and we can claim it by setting the FOwningThread and the initial recursion count. However, if we timed out and could not get the lock, we simply need to decrement the FLockCount field to remove the fact that we were waiting in line for the lock.

The simplest and easiest to understand function is Exit. Here's what it looks like:

procedure TMonitor.Exit;
begin
  CheckOwningThread;
  Dec(FRecursionCount);
  if FRecursionCount > 0 then
    TInterlocked.Decrement(FLockCount)
  else
  begin
    FOwningThread := 0;
    if TInterlocked.Decrement(FLockCount) >= 0 then
      SetEvent(GetEvent);
  end;
end;

The CheckOwningThread thread is simply a function that ensures that the calling thread does in fact own the lock. It will raise an exception if that is not the case because Exit is only valid to be called if the lock is owned by the calling thread. If we do own the lock (an exception wasn't raised), we just decrement the FRecursionCount field. Remember, this is safe because only the lock owner can touch that field. Now we check if we still have recursion counts because Enter was called more than once. If so we just need to remove our presence from the FLockCount field. If FRecursionCount drops to 0 (we still "own" the lock) we know we need to release the lock and make sure one of the waiters are woken up and given the lock. We clear the FOwningThread field to relinquish the claim of ownership and then remove the final count on the FLockCount field. If there are still counts on it the result of the Interlocked decrement will be greater than or equal to 0 which means there is at least one thread blocked waiting for the lock. In that case we just set the event and then return.

What about the GetEvent method I mentioned above? This is where our good friend the CAS function really helps us out. Here's the GetEvent method:

function TMonitor.GetEvent: THandle;
var
  Event: THandle;
begin
  Result := FLockEvent;
  if Result = 0 then
  begin
    Event := CreateEvent(nil, False, False, nil);
    Result := TInterlocked.CompareExchange(Integer(FLockEvent), Event, 0);
    if Result = 0 then
      // We won!  Nobody else was trying to allocate the Event.
      Result := Event
    else
      // Oh Well. We tried. Close the handle if someone got to it first.
      CloseHandle(Event);
  end;
end;

The first thing we do is simply check if it has already been allocated. If so, just return it. If not, we now enter a race to see which thread gets to create the event. This is an intentional race and if you look closely at the code the result of the CAS operation tells us whether or not we won the race. So we create the new event object, then try and swap it into the FLockEvent field. The CAS will make sure FLockEvent is 0 and of so assign the Event value to it. It will then return the original value of FLockEvent. If by the time the CAS operation is performed, another thread was able to assign the FLockEvent field, then the CAS operation won't change the FLockEvent field. It will still return the original value of FLockEvent which is assigned to the result. This leave the dangling Event local variable, which we simply close it to release it back to the OS. Theoretically, there could be as many events as there are threads running the GetEvent method created, but only one will ever be assigned to the FLockEvent. Remember, the code within the if block will only execute if at the point of the test, the FLockEvent field isn't assigned. Once some thread is able to allocate the event, the next time this function simply grabs the value of the field and returns it.

There you have it, we just recreated a lightweight critical section with the same advantages that the OS critical section possess. It tries to remain in user-space for most things and only uses a kernel object when there is contention for the resource. Next time I'll cover the spin-lock in the Enter method and explain how it can serve to boost performance on multiple processor (or multi-core) systems. I'll also present the Wait/Pulse/PulseAll methods.

NOTE: Yes, I know that there is a chance that the CreateEvent call in GetEvent can fail if the system is low on resources. There is an excellent explanation on Joe Duffy's blog about this along with some interesting tidbits about the new condition variables in Vista. If you're interested in threading, concurrency, and parallelism, Joe's blog is a must-read. Just today he's posted some random thoughts on handling blocked threads and work schedulers. Which is rather interesting because all this stuff about a TMonitor class will eventually come back around to my latest foray into creating a thread pool... which is actually along the lines of his #2 commonplace solution. Until Delphi gets some kind of continuation passing style (CPS) that will probably continue to be the solution.