Wednesday, October 31, 2007

I cut and I cut and it was still too short!

In keeping with the notion that I'm a highly fallible and flawed human like so many others, I just have to relay an important epiphany I had while playing some more with thread pools, parallelization of loops, and a different thread pool technique. Actually, more than a week ago I'd planned on writing up this simply stellar posting about how "simple" parallelizing can be done and the overall results of some "research" I'd been doing (you know, the 'R' in R&D). I'd even posted a little challenge for folks in this message. However, I've actually spent the last few weeks becoming even more follicularly challenged.

I've discovered a crucial bit about how you divide a task up into smaller chunks for dispatching into a thread pool. The technique you use for chunking the task is highly dependent on the nature of the task itself [and the crowd sighs with a huge DUH!!!]. Obviously your algorithm cannot have interdependencies between the iterations, directly or indirectly. I got that one. Here's what I discovered.

I took my thread pool class and wrote a ParallelFor function that takes a low and high bound and a method pointer (this will become a generic function once we have Win32 generics fully functional). The routine would divide the range into smaller "chunks" based on the number of processors available and then dispatch these smaller chunks into the thread pool. It would then block on a list of events until they were all signaled. This meant that all the sub-tasks were complete and the function could return. Works great!
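The shape of it is roughly this (a sketch only: Pool, QueueWorkItem, and GetProcessorCount are illustrative stand-ins for my actual class, and error handling is elided; assumes Windows and SyncObjs in the uses clause):

type
  // Illustrative callback type: the method is handed a sub-range to process.
  TRangeMethod = procedure(Low, High: Integer) of object;

procedure ParallelFor(Low, High: Integer; Method: TRangeMethod);
var
  I, Count, ChunkSize, ChunkLow, ChunkHigh: Integer;
  Events: array of TEvent;
begin
  Count := GetProcessorCount;        // hypothetical helper
  ChunkSize := (High - Low + 1) div Count;
  SetLength(Events, Count);
  for I := 0 to Count - 1 do
  begin
    Events[I] := TEvent.Create(nil, True, False, '');
    ChunkLow := Low + I * ChunkSize;
    if I = Count - 1 then
      ChunkHigh := High              // the last chunk absorbs any remainder
    else
      ChunkHigh := ChunkLow + ChunkSize - 1;
    // Hand the sub-range to the pool; the work item signals its event
    // when it finishes.
    Pool.QueueWorkItem(ChunkLow, ChunkHigh, Method, Events[I]);
  end;
  // Block until every chunk has completed.
  for I := 0 to Count - 1 do
    Events[I].WaitFor(INFINITE);
  for I := 0 to Count - 1 do
    Events[I].Free;
end;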

Now I needed a sufficiently long-running algorithm that also had the trait that each iteration is independent of all the others. I also wanted one that folks were familiar with. I've got it: the Sieve of Eratosthenes! Simple, iterative algorithm, right? Or so I thought. The first step was to code it up in a single-threaded manner and test it, get a baseline on the performance, and make sure I got the algorithm correct. Easy. Check. Now I needed to figure out how to use my new handy-dandy "Ginsu Knife" ParallelFor procedure. The sieve consists of an outer loop and an inner loop. Just pass the outer loop range to ParallelFor and then perform the inner loop in the method callback. Simple. I whipped this all up into my test application, where I'd done some very simple QueryPerformanceCounter() profiling. I already knew the baseline performance of the algorithm, so I was all ready to see both cores on my "Yonah" based Dell D820 light up. You know that sinking feeling in your gut when you're all ready for something big and, when it gets to the moment, it is a huge let-down? That's just what happened... The performance was nearly identical to the single-threaded version!
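For reference, the core of the sieve looks roughly like this (a sketch: Flags and MaxNumber are illustrative, True meaning "not prime", and the method-pointer plumbing is elided). Note that the outer iterations stay independent because every I unconditionally marks its multiples; redundant marks are harmless:

const
  MaxNumber = 1000000;
var
  Flags: array[2..MaxNumber] of Boolean;  // True = "not prime"

procedure MarkMultiples(Low, High: Integer);
var
  I, J: Integer;
begin
  // Each outer iteration I marks every multiple of I. Iterations are
  // independent: we mark unconditionally, and redundant marks are harmless.
  for I := Low to High do
  begin
    J := I * 2;
    while J <= MaxNumber do
    begin
      Flags[J] := True;
      Inc(J, I);
    end;
  end;
end;

// Single-threaded baseline:
//   MarkMultiples(2, Trunc(Sqrt(MaxNumber)));
// The "obvious" parallel attempt: linear chunks of the outer range:
//   ParallelFor(2, Trunc(Sqrt(MaxNumber)), MarkMultiples);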

I spent the next couple of weeks, on and off, doing things like rewriting my thread pool using this interesting technique referred to me by Chee Wee. I adjusted the number of chunks used by my ParallelFor function. Disabled SpeedStep on my machine. Verified that both cores were, in fact, being exercised. I wrote a little assembler function using the cpuid instruction to identify each core and make sure I was seeing some distribution between the cores. Yep, both cores were lighting up... but why wasn't the performance any better??? Logic would dictate that it should be. I had conservative expectations; anywhere between 15 and 30% improvement would be great, and close to 50% would be perfect. I made sure the array was aligned... wait, that doesn't matter since it was an array of bytes; it's mostly unaligned in that case. So I made it an array of Integers... nope, virtually no performance change! Back to array of Byte. Maybe I needed to increase the range of primes to calculate, since there is some overhead in the ParallelFor procedure. So I ultimately increased it to 1 million. Took longer, sure. But it was still the same performance as the single-threaded version!!! By now I was certain I was destined to be immortalized for all to see on the DailyWTF.

Ok, time to drag out the big guns! So I fired up my trusty AQTime. After a couple of runs, I started noticing something interesting. Of the two threads being dispatched into, one was eating significantly more CPU cycles than the other. Why was that? I rechecked the chunking in my ParallelFor function... nope, it evenly divided the range up just fine. (Ok, the much more astute among you have long ago figured out my error.) Finally, I went back and analyzed the sieve algorithm... and suddenly I realized what the problem was.

If you look at how the Sieve of Eratosthenes actually works, it doesn't really calculate which numbers are prime but rather which ones are not prime. When it is done, the elements of the array that remain unmarked are the prime numbers. So as the algorithm runs, it is marking all the numbers that have the current iteration's value as a factor. For instance, in a given finite range there are far more numbers that have 2 as a factor than, say, 10. By chunking the algorithm into even linear chunks, the lower range has to do far more work than the higher range.

The solution in this case was to provide a way to interleave the work among multiple threads such that each thread takes a fair share of the high-factor lower integers. Once I rewrote the ParallelFor algorithm to do this and retested the application, I was greeted with a performance improvement very close to 50%! I'd finally gotten each core equally lit up!
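The interleaved split looks something like this (again a sketch, reusing MarkMultiples from the earlier snippet; Worker is this chunk's index and WorkerCount the number of chunks):

procedure MarkMultiplesInterleaved(Worker, WorkerCount: Integer);
var
  I: Integer;
begin
  // Worker W of N takes I = 2+W, 2+W+N, 2+W+2N, ... so the work-heavy
  // small factors (2, 3, 4, ...) are spread evenly across the workers.
  I := 2 + Worker;
  while I <= Trunc(Sqrt(MaxNumber)) do
  begin
    MarkMultiples(I, I);  // mark the multiples of this one factor
    Inc(I, WorkerCount);
  end;
end;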

In conclusion, it is very clear that the way you break apart an algorithm is highly dependent on the nature of that algorithm. In the case of the Sieve of Eratosthenes, an interleaving algorithm is the optimal way to make it parallel. In the case of, say, doing an alpha-blend over a large in-memory image, larger linear chunks may work just fine. Suppose you did have an algorithm with dependencies between adjacent iterations? In that case, the large-linear-chunk version may work better, because you only need to worry about the edges. Even then, however, such an algorithm is in danger of degrading back to being serial, because you cannot calculate iteration 100 before you finish iterations 1-99. Sometimes you can minimize these interdependencies at the expense of using more memory.

Another interesting algorithm to make parallel is Conway's Game of Life. In this case you calculate the new generation based on a static version of the previous generation. I think I'll tackle that one and see if interleaving or chunking makes any difference. I'll be back with a report. This multi-core/parallel processing stuff is interesting, challenging, and overall forces you to think a little differently about your problem space.
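If I chunk it by rows, the per-chunk callback would look something like this sketch (grid sizes and names are illustrative, not from any actual project):

const
  GridWidth = 64;
  GridHeight = 64;
var
  PrevGrid, NewGrid: array[0..GridHeight - 1, 0..GridWidth - 1] of Boolean;

function CountNeighbors(Row, Col: Integer): Integer;
var
  R, C: Integer;
begin
  Result := 0;
  for R := Row - 1 to Row + 1 do
    for C := Col - 1 to Col + 1 do
      if ((R <> Row) or (C <> Col)) and
         (R >= 0) and (R < GridHeight) and (C >= 0) and (C < GridWidth) and
         PrevGrid[R, C] then
        Inc(Result);
end;

procedure NextGenerationRows(LowRow, HighRow: Integer);
var
  Row, Col, N: Integer;
begin
  // Every cell of the new generation reads only the (static) previous
  // generation, so any row range can be computed independently.
  for Row := LowRow to HighRow do
    for Col := 0 to GridWidth - 1 do
    begin
      N := CountNeighbors(Row, Col);
      NewGrid[Row, Col] := (N = 3) or ((N = 2) and PrevGrid[Row, Col]);
    end;
end;

Since every cell reads only the previous generation, there are no interdependencies between rows; either chunking or interleaving should be correct, and the question is purely which balances the load better.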

Tuesday, October 30, 2007

You want to do what?

I always get a kick out of reading Raymond Chen's blog, The Old New Thing.  So many of his posts hit home way too often :).  One of his latest posts highlights one thing I always try to do, and try to get folks to do, when they ask for support or help with such-and-such.  I've caught myself doing this same thing many times.  Don't ask how to implement your grandiose solution; simply present the problem you're trying to solve and then present your proposed solution.

Why present the problem?  You've already thought about it (I presume) up one side and down the other.  "I shouldn't have to bother this person with my problems; they're mine to solve, dag-nabit!"  First of all, why are you even asking for help?  If you're supposed to be "God's gift to programming," why are you seeking advice?  You're not?  You're a fallible, imperfect human?  Me, too!  Good, now that we're all on the same level here, back to why you should present the problem.  When you present and frame the question in terms of the problem you're attempting to solve, you have the highest chance of getting the information you want.  In fact, in cases too numerous to count, I've gotten better solutions to my original problem than my own "awe-inspiring" solution.

I see this same thing all too often in the peer-support forums and newsgroups.  A person will pop in asking for help with how to implement their solution, only to be disappointed by the responses due to a lack of common information.  Too much inference is going on.  The same is also true of QualityCentral reports.  Rather than simply stating that such-and-such is broken, a detailed explanation along with detailed steps goes a long way toward getting your point across.  My parents always used to tell me, "To be terrific, you must be specific."

Wednesday, October 17, 2007

Enumerators, Inlining, methods on records, for-in and deadlocks

The ever-present, over-achieving Hallvard Vassbotn has just posted an excellent analysis of how the Delphi compiler generates code for the "for-in" statement. He also presented many suggestions on how to make this as efficient as possible.  I will be taking his suggestions back to the team to get folded into the next release.  You should also be sure to read Primoz Gabrijelcic's blog posts on writing enumerators.
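For those who haven't played with them yet, the pattern the compiler looks for in a for-in statement is a GetEnumerator method returning an object with a public MoveNext function and a readable Current property.  A bare-bones sketch:

type
  // Minimal enumerator shape: public MoveNext returning Boolean and a
  // readable Current property.
  TRangeEnumerator = class
  private
    FCurrent, FHigh: Integer;
  public
    constructor Create(Low, High: Integer);
    function MoveNext: Boolean;
    property Current: Integer read FCurrent;
  end;

  TRange = class
  private
    FLow, FHigh: Integer;
  public
    constructor Create(Low, High: Integer);
    // The presence of GetEnumerator is what enables "for I in RangeObj do".
    function GetEnumerator: TRangeEnumerator;
  end;

constructor TRangeEnumerator.Create(Low, High: Integer);
begin
  inherited Create;
  FCurrent := Low - 1;  // positioned just before the first element
  FHigh := High;
end;

function TRangeEnumerator.MoveNext: Boolean;
begin
  Inc(FCurrent);
  Result := FCurrent <= FHigh;
end;

constructor TRange.Create(Low, High: Integer);
begin
  inherited Create;
  FLow := Low;
  FHigh := High;
end;

function TRange.GetEnumerator: TRangeEnumerator;
begin
  Result := TRangeEnumerator.Create(FLow, FHigh);
end;

// Usage:
//   for I in Range do
//     ...

The compiler generates the code that drives MoveNext/Current for you, which is exactly the code Hallvard's analysis picks apart.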

 

I noticed that I haven't gotten any "bites" on my challenge to spot the deadlock...  Since returning to the office after my trip to Kona, HI, I've been working on several things not related to multi-core, thread-pooling, parallelizing, etc., mostly surrounding the next Delphi/C++Builder release.  I'll come back to threading soon.

Friday, October 5, 2007

Spot the deadlock

This is the final full day I'll be here in Kona, Hawaii for the ANSI/ISO C++ standards meeting, and they're just going through some of the minor issues, or actually "bugs," in the standard. Meanwhile, I've been playing with the thread pool class from my post about thread pools. I included a small project to demonstrate the concept. However, in working with it, I found a subtle deadlock condition. Can you spot it? I'll give you a hint: SendMessage. I think this highlights the complexity that can creep into even the seemingly simplest multithreaded code. Of course, I wasn't being particularly careful while writing the code, as it was only a conceptual exercise.
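For those who want the general shape of this kind of deadlock without digging through the project, here's a sketch (TMainForm, Pool, DoneEvent, and WM_WORKDONE are illustrative stand-ins, not the actual sample code):

const
  WM_WORKDONE = WM_USER + 1;  // hypothetical notification message

procedure TMainForm.DoWork(Sender: TObject);
begin
  // Runs on a pool thread. SendMessage cannot return until the main
  // thread processes the message...
  SendMessage(Handle, WM_WORKDONE, 0, 0);
  DoneEvent.SetEvent;
end;

procedure TMainForm.StartClick(Sender: TObject);
begin
  Pool.QueueWorkItem(DoWork);
  // ...but the main thread is parked right here, not pumping messages.
  // Neither thread can make progress.
  DoneEvent.WaitFor(INFINITE);
end;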

One common thing you may want to do with thread pools is to dispatch several tasks to the pool, do some other work, and then wait for the results. Another twist is that you want to continue working in the UI and then, at some point in the future, have a method called when the set of tasks is completed. I've been adding some code to the SyncObjs.pas unit, specifically to the THandleObject class. I've added a class function called WaitForMultiple(), which allows you to pass in an array of THandleObject instances on which to wait. You can wait for one or all. This is just a wrapper around WaitForMultipleObjectsEx. In the thread pool class, I've added a parameter to the QueueXXWorkItem methods allowing you to optionally pass in a TEvent object that will be signaled once the work item completes. Since you may want to dispatch many work items, all of which must be completed before continuing, you'd call THandleObject.WaitForMultiple() or a new class method on the TThreadPool class.
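Usage would look something like this (a sketch only; the exact signatures are still in flux):

var
  Done: array of THandleObject;
  I: Integer;
begin
  SetLength(Done, 3);
  for I := 0 to High(Done) do
  begin
    Done[I] := TEvent.Create(nil, True, False, '');
    // Hypothetical queue call: the pool signals the event when the
    // work item completes.
    Pool.QueueUserWorkItem(SomeTask, TEvent(Done[I]));
  end;
  // Do other useful work here, then block until all of them have signaled.
  THandleObject.WaitForMultiple(Done, INFINITE, True);  // sketch signature
  for I := 0 to High(Done) do
    Done[I].Free;
end;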

An interesting thing to note: if the UI thread blocks waiting for the work items, it seems that the items are never dispatched to the worker threads at all and the application hangs. There is nothing about this behavior in the MSDN documentation for QueueUserWorkItem. If I perform the wait on a separate wait thread, it seems to work OK. Any ideas? Are the tasks dispatched onto the worker threads via synchronization with the message queue? I would have figured that there was a control thread that waits for things to appear in the work item queue and then does the dispatching.

Thursday, October 4, 2007

ANSI/ISO C++ meeting Day 4

Shameless namedrop... sitting across the table from Herb Sutter (from MS) and Bjarne Stroustrup (from AT&T Labs) in the Evolution/Concurrency Working Group discussing http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2407.html.  Just to make it clear to the Delphi and C++Builder folks: we've had this for many years in the form of packages.  It's good that the standards bodies are getting on board with this concept.  Of course, there are many issues with defining a standard that can be implemented among various platforms, mainly Unix (and similar) and Windows...  Well, it looks like this proposal has died in committee, because many felt they would be unable to finalize it in time to get a new standard out by 2009 (aka C++09, formerly known as C++0x).

The next discussion is more informal and is related to garbage collection... should be interesting.  Apparently there will be a minimal specification of garbage collection in the next standard (at least at this point).  The main sticking points are obscured pointers, leak detection, and destructors.  No, no... GC, as defined in the original proposal, is not in.  Geez, it is hard to track all the different proposals and their status as they wind their way through the various subgroups.  It looks like they've decided to put in some special library functions that allow certain annotations to give "hints" to any collector about whether or not a memory region contains reachable pointers.

Great googly-moogly, now the discussion is about the definition of "may"...  now they're discussing that "need not" is the negative form of "may."  Stop before I go insane :-).

Now I'm in the afternoon session, where they're finally trying to nail down the final proposal for the threading library.  They're working through some interesting problems with the try_lock() function.  It seems that there are cases where try_lock() can return false even if the lock is available.  The details are somewhat abstract and platform-specific.  In properly written code, a "spurious failure" of try_lock() is not really a problem.  Here is the one degenerate case, as presented by Hans Boehm (from HP), translated to Delphi code:

var
  x: Integer;
  M: TMutex;

// Thread #1:
x := 1;
M.Lock;

// Thread #2:
while (M.TryLock) do
  M.Unlock;
Assert(x = 1);

It was quickly pointed out that this is considered "bad" programming practice.  The issue is that there is actually a race: thread #2 infers from a TryLock failure that thread #1 must hold the lock and therefore must already have written x := 1, but if try_lock() can fail spuriously (that is, fail to acquire the lock even though it is available), that inference no longer holds and the assert can fire.  On multi-processor/core systems with guaranteed cache coherency, it is hard to see a case where a "compare and swap" (the typical underlying mechanism for implementing a mutex) operation would fail spuriously.  I do agree that with properly written code (NOT the code above :-)), this should not be a problem.

Wednesday, October 3, 2007

ANSI/ISO C++ meeting Day 3

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2320.html - Went over this paper some more and rehashed the issue surrounding whether or not the thread object should detach or join the underlying OS thread in its destructor.  Join and detach are POSIX thread notions and roughly correspond to Windows WaitForSingleObject(<thread handle>) and CloseHandle(<thread handle>).  It looks like they're going with the detach rather than the join, mainly because they had no known prior art where a thread object does a join in the destructor.  Alisdair, at my prompting, did make it known that the Delphi TThread object does, in fact, do a join in the destructor.  So... there's my contribution to the C++ standards process... they're still going with the detach, but it was a long, long discussion.
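In Delphi terms, that prior art is the familiar pattern below (TMyWorker stands in for any TThread descendant; the destructor behavior is the point):

var
  Worker: TThread;
begin
  Worker := TMyWorker.Create(False);
  // ... do other work while the thread runs ...
  Worker.Free;  // Destroy does (roughly) Terminate followed by WaitFor,
                // i.e. a join: the OS thread is gone before the object is.
end;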

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2410.html - This paper was just covered in the final session of the day.  What is interesting is how they define the thread-safety guarantee.  There is the notion of basic thread-safety and strong thread-safety.  Adding "strong" thread-safety, in which a globally shared object can be mutated by more than one thread in a safe manner, has a significant negative impact on performance.  There are so many notions of "thread-safety" that it is hard to nail down just one... some are small in scope, dealing only with atomic operations on single "memory locations," while the much broader notion that many folks seem to relate to is the strong one.  The problem is that even this "strong" notion doesn't mean "I can write my program in any way I see fit without regard to how the threads interact."  Even with all the work going on to leverage multi-core systems, both in the libraries and intrinsically by the compiler, the results are not "strongly" thread-safe in that broad, naive sense.

Now they're talking about random number generators...  Cool.. I needed a nap :-).

Tuesday, October 2, 2007

ANSI/ISO C++ meeting Day 2

OK, it is actually getting somewhat interesting.  Aside from the myriad straw-polls and various other procedural items, the technical discussions have some interesting bits.  We've just finished discussing atomics per this paper: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2393.html.  It was actually moderated by one of the authors, Lawrence.

For the Delphi folks on the x86 architecture: it is generally assumed that most memory operations are sequentially consistent without the need for explicit memory fences (at least in user mode, where Delphi apps typically live).  However, on some other architectures, like PowerPC, Itanium, and others, explicit memory barriers need to be inserted.  The notion of atomics codifies this distinction: a "normal" variable does not necessarily have acquire/release semantics, whereas an atomic variable does.  It turns out that the C++ volatile modifier is wholly inadequate for describing the notion of acquire/release.
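A small Delphi illustration of the distinction (a sketch; assumes Windows in the uses clause).  On x86 the plain aligned read below happens to be atomic and ordered, which is exactly the assumption being codified; on the other architectures it would need an explicit acquire barrier:

var
  SharedValue: Integer;
  DataReady: Integer = 0;

// Writer thread: write the data first, then publish it by setting the
// flag with a full-barrier interlocked operation (release semantics).
procedure Publisher;
begin
  SharedValue := 42;
  InterlockedExchange(DataReady, 1);
end;

// Reader thread: spin until the flag is seen, then read the data.
procedure Consumer;
begin
  while DataReady = 0 do   // plain aligned read: fine on x86, needs an
    Sleep(0);              // acquire barrier on PowerPC/Itanium
  Assert(SharedValue = 42);
end;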

Another funny item that became a sticking point was whether or not bool (aka Boolean in Delphi parlance) is bitwise comparable.  Since a bool has one and only one bit pattern for "false" (typically zero), yet many possible bit patterns for "true" (any non-zero value), you cannot guarantee an atomic "compare and swap" for a bool type; for example, $01 and $FF both mean True, yet a bitwise compare-and-swap expecting $01 will fail if the location happens to hold $FF.  It is intriguing that such a seemingly simple, fundamental type can present such a problem.

We're now talking about this proposal for a multi-threading library: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2320.html, being presented by one of the authors, Howard.  The version being presented has had the thread cancellation section removed and explicitly made undefined.  This is because of the notion that cancellation is done via an exception signal... which could be caught, effectively canceling the cancel request.  This ended up being rife with other issues (besides the fact that a thread can deny a cancellation request, nay, demand), like intermixing non-C++ frames or even intervening OS frames.

The discussion of condition variables, for which Windows doesn't have an intrinsic primitive (not until Windows Vista), has been somewhat interesting (and another sticking point).  Pete Becker (former Borland C++ RTL guru) has raised the issue that the current proposal seems to be designed with a bias toward POSIX (by way of Boost) and unnecessarily complicates the implementation of condition variables for systems that don't have a native condition-variable-like thingy.  In looking at potential implementations for a condition variable on Windows in the absence of native support, it becomes abundantly clear that this isn't an easy problem to solve.  Geez... the machinations that these various algorithms go through are astounding.  However, a condition variable construct makes it much easier to implement things like thread-safe message queues and stacks, where the producer wants to know about the "not full" state and the consumer is interested in the "not empty" state.
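For contrast, here is the sort of thing you end up writing on Windows without condition variables: a minimal bounded-queue sketch built from Win32 semaphores and a critical section (illustrative only, error handling omitted; assumes Windows in the uses clause):

const
  QueueSize = 16;

type
  TBoundedQueue = class
  private
    FItems: array[0..QueueSize - 1] of Pointer;
    FHead, FTail: Integer;
    FLock: TRTLCriticalSection;
    FSlots, FFilled: THandle;  // semaphores: free slots / filled slots
  public
    constructor Create;
    destructor Destroy; override;
    procedure Push(Item: Pointer);   // blocks while full ("not full" wait)
    function Pop: Pointer;           // blocks while empty ("not empty" wait)
  end;

constructor TBoundedQueue.Create;
begin
  inherited Create;
  InitializeCriticalSection(FLock);
  FSlots := CreateSemaphore(nil, QueueSize, QueueSize, nil);
  FFilled := CreateSemaphore(nil, 0, QueueSize, nil);
end;

destructor TBoundedQueue.Destroy;
begin
  CloseHandle(FFilled);
  CloseHandle(FSlots);
  DeleteCriticalSection(FLock);
  inherited Destroy;
end;

procedure TBoundedQueue.Push(Item: Pointer);
begin
  WaitForSingleObject(FSlots, INFINITE);  // wait for a free slot
  EnterCriticalSection(FLock);
  try
    FItems[FTail] := Item;
    FTail := (FTail + 1) mod QueueSize;
  finally
    LeaveCriticalSection(FLock);
  end;
  ReleaseSemaphore(FFilled, 1, nil);      // announce "not empty"
end;

function TBoundedQueue.Pop: Pointer;
begin
  WaitForSingleObject(FFilled, INFINITE); // wait for an item
  EnterCriticalSection(FLock);
  try
    Result := FItems[FHead];
    FHead := (FHead + 1) mod QueueSize;
  finally
    LeaveCriticalSection(FLock);
  end;
  ReleaseSemaphore(FSlots, 1, nil);       // announce "not full"
end;

The two semaphores stand in for the "not full" and "not empty" conditions; a real condition variable would let you wait on arbitrary predicates against the same lock, which is exactly the part that is hard to emulate.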

 

Afternoon session:

OK, we just spent over an hour wrestling with the notion of whether or not a reference is also a memory location (as defined by this document: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2334.htm).  Even though you cannot take the address of a C++ reference, it may in fact occupy a memory location.  So, given that, should it share the same semantics as a regular memory location?  It all sounded so metaphysical for such an analytical group.  I understand the arguments on both sides; however, the one argument I could not disagree with is that unless you call it out, some "clever" compiler writer may decide to do something "special" with references, regardless of whether or not they live in a memory location (references can be collapsed by optimization).  Finally we're past that... now we're trying to define what a "thread" is.  Pretty soon my head is going to explode :-).  Blah, blah, blah...

 

I think that is it for day two... we'll see if I have the gumption to say anything about day 3 :-)

Monday, October 1, 2007

ANSI/ISO C++ meeting Day 1

Two words.  Boring.  Tedious.  This is simply where the previous meeting's minutes are approved by the voting members, the agenda for the week is set, and specific mechanical meeting issues are discussed.  This includes when we're going to get WiFi in the meeting rooms.  I don't particularly care, since I'm using my cell phone as a data channel :-).  If I can stay awake, I'll try to report some more...  Once we get into the breakouts, where the actual concepts are being discussed, I'm sure things will perk up.