Wednesday, October 31, 2007

I cut and I cut and it was still too short!

In keeping with the notion that I'm highly fallible and flawed human like so many others I just have to relay an important epiphany I had while playing some more with thread pools, parallelization of loops, and using a different thread pool technique. Actually, more than a week ago I'd planned on writing up this simply simply stellar posting about how "simple" paralleling can be done and the overall results of some "research" I'd been doing (you know, the 'R' in R&D). I'd even posted a little challenge for folks in this message. However, I'd actually spent the last few weeks becoming even more follicularly challenged.

I've discovered a crucial bit about how you divide a task up into smaller chunks for dispatching into a thread pool. The technique you use for chunking the task is highly dependent on the nature of the task itself [and the crowd sighs with a huge DUH!!!]. Obviously your algorithm cannot have interdependencies between the iterations, directly or indirectly. I got that one. Here's what I discovered.

I took my thread pool class and wrote a ParallelFor function that takes a low and high bound and a method pointer (this will become a generic function once we have Win32 generics fully functional). The routine would divide up the range into smaller "chunks" based on the number of processors available and then dispatch these smaller chunks into the thread pool. It would then block on a list of events until they were all signaled. This meant that all the sub-tasks were complete and the function can return. Works great!

Now I needed a sufficiently long running algorithm that also had the trait where each iteration is independent from all the others. I also wanted one that folks were familiar with. I've got it, the Sieve of Eratosthenes! Simple, iterative algorithm, right? Or so I thought. The first step was to code it up in a single-threaded manner and test it, get a base-line on the performance, and make sure I got the algorithm correct. Easy. Check. Now I need to figure out how to use my new handy-dandy "Ginsu Knife" ParallelFor procedure. So the sieve consists of an outer loop and an inner loop. Just pass the outer loop range to ParallelFor and then in the method callback perform the inner loop. Simple. I whipped this all up into my test application where I'd done some very simple "QueryPerformanceCounter()" profiling. I already knew the baseline performance of the algorithm, so I was all ready to see both cores on my "Yonah" based Dell D820 light up. You know that sinking feeling in your gut when you're all ready for something big and when it gets to the moment, it is a huge let-down? That's just what happened... The performance was nearly identical to the single-threaded version!

I spent the next couple of weeks on and off doing things like rewriting my thread-pool using this interesting technique referred to me by Chee Wee. I adjusted the number of chunks used for my ParallelFor function. Disabled SpeedStep on my machine. Verified that both cores were, in fact, being exercised. I wrote a little assembler function using the cpuid instruction to identify each core and make sure I'm seeing some distribution between the cores. Yep, both cores were lighting up... but why wasn't the performance any better??? Logic would dictate that it should. I had conservative expectations. Anywhere between 15 and 30% improvement would be great. Close to 50% would be perfect. I made sure the array was aligned... wait that doesn't matter since it was an array of bytes. It's mostly unaligned in that case. So I made it an array of Integers... nope virtually no performance change! Back to array of Byte. Maybe I need to increase the range of primes to calculate since there is some overhead in the ParallelFor procedure. So I ultimately increases it to 1million. Took longer, sure. But it was still the same performance as the single threaded version!!! By now I certain I'm destined to be immortalized for all to see on the the DailyWTF

Ok, time to drag out the big guns! So I fired up my trusty AQTime. After a couple of runs, I started noticing something interesting. Of the two threads being dispatched into, one was eating significantly more CPU cycles than the other. Why was that?? I rechecked the chunking in my ParallelFor function... nope it evenly divided the range up just fine. (Ok, the much more astute among you have long ago figured out my error) Finally, I went back and analyzed the sieve algorithm... Suddenly I realized what the problem was.

If you look at how the Sieve of Eratosthenes actually works, it doesn't really calculate which numbers are prime but rather which ones are not prime. When it is done, each element of the array that is not marked, are the prime numbers. So as the algorithm runs it is figuring all the numbers that have the current iteration's value as a factor. For instance, in a given finite range there are far more numbers that have 2 as a factor than say, 10. By chunking the algorithm in even linear chunks, the lower range has to do more work than the higher range.

The solution in this case was to provide a way to interleave the work among multiple threads such that each thread takes a fair share of the high-factor lower integers. Once I rewrote the ParallelFor algorithm to do this and retested the application I was greeted with a performance improvement very close to 50%! I'd finally got each core equally lit up!

In conclusion, based on this information, it is very clear that the way you break apart an algorithm is highly dependent on the nature of that algorithm. In the case of the Sieve of Eratosthenes, an interleaving algorithm is the optimal way to make it parallel. However, in the case of, say, doing an alpha-blend over a large, in memory image, the larger linear chunks may work just fine. Suppose you did have an algorithm that did have dependencies on adjacent iterations? In this case, the large linear chunk version may work better because you only need to worry about the edges. However, even in that case, such an algorithm is in danger of degrading back to being serial because you cannot calculate iteration 100 before you finish iterations 1-99. Sometimes you can minimize these interdependencies at the expense of using more memory.

Another interesting algorithm to make parallel is Conway's Game of Life. In this case you calculate the new generation based on a static version of the previous generation. I think I'll tackle that one and see if interleaving or chunking makes any difference. I'll be back with a report. This multi-core/parallel processing stuff is interesting, challenging, and overall forces you to think a little differently about your problem space.


  1. Hi Allen,

    Thanks for the post. Sounds like you are putting together something quite useful for parallization of certain tasks into a thread pool.

    Do you plan supplying the code for the framework you have put together (perhaps with the Seive example above) for the mere mortals among us to crib ;-)


  2. C Johnson,

    I couldn't agree more, however there *is* value in providing a set of building blocks that go a long way to taking some of the drudgery out of it. Ultimately, when user code is running in a thread, it is that user's responsibility to ensure that they don't inadvertently shoot themselves in the foot.

    Another danger is that a library does just enough to isolate the user from the underlying mechanics that it give a false sense of security when BAM! They're lost in a sea of threads with no clue about how to find dry land...


  3. You do mean 100% improvement, not 50%, right? Twice as many processors doing the same work.

    This is interesting stuff. Thanks for taking the time to write it up.

  4. Jim,

    I guess I should have said a 50% drop in total processing time. If I had a quad-core system, I'd expect to see about a 75% drop in total time.


  5. You have not been reading up on ParallelFX by any chance?


    When I read that (pFX), I also thought that it would be a great addition to Delphi...

  6. Tjipke,

    Thanks for the reference! That's a great article with a lot of excellent information. The funny thing is that I got a major sense of deja vu' when I read over it :-). Now all I need to get added to Delphi is anonymous methods. That coupled with generics and this is getting really interesting.


  7. Yes, I was going to point to th ParallelFX too. There is also a video with Anders H. and Joe Duffy:

    The key improvement here is that the scheduled tasks (division of the for loop) is independent from the physical processor they will run on. So if some chuncks take longer to run than others, more tasks will be automatically given idle processors.

    Also, a processor that blocks (during IO, for instance) can pick up other tasks in additional threads.

    Another reference is my old threading article from The Delphi Magazine:
    "Knitting Your Own Threads" in issue #40 (December 1998).
    The key there is to extend the VCL message loop to allow the main thread to react on event and semaphore signals without blocking and without missing any messages. There is also code to cleanly communicate between threads. Code is here:

    TDM is closing down, and I'm planning to republish some of my old articles on my blog beginning of next year... ;)

  8. There's ISAPIThreadPool unit which implements a thread pool using completion ports.

  9. "I believe" (I can here Donkey from Shrek in head here... one of Eddie's best ever roles) the all of this proves one point... the need for good debugger support... this is complex stuff VERY prone to Heisenbugs, this is where a none intrusive debugger comes in...


Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.