Wednesday, October 31, 2007

I cut and I cut and it was still too short!

In keeping with the notion that I'm highly fallible and flawed human like so many others I just have to relay an important epiphany I had while playing some more with thread pools, parallelization of loops, and using a different thread pool technique. Actually, more than a week ago I'd planned on writing up this simply simply stellar posting about how "simple" paralleling can be done and the overall results of some "research" I'd been doing (you know, the 'R' in R&D). I'd even posted a little challenge for folks in this message. However, I'd actually spent the last few weeks becoming even more follicularly challenged.

I've discovered a crucial bit about how you divide a task up into smaller chunks for dispatching into a thread pool. The technique you use for chunking the task is highly dependent on the nature of the task itself [and the crowd sighs with a huge DUH!!!]. Obviously your algorithm cannot have interdependencies between the iterations, directly or indirectly. I got that one. Here's what I discovered.

I took my thread pool class and wrote a ParallelFor function that takes a low and high bound and a method pointer (this will become a generic function once we have Win32 generics fully functional). The routine would divide up the range into smaller "chunks" based on the number of processors available and then dispatch these smaller chunks into the thread pool. It would then block on a list of events until they were all signaled. This meant that all the sub-tasks were complete and the function can return. Works great!

Now I needed a sufficiently long running algorithm that also had the trait where each iteration is independent from all the others. I also wanted one that folks were familiar with. I've got it, the Sieve of Eratosthenes! Simple, iterative algorithm, right? Or so I thought. The first step was to code it up in a single-threaded manner and test it, get a base-line on the performance, and make sure I got the algorithm correct. Easy. Check. Now I need to figure out how to use my new handy-dandy "Ginsu Knife" ParallelFor procedure. So the sieve consists of an outer loop and an inner loop. Just pass the outer loop range to ParallelFor and then in the method callback perform the inner loop. Simple. I whipped this all up into my test application where I'd done some very simple "QueryPerformanceCounter()" profiling. I already knew the baseline performance of the algorithm, so I was all ready to see both cores on my "Yonah" based Dell D820 light up. You know that sinking feeling in your gut when you're all ready for something big and when it gets to the moment, it is a huge let-down? That's just what happened... The performance was nearly identical to the single-threaded version!

I spent the next couple of weeks on and off doing things like rewriting my thread-pool using this interesting technique referred to me by Chee Wee. I adjusted the number of chunks used for my ParallelFor function. Disabled SpeedStep on my machine. Verified that both cores were, in fact, being exercised. I wrote a little assembler function using the cpuid instruction to identify each core and make sure I'm seeing some distribution between the cores. Yep, both cores were lighting up... but why wasn't the performance any better??? Logic would dictate that it should. I had conservative expectations. Anywhere between 15 and 30% improvement would be great. Close to 50% would be perfect. I made sure the array was aligned... wait that doesn't matter since it was an array of bytes. It's mostly unaligned in that case. So I made it an array of Integers... nope virtually no performance change! Back to array of Byte. Maybe I need to increase the range of primes to calculate since there is some overhead in the ParallelFor procedure. So I ultimately increases it to 1million. Took longer, sure. But it was still the same performance as the single threaded version!!! By now I certain I'm destined to be immortalized for all to see on the the DailyWTF

Ok, time to drag out the big guns! So I fired up my trusty AQTime. After a couple of runs, I started noticing something interesting. Of the two threads being dispatched into, one was eating significantly more CPU cycles than the other. Why was that?? I rechecked the chunking in my ParallelFor function... nope it evenly divided the range up just fine. (Ok, the much more astute among you have long ago figured out my error) Finally, I went back and analyzed the sieve algorithm... Suddenly I realized what the problem was.

If you look at how the Sieve of Eratosthenes actually works, it doesn't really calculate which numbers are prime but rather which ones are not prime. When it is done, each element of the array that is not marked, are the prime numbers. So as the algorithm runs it is figuring all the numbers that have the current iteration's value as a factor. For instance, in a given finite range there are far more numbers that have 2 as a factor than say, 10. By chunking the algorithm in even linear chunks, the lower range has to do more work than the higher range.

The solution in this case was to provide a way to interleave the work among multiple threads such that each thread takes a fair share of the high-factor lower integers. Once I rewrote the ParallelFor algorithm to do this and retested the application I was greeted with a performance improvement very close to 50%! I'd finally got each core equally lit up!

In conclusion, based on this information, it is very clear that the way you break apart an algorithm is highly dependent on the nature of that algorithm. In the case of the Sieve of Eratosthenes, an interleaving algorithm is the optimal way to make it parallel. However, in the case of, say, doing an alpha-blend over a large, in memory image, the larger linear chunks may work just fine. Suppose you did have an algorithm that did have dependencies on adjacent iterations? In this case, the large linear chunk version may work better because you only need to worry about the edges. However, even in that case, such an algorithm is in danger of degrading back to being serial because you cannot calculate iteration 100 before you finish iterations 1-99. Sometimes you can minimize these interdependencies at the expense of using more memory.

Another interesting algorithm to make parallel is Conway's Game of Life. In this case you calculate the new generation based on a static version of the previous generation. I think I'll tackle that one and see if interleaving or chunking makes any difference. I'll be back with a report. This multi-core/parallel processing stuff is interesting, challenging, and overall forces you to think a little differently about your problem space.