Friday, February 22, 2008

Thread pools Task handling.

The Delphi Parallel Library (DPL) tends to occupy a lot of my otherwise idle "thought" time. DPL has been redesigned in my head more times than I can count.  The good news is that it is conceptually beginning to actually take some shape and is finding a direction.  As I've stated before, a lot of the concepts behind the DPL are taken from Intel's TBB and Microsoft's TPL.

When I started down this road, it all began as an investigation into various ways of implementing a thread pool.  It turns out that this simplistic mechanism really doesn't truly address parallelism and concurrent processing.  It is merely a way of shoveling off a task into another thread, which may or may not execute concurrently.  A thread pool doesn't do anything to "load-balance" the work in a constructive manner, nor does it purport to.  That is left as an exercise for the user.

While reading and researching all the various approaches to handling parallelism, a few light bulbs started coming on.  I've been approaching this whole thing from the wrong angle.  I'd been trying to shoehorn task handling (a task being the smallest unit of work that can be made parallel) concepts into the more general notion of a thread pool.  I admit that sometimes I'm not the sharpest knife in the drawer, but I can recognize when one approach is superior to another.

Rather than starting with the concept of thread pools, a parallel library should really begin at the most basic level, the task or just a unit of work.  A task is merely some chunk of code that may or may not execute concurrently.  Another thing to remember is that a parallel library is about getting these tasks done as fast as absolutely possible.  It is also about keeping as many of the CPU cores as busy as possible with as little idle time and overhead.  It is not about fairness. It is about speed.

To best do this, the internal scheduling and management of tasks should control how they are scheduled and processed on the CPU cores.  Remember, we're talking about the actual real, hardware CPU core (or virtual core, like Hyperthreading).  Not a thread.  A thread is merely a way to easily segregate tasks among the cores.  We'll rely on the operating system to make sure that of the available cores, then there are equally as many "threads" actually burning CPU cycles for real work.  Any other threads are stuck waiting for their chance at a core.  Because of this, a task manager should endeavor to keep the number of running threads as close to equal to the number of available cores.  It should also equally handle 1 or 100 cores by making sure as many cores are a busy as possible.

Using a task based approach also requires that the threading and scheduling is completely (or nearly completely) hidden.  The user should never think of threads.  The user only deals with Tasks and the extra abstractions built on top of them.  So it looks like DPL is headed for another rewrite, which is good because each time I do this it is only getting better. I've not spent a whole lot of time on the current iteration as it stands, so this isn't much of a setback. 

Microsoft's TPL is one of the few things to come out in recent years that looks great and performs well.  The concepts behind it are very well suited for Delphi.  To get an idea of some of the things you can do with it (and eventually the DPL ;-) you really should watch this video.  The really interesting bits come on toward the last 1/2 to 1/3 of the video where they go into the idea behind work stealing and work load balancing.  Really, really interesting stuff.  Joe Duffy also addresses how you need to think about parallelism and how it applies to your situation.

As part of this, I just ordered Joe's soon to be released book on Concurrent Programming on Windows Vista.  You can actually buy the book online ahead of the actual print release and download a PDF of the work in progress until then.  From reading only a few of the online excerpts, it looks to be one of those "must have" tomes for anyone doing any kind of parallel programming in native Windows and in .NET.  Even though a lot of the information I know, it is good to have it codified in a consistent organized manner.  As I stated in my first post about thread pools, as I dig deeper into this whole subject, I want to make sure that Delphi developers at all levels will find some value in such a library.  The best thing you can do at this point is to learn as much about these concepts as you can.  Hopefully this blog and others I've been referencing and will reference in the future will be an excellent source of information.  I'm finding that as I write about this, it give me a chance to review my assumptions and design ideas with a critical eye.  Hopefully, we can all learn something along the way as well.  I know I have so far.


  1. did you notice this new product. Sounds like exactly what you are talking about...

  2. Re: "Because of this, a task manager should endeavor to keep the number of running threads as close to equal to the number of available cores. It should also equally handle 1 or 100 cores by making sure as many cores are a busy as possible."

    Well, true, as long as tasks are CPU intensive. Sometimes more is better - I do have some applications where I tune the thread pool to allow up to twice the number of CPUs number of running threads (I hope this is decipherable). The measurements showed that this is better then more executing a smaller number of threads.

    So I hope the DPL will have the CPU load factor configurable ...

  3. Gabr,

    That is the whole point. Parellelism is about making CPU intensive tasks faster by dividing them among the available cores. If your "tasks" spend most of the time blocked in I/O, then this is probably not the mechanism you'd want to use. A traditional thread pool or even manually spinning up threads would be more apropos.


  4. OK, I can accept that argument. Still, I think this decision massively decreases the usefuleness of the DPL. If I look at the projects I'm currently managing, I can't see one that can benefit from an only-CPU-intensive paralelization.

    But then, this may be caused by the hardware-related stuff I'm doing and other people may have totally different needs.

  5. Allen this is good stuff and very important. I think you would be doing the Delphi world a great service if you could get something like this coded up in WIN32 Delphi. Presumably Microsofts TPL contains many hidden details though, for example that it measures processor utilisation on a multi-core system before deciding which cores to give work to.

  6. I realize its not a 1 for 1 trade off, but I'm concerned that we're putting the cart before the horse here with the DPL. I am convinced there are issues with Delphi running on multi processor machines, we have a number of customers that have had to force processor affinity to 1 CPU due to a random lockup that happens during shut down of services. Our services use database access (tsqlconnection) and dll's, unfortunately I haven't been able to isolate the actual issue but seems to be happening in threadexit.
    I've put OutputdebugStrings messages into every spot of the code I could. After all the finalization sections run, it goes into ntdll.dll. The output window shows a ThreadExit message. If doesn't hang it starts to unload the libraries, if it hangs then thats it - no more messages. It seems like ntdll goes into a loop, but I can't figure out what it is doing. My guess is that 1 cpu destroys / locks something, and the 2nd cpu ends up dead locked. There is a QC that may be related (#47559) that details an issue with allocatehwnd.

  7. 1. Agree with gabr. Multithreading is not always for CPU-load only. Sometimes it's for better program logic. Even with a single CPU. The toolkit shoud target this task as well.

    2. Are there any plans to use the new functionality in the other VCL places: bakground loading of datasets, their blob parts and nested datasets in DataSnap? Modern web-based AJAX applications sometimis look even more responsive than classic Delphi ones. Will we see asynchronous events (OnClick, etc.) supported at the VCL core level?

  8. "A standard threadpool would be much better for that."

    A nice TThreadPool wouldn't hurt then, would it? ;-)

  9. Maxim,
    1. DPL is not about multithreading, it is about parallelism. If you need multithreading, thread-pools or spinning your own may best for your application.

    2. Once we have the library far enough along, we'll begin to look for places that could benefit from its use. For async events, the DPL would not be the mechanism of choice. A standard threadpool would be much better for that.


  10. Wow, I see I've just commented on Allen's post almost 6 hours before he's posted it. It's not a race condition, is it? :-)

  11. Hi Allen,
    Yes, nice work - these task libraries are nice things. But, imho, the main challenge, after providing all the required stuff for parallel programming (locks, transactional memory, message passing infrastructure, task libraries) will be to provide the necessary tooling
    for the developer to manage the increased complexity of programming in manycore environments. Imho, a visual approach like
    will help developers to master the inter-thread/task coordination.

  12. I fully support this line of inquiry! Not a bad way to spend your idle cycles. Keep going Allen!

  13. Ultimately Allen, I think this functionality should be built-in at the compiler level and activated by a switch, when a programmer chooses to allow his program to attempt to utilize all available cores as efficiently as possible. I don't really see this as a programming semantic at the VCL level, the more that I think about it.


  14. Dennis,

    Yes, I agree. However, until we've got a good foundation in the libraries, that is rather like building the kitchen before you've laid the foundation. I also agree that this is not specific to what one would think of as VCL. It is at a much lower-level than that.


  15. Allen,

    DPL sounds pretty interesting. If it can give us the benefits of TBB/TPL then that would be quite fabulous.

    I've just been at SD west and Herb Sutter (and others) talked quite a bit about the double checked locking pattern and why multi-threaded cache coherency issues can break it on some platforms. The solution for C++0x is the new atomic keyword. As Herb explained at some length this should really be thought of as ensuring that access to a variable is to be *ordered* atomic.

    It seems to me that such a keyword would have great value in Delphi. The advantages that spring to mind:

    1. Solving the problems with double checked locking and cache coherency.
    2. Removing the need for programmers to use hand-crafted calls to InterlockedXXXX routines for lock-free memory access. If this could work for 8 byte wide variables as well as 4 byte wide ones then that would be something else.

    This would really need to be built into the language and I can see that it is non-trivial. Probably the emitting of LOCK instructions and erection of memory fences isn't too hard. But the memory manager may need to have extended functionality to arrange that targets for the atomic keyword were properly aligned.

    Personally I wouldn't mind if it were left to me to make sure that my variables were byte aligned (I already know how to do that and the combination of FastMM and Delphi alignment directives allow me to do it).

    I'd be very interested to hear what your plans are in this area. Obviously locks are still very useful, but for many problems lock-free is critical. As the concurrency revolution continues lock-free is going to become more important too I suspect as newer architectures support even faster LOCKed operations.

    Thanks, David.

  16. Would it be possible to have the main page of your blog flow the way the individual posting/comments page does? I read on my iPhone and it's too awkward on the main page. Try reading both in a browser window 480 pixels wide to see what I mean.


  17. Hi

    It's been a year since this post. What is the situation on this library?


  18. Hi,

    I tried to use your library and it works quite nice; I started something simpler before, but your tricks with nested functions are super.
    First I want to report some bugs: NestedFor and NestedForWithStride use the highbound as length! the highbound value is not passed. For example, the ParallelLoop has an internal Len which is properly calculated (I didn't test it though).
    I would suggest to correct them (or rename parameters if it was intended).

    Other improvements ideas:
    - modify events to use "const", to avoid copying
    - use for TWorkerData and co records, it should be faster (no additional initializations...)

    Do you have an updated version?


  19. I`m sorry to bother you in this post.. but..
    The vcl library that covers win32api controls didnt change from the version 5 of delphi and this is just like a pain in the a** ;-)


  20. I just want to emphasizes to David Heffernan's comment.
    The low level implementation of the RTL does rely on the LOCK asm instruction, in the heap manager (FastMM4) and for strings and dynamic arrays. The reference count is made with a x86 LOCK instruction, which freezes all cores in order to synchronize the reference count access for all threads.
    Even with the DPL, you will loose most benefit of multi-core CPU if the RTL is not modified.


Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.