I'm sure most, if not all, of you have heard the phrase "Why reinvent the wheel?" Another one I'm very fond of is "Standing on the shoulders of giants." The latter was often heard coming from Distinguished Engineer, Anders Hejlsberg during his tenure at Borland. So, why reinvent the wheel when there are giant shoulders to stand on ;-)?
If you've been following along for the last year or so, I've been working on something I loosely refer to as the Delphi Parallel Library. While I've not been blogging about the DPL for a while, I have been going back to it from time to time as I gain more knowledge and more information into the fascinating subject of parallelism and task dispatching. There is a veritable wellspring of information flooding the internet on this subject, so keeping up on it has been a daunting task. One excellent source of information is Joe Duffy the lead for the Task Parallel Libary team at Microsoft. Another is an interesting project at MIT called Cilk (that's pronounced "silk"). There is a spin-off company that is set to deliver on the work started from the Cilk project, called Cilk Arts. There is the Thread Building Blocks library from Intel. Finally, there are many smaller upstart projects, open source and otherwise out there. The point is that this is a hot topic.
One of the interesting facets about all of the projects I mentioned above, is that none of these libraries are really about "threading." They're about breaking down iterative algorithms into smaller and smaller chunks of work or "tasks" and then working on ways to execute as many of these tasks in parallel on separate cores. They're about scalability and performance. They try and minimize the overhead associated with traditional threading models. As I get deeper into creating a Delphi Parallel Library, the plan is to cull together some of the key concepts from all the existing research and libraries out there and use them as a "blueprint" for the DPL. I want to take these concepts and apply them in a "Delphi-like" manner that would be familiar and intuitive to the Delphi programmer while leveraging as much of the newer Delphi language features such as Generics and Anonymous methods. Joe Duffy's series about creating a Work Stealing Queue and combining it with a thread pool in which forms the basis for the .NET TPL is an excellent source of information. This whole notion of distributed work queues and work stealing is also something that the Cilk project has also done. A lot of the current research on parallelism also relies heavily on the latest in lock-free programming and algorithms. An excellent source for various lock-free lists, stacks and queues for the Delphi & C# folks is Julian Bucknall's blog.
The best way to learn is to actually do. While playing with some of these ideas and concepts, I've found that I really need to have an easy way to do some simple performance measuring. In .NET, there is an interesting little class, System.Diagnostics.Stopwatch. As I begin to lay the foundation, some of these little utility classes will be heavily used. Why not provide a stopwatch gizmo for Delphi? While the stopwatch is just a thin wrapper around the QueryPerformanceCounter() and QueryPerformanceFrequency() APIs, it provides a neatly encapsulated way to start, stop, reset, for measuring elapsed time. When I looked at the various use-cases for the stopwatch, it became clear that implementing this thing as a Delphi class would be require more housekeeping overhead than I wanted. Instead, I opted for doing this as a record with methods. Sorry, no Generics or Anonymous Methods in this one :-). Here's a Delphi version of the declaration:
TStopwatch = record
strict private
FElapsed: Int64;
FRunning: Boolean;
FStartTimeStamp: Int64;
function GetElapsedMilliseconds: Int64;
function GetElapsedTicks: Int64;
function GetElapsedSeconds: Double;
class procedure InitStopwatchType; static;
public
class function Create: TStopwatch; static;
class function GetTimeStamp: Int64; static;
procedure Reset;
procedure Start;
class function StartNew: TStopwatch; static;
procedure Stop;
property ElapsedMilliseconds: Int64 read GetElapsedMilliseconds;
property ElapsedTicks: Int64 read GetElapsedTicks;
property ElapsedSeconds: Double read GetElapsedSeconds;
property IsRunning: Boolean read FRunning;
public class var Frequency: Int64;
public class var IsHighResolution: Boolean;
end;
Here's a simple use case using Barry's little benchmarker class (only the actual benchmarking functions):
class function TBenchmarker.Benchmark(const Code: TProc; Iterations,
Warmups: Integer): Double;
var
SW: TStopwatch;
i: Integer;
begin
for i := 1 to Warmups do
Code;
SW := TStopwatch.StartNew;
for i := 1 to Iterations do
Code;
SW.Stop;
Result := SW.ElapsedSeconds / Iterations;
end;
function TBenchmarker.Benchmark<T>(const Name: string; const Code: TFunc<T>): T;
var
SW: TStopwatch;
i: Integer;
begin
for i := 1 to FWarmups do
Result := Code;
SW := TStopwatch.StartNew;
for i := 1 to FIterations do
Result := Code;
SW.Stop;
FReportSink(Name, SW.ElapsedSeconds / Iterations - FOverhead);
end;
It's minor, yes, but it does reduce some of the thinking about calculating the time interval and having to declare a start and stop local variable. Also, since it is a record that is allocated on the stack, there are no heap allocations and there is no need to do any cleanup. Although, I admit that it would be nice to have a type constructor in order to better handle the initialization of the IsHighResolution and Frequency class variables.
And now the implementation:
class function TStopwatch.Create: TStopwatch;
begin
InitStopwatchType;
Result.Reset;
end;
function TStopwatch.GetElapsedMilliseconds: Int64;
begin
Result := GetElapsedTicks div (Frequency div 1000);
end;
function TStopwatch.GetElapsedSeconds: Double;
begin
Result := ElapsedTicks / Frequency;
end;
function TStopwatch.GetElapsedTicks: Int64;
begin
Result := FElapsed;
if FRunning then
Result := Result + GetTimeStamp - FStartTimeStamp;
end;
class function TStopwatch.GetTimeStamp: Int64;
begin
if IsHighResolution then
QueryPerformanceCounter(Result)
else
Result := GetTickCount * 1000;
end;
class procedure TStopwatch.InitStopwatchType;
begin
if Frequency = 0 then
begin
IsHighResolution := QueryPerformanceFrequency(Frequency);
if not IsHighResolution then
Frequency := 1000;
end;
end;
procedure TStopwatch.Reset;
begin
FElapsed := 0;
FRunning := False;
FStartTimeStamp := 0;
end;
procedure TStopwatch.Start;
begin
if not FRunning then
begin
FStartTimeStamp := GetTimeStamp;
FRunning := True;
end;
end;
class function TStopwatch.StartNew: TStopwatch;
begin
InitStopwatchType;
Result.Reset;
Result.Start;
end;
procedure TStopwatch.Stop;
begin
if FRunning then
begin
FElapsed := FElapsed + GetTimeStamp - FStartTimeStamp;
FRunning := False;
end;
end;
Now here's a question I've been unable to get a definite answer to: When would QueryPerformanceCounter and QueryPerformanceFrequency ever fail on a standard Windows system? Does this only occur for things like WinCE, Embedded NT, or systems with strange HAL layers? Even though I've taken into account that, according to the documentation, it could fail, in practice when is that?
Just take a look also in fork/join Java API
ReplyDeleteParallelism with Fork/Join in Java 7
http://www.infoq.com/news/2008/03/fork_join
Concurrency JSR-166 Interest Site
http://gee.cs.oswego.edu/dl/concurrency-interest/
El Gi,
ReplyDeleteThanks for the links. Things in Java-land are looking very much like .NET and vice-versa.
Allen.
a favorite saying of mine too but it originated with Sir Isaac Newton in 1676:
ReplyDelete"If I have seen further it is by standing on the shoulders of giants."
(the best part was his humility..."If I have seen further"...)
great blog. thank you!
Mike,
ReplyDeleteI didn't meant to indicate that Anders originated that comment, but merely his use of the phrase.
And, it is a great comment because that is really what we all do. We always build on the knowledge obtained by those that have gone before us.
Allen.
Xepol,
ReplyDeleteIn the most technical sense, you're right. A CPU level bus lock is still a lock of sorts. In this case "lock" is referring to the heavy OS kernel-level locking primitives. How about "quick, lightweight, OS-lock free programming"?
:-)
Allen.
The Newton quote is quite fascinating, it was likely meant as an insult to a "competitor" of Newton, Robert Hook. (Newton wrote it a letter to Robert Hooke).
ReplyDeleteFrom Wikipedia:
"Historians generally think the above quote was an attack on Hooke (who was short and hunchbacked), rather than – or in addition to – a statement of modesty. The two were in a dispute over optical discoveries at the time. The latter interpretation also fits with many of his other disputes over his discoveries – such as the question of who discovered calculus as discussed above."
I did a similar thing like the TStopwatch using an interface IStopWatch and a global function returning such an interface. Internally this is implemented by a reference counted object. As an advanatage I see the encapsulation of the intrinsics and the availabliliy to use it even with COM. Do you see any drawbacks?
ReplyDeleteUwe,
ReplyDeleteOther than the added overhead of the reference counting and memory allocations, I can't think of any major drawbacks. If you've got something that works and serves your needs, I see no reason to abandon it.
Allen.
I don't know of any situation where the calls fail, but sometimes I see a big jump when using QueryPerformanceCounter, so I usually run tests more than once to rule out anomalies.
ReplyDeleteBruce, do you see the 'big jump' on a multi-cpu machine? That's a 'feature' AFAIK.
ReplyDeleteYou can fill your static class variables in the initialization section of the enclosing unit. Of course that will be run upon program start but a QueryPerformanceFrequency shouldn't matter ;)
ReplyDeleteDaniel,
ReplyDelete"You can fill your static class variables in the initialization section of the enclosing unit"
Of course, however I didn't want to implicitly reference the type just by having the unit in my uses clause. By only doing the work on demand, I can ensure that the smart-linker will only bring in the requisite code if I actually *used* the type. This is where I'd like to have class constructors for classes and records in Win32. They're only available in .NET right now.
Allen.
Bruce, Joe,
ReplyDeleteYou're probably seeing these "jumps" because the performance counters continue to run even when they're executing other threads/processes on the system. It is a single counter per-CPU. So it could be that your thread is running on a different core.
Allen.
@Allen (and Barry :) )
ReplyDeleteClass Constructors - I really would love those.
I would also like class constants in a similar fashion to class variables. The type can be explicit or implied.
TMyClass = class(TObject)
class const Key = 1; unique; // optional, compile-time enforce to ensure uniqueness
class const WorkPoolSize = 8;
end;
TMyClassSub = class(TMyClass)
class const Key = 2; override;
end;
Thanks, Allen. I'll try setting processor affinity for benchmarking code and see if that helps.
ReplyDeleteIn similar code of my own I hide the API behind a pair of function pointers:
ReplyDeletetype
TInt64Fn = function: Int64;
vars
GetTimerValue: TInt64Fn = NIL;
GetTimerFreq: TInt64Fn = NIL;
these are in a unit that has a little initialization code - if HPC is supported, these vars are hooked up directly to the HPC API functions. If not then they are hooked up to my own routines - one that returns the result of GetTickCount and the other that simply returns 1000, respectively.
The testing for HPC support is performed just once, automatically, and I simply write my code that uses whatever underlying timer is available using GetTimerValue and GetTimerFreq.
As for if/when HPC might not be supported... I have no idea but just as you, have always allowed for the possibility.
:)
General observation:
"Reinventing the wheel" is hopelessly over-used imho.
Nobody *EVER* re-invented a wheel.
What does happen - a lot - is that a new TYPE of wheel, and sometimes even a new axle and/or bearing, is required to be designed to suit the specific engineering problem to which a wheel provides the general solution.
To mix-in another metaphor:
The peg has already been invented - why re-invent the peg?
The answer is obvious if the peg is square and the hole you need to fit it in is round.
it was actually Sir Isac Newton who once said 3 things (all apparently related to this post)
ReplyDelete1. action -> reaction
2. i've seen further climbing on giants's shoulders
3. i've just taken a tea spoon from an ocean
good luck a!
d
One thing I have wondered is what practical uses multicore/multithreading can be put to within a single application.
ReplyDeleteMy list is very small.
Doesn't mean I'm not interested!
What failures of QueryPerformanceCounter is concerned, it has to do which underlying hardware clock Windows uses. In some cases when Windows decides to use the TSC, there can be problems on multicore systems especially on some types of AMD CPUs. Since on some CPUs each core has its own TSC, sometimes the TSC is read form one core sometimes form the other (and they may not be in sync depending on power saving states).
ReplyDeleteSee aslo here:
http://support.microsoft.com/?scid=kb%3Ben-us%3B895980&x=12&y=14
Ken:
ReplyDeleteI'm building a game engine that can run scripts. I run each script on its own instance of the script engine, each one inside its own thread, because there are certain scripts that can take a noticeable amount of time to complete, especially if they have to wait on input from the user.