Eric published a simple test case to demonstrate what he was talking about. I took that test case and began to dig into it. First of all I looked into what Windows was doing in the critical section code down in the kernel. A quick spelunk with the CPU view was rather revealing. In this specific case, the critical section wasn’t ever actually blocking on a kernel-level object. In other words it was doing a busy-wait. Under the high-contention test case that Eric presented, the busy-wait was the magic that served to blow TMonitor out of the water! As a baseline here’s my result with Eric’s app unchanged:
Once I determined that a spin-lock was in play, it all became much clearer. TMonitor can (and always has been able to) optionally do a busy-wait before blocking. You must manually set the spin count on a specific TMonitor instance by calling TMonitor.SetSpinCount(Obj, nnnn). So, the first thing I did was to add the following to the FormCreate() event handler:
System.TMonitor.SetSpinCount(Self, 100000);
With that single change, suddenly things got a whole lot better:
We’re on the right track here… but not perfect. I experimented with various spin counts, but above 1000 there were diminishing returns. So, something else must be in play here. I decided to experiment with a raw spin-lock. Luckily, Delphi already has one: System.SyncObjs.TSpinLock, to be exact. Instead of using a TMonitor, I changed the code to use a TSpinLock. Whoa! Look at these results:
Now we’re gettin' somewhere! Hmmm… so what is the difference here? Setting the spin count on TMonitor should be roughly equivalent to the raw TSpinLock… so I looked at the code for each, and then it hit me! Exponential backoff! Duh! The TSpinLock implements an exponential backoff that will yield or sleep every X iterations.** TMonitor doesn’t implement an exponential backoff for its busy-wait implementation. What is interesting is that TMonitor does have an internal TSpinWait record that does implement exponential backoff. So I made a very small modification to TMonitor to now use the TSpinWait record within the TMonitor.Enter(Timeout: Cardinal); method in the System unit.
Added the following local var declaration:

var
  ...
  SpinWait: TSpinWait;

Added/Removed the following lines:

  ...
  if SpinCount > 0 then
  begin
    StartCount := GetTickCount;
    SpinWait.Reset;         // Added
    while SpinCount > 0 do
    ...
    ...
      YieldProcessor;       // Removed
      SpinWait.SpinCycle;   // Added
      Dec(SpinCount);
      // Keep trying until the spin count expires
  end;
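For readers who want to see the shape of the technique outside Delphi, here is a minimal C++ sketch of a spin lock driven by a backoff cycle. The class names, thresholds, and backoff steps are illustrative assumptions of mine, not the RTL's actual TSpinWait parameters:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Illustrative stand-in for Delphi's TSpinWait: mostly busy-spin, but
// periodically yield the time slice, and occasionally sleep outright.
// The thresholds (5 and 20) are made up for this sketch.
class SpinWait {
    unsigned count = 0;
public:
    void Reset() { count = 0; }
    void SpinCycle() {
        ++count;
        if (count % 20 == 0)
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // back off hard
        else if (count % 5 == 0)
            std::this_thread::yield(); // give up the rest of the quantum
        // otherwise: pure busy spin (a CPU pause hint would go here)
    }
};

// A lock that spins with the backoff instead of bare YieldProcessor calls.
class BackoffSpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        SpinWait sw; // freshly reset
        while (locked.test_and_set(std::memory_order_acquire))
            sw.SpinCycle();
    }
    void unlock() { locked.clear(std::memory_order_release); }
};
```

The difference from the pre-patch loop is exactly the point of the patch: each failed acquisition attempt escalates from spinning toward yielding and sleeping, instead of hammering YieldProcessor on every iteration.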
Now here’s the final result:
Hmm… I’d say that’s darn near the same. While I cannot say for certain, I imagine the difference in the graphic patterns could be attributable to differences in the exponential backoff algorithm and parameters. I also reduced the TMonitor spin count to 500 and got the same results. On my machine, anything over 500 didn’t make any difference. Below about 250, it started doing kernel-level blocks and performance dropped.
With just that one simple change, TMonitor is matching and even sometimes now exceeding the performance of the Windows critical section. On the whole, I’d declare them equivalent. This does, however, mean you will need to explicitly set the spin count on the TMonitor to some value. Normally the “right” value is a guess based on empirical testing for your specific use case. I suspect the Windows critical section code selects some value based on the number of cores and CPU speed. At some point, I'll need to research an algorithm that can automatically select a reasonable spin count.
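As a purely hypothetical illustration of such an algorithm (not anything the RTL or Windows actually does), a first guess might simply disable spinning on single-core machines, where the lock holder cannot make progress while we spin, and otherwise fall back to an empirically chosen default:

```cpp
#include <thread>

// Hypothetical first-guess heuristic for a default spin count. Spinning is
// wasted work on a single hardware thread, so return 0 there; otherwise use
// the empirical 500 from the measurements above.
unsigned DefaultSpinCount() {
    unsigned cores = std::thread::hardware_concurrency(); // may be 0 if unknown
    return cores > 1 ? 500u : 0u;
}
```

A smarter version might also scale with core count or calibrate against measured lock hold times, but even a crude guess like this beats a default of zero for contended locks.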
There is one thing I can say that TMonitor has over the Windows critical section… it works on all platforms (Windows, MacOS, iOS, and soon Android). Feel free to use whichever one you feel more comfortable using for your specific situation. Me? I will fully admit I’m biased…
*Yes, there is a field test going on and if you already have XE4 or purchase XE4, you become eligible to get access. You can go here to request access.
**This is not the same as blocking; it only forces the current thread to relinquish its current quantum (time slice) to allow another thread to run. This adds only thread-switching overhead, without the added kernel-object blocking overhead.
Nice to see the monitor getting some love!
TBH most of the annoyance with it comes from it being a "compulsory" overhead, yet not worthy of it. If it had been a peripheral class in some other units, I would have been happy to ignore it.
As for the comments, they're closed automatically after a few weeks to minimize the administrative task of spam filtering (I'm seeing some very clever spam bots these days; I'll probably post about them some day). Anyway, old posts don't get much traffic, so new comments on old posts have very low visibility; you're better off blogging (and I've edited the article to link to yours).
Great!
Now while TMonitor is getting some attention, could these QC reports be taken care of as well:
QC111795 Race-condition in TMonitor code causing application to hang on exit. Link: http://qc.embarcadero.com/wc/qcmain.aspx?d=111795
and the same issue:
QC99324 Process hangs on shutdown after using TMonitor to wait on an object that isn't freed. Link: http://qc.embarcadero.com/wc/qcmain.aspx?d=99324
Will that make it in the Android release?
Looks good, but I have to wonder. If setting up a spinlock improves performance so much, why make it so we have to configure it? Why not give TMonitor a spinlock by default? If the value has to be adjusted empirically, put a global variable in the System unit and initialize it to 500 (or whatever) and let the coders adjust it if necessary, but it looks like even a default first-guess approximation will be significantly better than the current default...
ReplyDeleteMason,
You have an excellent point and that is something to consider. I did mention the possibility of doing something like that. Your idea could be a reasonable compromise.
Eric
Most of the "overhead" only comes in once you start using an object's monitor. The TMonitor record is only allocated on demand; only 4 (8 on 64-bit) bytes are reserved in the instance. IMO, being cross-device/platform and intrinsic to an instance justifies those extra bytes.
Leif,
In all those cases, the application hasn't shut down cleanly because it leaked an object with an active monitor. That being said, there have been other recent changes pertaining to the Android release which have made it necessary to address this issue. The monitor support logic in System.SysUtils has been changed to wait only about 1ms for each "left-over" sync object to be cleaned up. After 1ms, the sync object is forcibly destroyed. Yes, this *can* lead to other shutdown failures.
Jeroen,
Yes, this will be in the next release. I also intentionally posted the description of the patch so that adventurous folks could apply it to their copy of System.pas and rebuild it. This patch should work at least back to XE, IIRC.
The overwhelming majority of objects won't ever need to be locked (and shouldn't), or make use of the services of the monitor; that's not just megabytes being wasted, but also overhead during object cleanup.
Granted, there are worse long-standing issues in that area (the horrid redundant initialization sequence for interfaced objects, or the RTTI-based cleanup for entities with managed fields, come to mind; they were justified in an era when executable sizes were measured in kilobytes), but it all adds up. And due to the bugs in TMonitor up until XE5 (assuming the other QCs pointed out are fixed and no other bugs are introduced), this was pure overhead with zero justification, since the bugs meant it couldn't be relied upon.
Also, I'm not convinced that for such aspects one shouldn't rely on the OS implementations, especially for cross-platform: there are too many possible evolutions to properly handle all thread-related peculiarities (emerging architectures like the hybrid ARM ones, hypervisor considerations for server-side execution, new CPU instructions, new micro-schedulers like in Win8, etc.). Your code is going to be outdated, while the OSes will evolve and adapt to the hardware.
It's a bit like early DOS apps that relied on loops and CPU frequency assumptions to measure time. There are some low-level aspects where you're better off taking a "hands off" approach IME and rely on what's provided (unless you have lots of time to pour into it, and lots of money to buy all the hardware variations to test on).
I also recommend a spin lock by default. In my source, I have a Locker wrapped around a Critical Section which I use for D5-DXE3 projects, and it has a default spin count of 3800.
This was derived from the MSDN docs:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686197(v=vs.85).aspx
"You can improve performance significantly by choosing a small spin count for a critical section of short duration. The heap manager uses a spin count of roughly 4000 for its per-heap critical sections. This gives great performance and scalability in almost all worst-case scenarios."
"The overwhelming majority of objects won’t ever need to be locked (shouldn’t) or make use of the services of the monitor, that’s not just megabytes being wasted, but also as an overhead during object cleanup."
While you're correct that most objects won't ever need to be locked, the notion that the memory overhead is "megabytes" is a bit hyperbolic, IMO. The "overhead" during object cleanup consists of a nil check.
"Granted, there are worse long-standing issues in that area (the horrid redundant initialization sequence for interfaced objects or the RTTI-based cleanup for entities with managed fields come to mind, they were justified in an era when executable sizes were measured in kilobytes), but it all adds up, and due to the bugs in TMonitor, up until XE5 (assuming the other QCs pointed are fixed and no other bugs are introduced), this was pure overhead with zero justification since the bugs meant it couldn’t be relied upon."
That's kinda contradictory, no? In the first paragraph you were complaining about the added 4 (8) byte overhead as being too much, then you proceed to advocate increasing the code size. Sure, one is code and the other is data, but they both take up process space. What bugs in TMonitor? I know of no "bugs" at this point. Most of the legitimate bugs were resolved a couple of releases ago. This latest change is more performance tuning than a bug fix. Also, this change only affects high-contention scenarios, for which I would also contend that other techniques should be relied upon rather than high-contention locks.
As for relying on an OS implementation, that is of course your choice. There are many folks who would rather use the cross-device/platform features to minimize IFDEFs or potential semantic differences.
"That’s kinda contradictory no? In the first paragraph you were complaining about the added 4(8) byte overhead as being too much, then you proceed to advocate for increasing the code size. Sure, one is the code and the other is data, but they all take up process space."
The 4(8) byte overhead is per object, whereas the hypothetical class cleanup overhead would be constant. Also, I'm not sure if that would actually turn into a real increase in code size.
If you replaced the vmtInitTable (which contains data referencing each managed member of a class instance) with a pointer to a compiler-generated method (which contains code referencing each managed member of a class instance) the total byte size would probably not be all that different, but the overhead from eliminating all those calls to FinalizeArray and FinalizeRecord would be very noticeable in some cases...
Argh. The *reduction in* overhead would be noticeable, I mean.
*grumbles at blog engines that don't allow edits*
That's not contradictory, but I didn't make one aspect explicit: the 4/8-byte overhead can't be avoided, while the initialization/finalization one can be, though it requires trickery. I've been using instance cloning (rather than constructors) and pooling to minimize them on key classes.
As for the overhead, the megabytes really isn't hyperbole. I tend to have lots (I mean lots) of simple class instances around, each with very limited state, that are used for their virtual methods (think variants, just with virtual methods, and thus infinitely extensible behaviors rather than static case..of jump tables). The most critical of these were "packed" to stay below the size thresholds of FastMM, which the Monitor field happily tripped.
In less specific cases, the overhead is still measurable in DWS, f.i. (where there is a hack to reuse the field to hold a reference count, but the location of the monitor field means memory is recovered at a slight performance tradeoff, so it's not applicable in every case).
About cross-platform, what prevents TCriticalSection from being cross-platform? Critical sections exist in all OSes. So no "ifdef" needed if the RTL takes care of that. Or do you mean that TMonitor is going to be the only cross-platform threading class provided in Delphi?
As for what qualifies as a bug, well, there are many nuances one can dance around, but the case you discuss in this article is a bug, as it makes TMonitor unfit for high-performance multi-threading purposes (and if you take out the "high performance", then you don't need a monitor in the first place, as any kind of bare-bones locking and queueing will not just be suitable, but easier to maintain as well).
When buying a Ferrari, if its engine can't take it over 50 mph, then that engine has a problem, even though it "works" since it got you to 49 mph.
I must admit I'm a bit surprised you had to debug the API call to find that critical sections spin. Heck it has even been documented on MSDN for quite some time[1].
Anyway, we only just upgraded to XE3 at work so I haven't really looked into this whole TMonitor thing. But what prevents you from simply making it a wrapper over the optimal native lock type? Say a POSIX mutex for Linux, OSSpinLock for OSX and a critical section for Windows, and something similarly suitable for iOS and Android?
That's what we use for our high contention "mutex wrapper" in a high-performance multi-platform C++ app.
[1]: http://msdn.microsoft.com/en-us/library/windows/desktop/ms682530.aspx
"I must admit I’m a bit surprised you had to debug the API call to find that critical sections spin. Heck it has even been documented on MSDN for quite some time[1]."
I am fully aware of the fact that critical sections support spin waits, however none of the documentation says that it is enabled by default and to what value. In fact, in the reference you cite, it talks about how you, the developer, are responsible for enabling the spin-wait. What I found was that this behavior now seems to be the default; at least on my version of Windows 7.
Ever since the TMonitor was introduced, it has always supported a spin-wait that the user could specify, just like the critical section. The key difference, and what I did find out, was the use of an exponential backoff algorithm to inject yields and sleeps while spinning, rather than just spinning.
While most OSes support their own notion of a critical section or mutex, I've found that their performance differs wildly between platforms and versions of said platforms. Additionally, TMonitor is more than a critical-section alternative; it is a full implementation of a "monitor" as described by C.A.R. Hoare and Per Brinch Hansen. A monitor is the combination of a mutual-exclusion lock and a condition variable. My post included a link to the Wikipedia article describing the data structure and its uses. In fact, the source comments in the System unit make it clear that TMonitor is intended to be such a data structure.
"however none of the documentation says that it is enabled by default and to what value"
Fair enough. Albeit I didn't know which value (I wasn't very concerned with this), I thought it was common knowledge that it spun by default since at least the NT4 days, and that the spin-lock part was what separated it from a regular mutex. It's quite possible I've been mistaken.
"While most OS’s support their own notion of a critical section or mutex, I’ve found that the performance of them differ wildly between platforms and versions of said platform."
Yes, that is our experience too. Which is why we did some benchmarks and found the optimal "lock primitive" to use on our target platforms. Though in our case we tend to have very short-lived critical sections, I guess it's less predictable in the monitor case.
Anyway, nice find about the exponential backoff.
I bumped into the TMonitor performance issue myself just a few weeks ago. One of our customers has been experiencing "non-responsive" delays of 20-60 seconds in his application, when some COM clients were simply disconnected (from this COM server application).
I eventually pinpointed the problem to be a TThreadList (or the TMonitor used internally by it), which was used to keep around 5000 objects. At disconnect these objects need to be removed one by one from the list to ensure the clean up is done correctly for each one. I had to change the TThreadList to a plain TList guarded with a TCriticalSection, which should speed up the disconnection.
I have not yet received confirmation from the customer that it works better, but it's good to get confirmation that there really is a problem of this magnitude with TMonitor - which TThreadList unfortunately uses internally, instead of TCriticalSection.
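The replacement described, a plain list guarded by a separate lock rather than TThreadList's internal TMonitor, follows TThreadList's own LockList/UnlockList pattern. A rough C++ sketch, with illustrative names:

```cpp
#include <mutex>
#include <vector>

// A list guarded by an external lock, mirroring TThreadList's LockList /
// UnlockList usage pattern, with a plain mutex standing in for TCriticalSection.
template <typename T>
class LockedList {
    std::mutex lock;      // stands in for the TCriticalSection
    std::vector<T> items; // stands in for the plain TList
public:
    std::vector<T>& LockList() { lock.lock(); return items; }
    void UnlockList()          { lock.unlock(); }
};
```

Holding the lock once across the whole removal sweep, rather than re-acquiring it for each of the 5000 objects, also cuts the per-element locking cost regardless of which primitive is used.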
FWIW, as far as the default spin count goes... it's an older article, but the default spin count was documented on MSDN to be 0, so the notion that it was 'common knowledge' to spin by default is definitely debatable.
http://msdn.microsoft.com/en-us/magazine/cc164040.aspx
"FWIW, as far as default spin count… it’s an older article, but the default spin count was documented on MSDN to be 0 so the concept that it was ‘common knowledge’ to spin by default is definitely debatable."
Thanks for the link. I've read that article and it certainly does say the default is 0.
As an aside, Matt Pietrek was one of the many people who interviewed me right before I accepted the position at Borland, many years ago. He used to work on the debugger kernel team. I specifically remember that interview because he had me do an interesting mental exercise of designing a CPU with varying supervisor levels (aka. "rings" in Intel x86 parlance). I suppose he liked my answer :)
Allen Bauer, I'm sorry, now an aside from me.
What do you think about QC103951 or the post https://forums.embarcadero.com/message.jspa?messageID=588015#588015
I guess I'll have to fire up NT4 in a VM to figure this one out :)
QC99324, now that we are at it.
(unfreed sync objects can hang on shutdown)
It is easy to workaround, but finding out what is going on is quite hard.
The LLVM coding standards tell us why XE5 Android apps are so large and therefore cause black screens at first start while unpacking:
Do not use RTTI or Exceptions
In an effort to reduce code and executable size, LLVM does not use RTTI (e.g. dynamic_cast<>) or exceptions. These two language features violate the general C++ principle of “you only pay for what you use”, causing executable bloat even if exceptions are never used in the code base, or if RTTI is never used for a class. Because of this, we turn them off globally in the code.
That said, LLVM does make extensive use of a hand-rolled form of RTTI that uses templates like isa<>, cast<>, and dyn_cast<>. This form of RTTI is opt-in and can be added to any class. It is also substantially more efficient than dynamic_cast<>.
http://llvm.org/docs/CodingStandards.html#do-not-use-rtti-or-exceptions
And in Delphi there was a way to disable RTTI (because actually only a few scenarios use RTTI at all)... but it doesn't work in XE5 Android anymore... please include an option to disable all RTTI and let the compiler compile only the used code...
“you only pay for what you use”
RTTI C.A.R.(every object has one more pointer) ARC.
all of these violate the principle.
(
And because of ARC, all the TList uses in classes.pas had to be replaced by TList<T>, and classes.pas on Windows had to bloat. It's grief.
)
I think they just built in the so-called NEW FEATURE.
They just use the simplest way but not the best way.