Wednesday, April 7, 2010

“Talk Amongst Yourselves” #3

So far we’ve had “Testing synchronization primitives” and “Writing a ‘self-monitoring’ thread-pool.” Let’s build on those topics, and discuss what to do with exceptions that occur within a scheduled work item within a thread pool.

My view is that exceptions should be caught and held for later inspection, or re-raised at some synchronization point. What do you think should happen to the exceptions? Should they silently disappear, tear-down the entire application, or should some mechanism be in place to allow the programmer to decide what to do with them?
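As a starting point for the discussion, here is a minimal sketch of the “caught and held for later inspection” option. It leans on the RTL's own TThread.FatalException property, which stores an exception that escapes Execute; the class name TFailingWorkItem and the single-thread setup are purely illustrative and stand in for a real pool.

```pascal
program HeldExceptionDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  {$IFDEF FPC}{$IFDEF UNIX}cthreads,{$ENDIF}{$ENDIF}
  SysUtils, Classes;

type
  // Hypothetical work item that always fails.
  TFailingWorkItem = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TFailingWorkItem.Execute;
begin
  raise Exception.Create('work item failed');
end;

var
  Worker: TFailingWorkItem;
  HeldMessage: string;
begin
  HeldMessage := '';
  Worker := TFailingWorkItem.Create(False);
  try
    Worker.WaitFor; // the synchronization point
    // The RTL caught the exception on the worker thread and parked it here.
    if Worker.FatalException is Exception then
      HeldMessage := Exception(Worker.FatalException).Message;
  finally
    Worker.Free; // the thread object owns and frees the held exception
  end;
  Writeln('Held for inspection: ', HeldMessage);
end.
```

The thread object owns the held exception, so this sketch only inspects it. Re-raising it at the synchronization point would need a copy or a transfer of ownership, which is part of what is up for discussion.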

Friday, March 26, 2010

Another installment of “Talk Amongst Yourselves”

Let’s start thinking about thread pools. How do you manage a general purpose thread pool in the face of not-so-well-written code? For instance, a task dispatched into the thread pool never returns, effectively locking that thread from ever being recycled. How do you monitor this? How long do you wait before spooling out a new thread? Do you keep a “monitor thread” that periodically checks if a thread has been running longer than some (tunable) value? What are the various techniques for addressing this problem?

So, there you go... Talk amongst yourselves.

This is the last day…

In this office. I’ve been in the same physical office for nearly 15 years. After years of accumulation, it now looks positively barren. Beginning next Monday, March 29th, 2010, I’ll be in a new building, new location, and new office. The good thing is that the new place is a mere stone’s throw from the current one. It will be great to leave all the Borland ghosts behind.

Monday, March 22, 2010

Simple question… very hard answer… Talk amongst yourselves…

I’m going to try a completely different approach to this post. I’ll post a question and simply let the discussion ensue. I would even encourage the discussion to spill over to the public newsgroups/forums. Question for today is:

How can you effectively unit-test synchronization primitives for correctness or, more generally, how would you test a concurrency library?

Let’s see how far we can get down this rabbit hole ;-).

Tuesday, February 23, 2010

A Happy Accident and a Silly Accident

By now you’re all aware that we’re getting ready to move to a new building here in Scotts Valley. This process is giving us a chance to clean out our offices and during all these archeological expeditions, some lost artifacts are being (re)discovered. Note the following:

These are some bookends that my father made for me within the first year after moving my family to California to work on the Turbo Pascal team. He made these at least two years before Delphi was released, and at least 6 months before we even began work on it in earnest. Certainly before the codename “Delphi” was ever thought of. I suppose they are my “happy” accident.
This next one is just sad. I received this award at the 2004 Borcon in San Jose from, then Borland President/CEO, Dale Fuller. My title at that time was “Principal Architect”… Of course I like to think that I have strong principles, and maybe that was what they were trying to say… Within a week or so after I got this plaque, another one arrived with the correct spelling of my title. I keep this one just for the sheer hilarity of it. Also, it is a big chunk of heavy marble, so maybe one day I can use it to create a small marble topped table…

Friday, February 19, 2010

What. The. Heck.

Is. This? I simply cannot explain this. At. All.

This was on a bulletin/white-board in the break area. I’d never noticed it because it was covered with photos from various sign-off (final authorization to release the product) celebrations. Lots of photos of both past and present co-workers, many thinner and with more hair ;-). Since we’re in the process of cleaning up in the preparation for moving to our new digs, it is interesting what you find… I presume this image has been on this whiteboard since… I guess… Delphi 5 or is that Delphi S? Either someone has a very odd sense of humor… or, more likely, beer had been involved during one of those sign-off celebrations from the photos. Then again, maybe this whiteboard had been in the Borland board room and this was from a corporate strategy meeting… nah, gotta be the beer.
Ow, my head hurts now…

Tuesday, February 16, 2010

A case when FreeAndNil is your enemy

It seems that my previous post about FreeAndNil sparked a little controversy. Some of you jumped right on board and flat-out agreed with my assertion. Others took a very defensive approach. Still others kept an “arms-length” view. Actually, the whole discussion in the comments was very enjoyable to read. There were some excellent cases on both sides. Whether or not you agreed with my assertion, it was very clear that an example of why I felt the need to make that post was in order.

I wanted to include an example in my first draft of the original post, but I felt that it would come across as too contrived. This time, instead of including some contrived hunk of code that only serves to cloud the issue at hand, I’m going to try a narrative approach and let the reader decide if this is something they need to consider. I may fall flat on my face with this, but I want to try and be as descriptive as I can without the code itself getting in the way. It’s an experiment. Since many of my readers are, presumably, Delphi or C++Builder developers and have some working knowledge of the VCL framework, I will try and present some of the problems and potential solutions in terms of the services that VCL provides.

To start off, the most common case I’ve seen where FreeAndNil can lead to strange behaviors or even memory leaks is when you have a component with an object reference field that is allocated “lazily.” What I mean is that you decide you don’t need to burn the memory this object takes up all the time, so you leave the field nil and don’t create the instance in the constructor. You rely on the fact that it is nil to know that you need to allocate it. This may seem like the perfect case where you should use FreeAndNil! That is in fact the very problem. There are cases where you should FreeAndNil in this scenario. The scenario I’m about to describe is not such a case.

If you recall from the previous post, I was specifically referring to using FreeAndNil in the destructor. This is where a very careful dance has to happen. A common scenario in VCL code is to hold references to other components from a given component. Because you are holding a reference, there is a built-in mechanism that allows you to coordinate the interactions between the components by knowing when a given component is being destroyed. There is the Notification virtual method you can override to know if the component being destroyed is the one to which you have a reference. The general pattern here is to simply nil out your reference.

The problem comes in when you decide that you need to grab some more information out of the object while it is in the throes of destruction. This is where things get dangerous. Just the act of referencing the instance can have dire consequences. Where this can actually cause a memory leak is if the field, property, or method accessed causes the object to lazily allocate that instance I just talked about above. What if the code to destroy that instance was already run in the destructor by the time the Notification method was called? Now you’ve just allocated an instance which has no way to be freed. It’s a leak. It’s also a case where a nil field will never actually cause a crash because you were sooo careful to check for nil and allocate the field if needed. You’ve traded a crash for a memory leak. I’ll let you decide whether or not that is right for your case. My opinion is that leak or crash, it is simply not good design to access an instance that is in the process of being destroyed.
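For readers who would still like to see the scenario in code, the following is a minimal sketch under stated assumptions, not code from any real component. TTarget, TWatcher, the Cache property and the LiveCaches counter are all hypothetical; only TComponent, FreeNotification, Notification and FreeAndNil are real RTL/VCL pieces.

```pascal
program LazyLeakDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  LiveCaches: Integer = 0; // crude allocation counter, for the demo only

type
  TTarget = class(TComponent) // hypothetical component with a lazy field
  private
    FCache: TStringList;
    function GetCache: TStringList;
  public
    destructor Destroy; override;
    property Cache: TStringList read GetCache;
  end;

  TWatcher = class(TComponent) // hypothetical component holding a reference
  private
    FTarget: TTarget;
  protected
    procedure Notification(AComponent: TComponent; Operation: TOperation); override;
  public
    procedure Watch(ATarget: TTarget);
  end;

function TTarget.GetCache: TStringList;
begin
  if FCache = nil then // nil means "not allocated yet"
  begin
    FCache := TStringList.Create;
    Inc(LiveCaches);
  end;
  Result := FCache;
end;

destructor TTarget.Destroy;
begin
  if FCache <> nil then
  begin
    Dec(LiveCaches);
    FreeAndNil(FCache); // the field now looks "not allocated yet" again
  end;
  inherited Destroy; // free notifications are sent from in here
end;

procedure TWatcher.Watch(ATarget: TTarget);
begin
  FTarget := ATarget;
  FTarget.FreeNotification(Self);
end;

procedure TWatcher.Notification(AComponent: TComponent; Operation: TOperation);
begin
  inherited Notification(AComponent, Operation);
  if (Operation = opRemove) and (AComponent = FTarget) then
  begin
    FTarget.Cache.Add('one last look'); // touches the dying object
    FTarget := nil;
  end;
end;

var
  Target: TTarget;
  Watcher: TWatcher;
begin
  Target := TTarget.Create(nil);
  Watcher := TWatcher.Create(nil);
  try
    Watcher.Watch(Target);
    Target.Cache.Add('normal use');
    Target.Free;
    Writeln('Caches still allocated: ', LiveCaches);
  finally
    Watcher.Free;
  end;
end.
```

Run as written, this should report one cache still allocated: the watcher's “one last look” re-created the lazily allocated list after the destructor had already freed it, and nothing is left to free the new one.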

“Oh, I never do that!” That’s probably true; however, what about the users of your component? Do they understand the internal workings of your component and know that accessing the instance while it is in the throes of destruction is bad? What if it “worked” in v1 of your component and v2 changed some of the internals? Do they even know that the instance is being destroyed? Luckily, VCL has provided a solution to this by way of the ComponentState. Before the destructor that starts the whole destruction process is called, the virtual method BeforeDestruction is called, which sets the csDestroying flag. This can now be used as a cue for any given component instance as to whether or not it is being destroyed.
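A minimal sketch of using that cue might look like the following. The TCarefulWatcher class and the SawDestroying variable are hypothetical names for illustration; csDestroying, ComponentState, Notification and FreeNotification are the real TComponent members.

```pascal
program DestroyingFlagDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

var
  SawDestroying: Boolean = False; // records what the watcher observed

type
  TCarefulWatcher = class(TComponent) // hypothetical watcher
  protected
    procedure Notification(AComponent: TComponent; Operation: TOperation); override;
  end;

procedure TCarefulWatcher.Notification(AComponent: TComponent; Operation: TOperation);
begin
  inherited Notification(AComponent, Operation);
  if Operation <> opRemove then
    Exit;
  if csDestroying in AComponent.ComponentState then
  begin
    // The component is on its way out: drop the reference, ask it nothing.
    SawDestroying := True;
    Writeln('Target is being destroyed: only dropping the reference');
  end
  else
    Writeln('Target is still alive: querying it would be safe');
end;

var
  Target: TComponent;
  Watcher: TCarefulWatcher;
begin
  Target := TComponent.Create(nil);
  Watcher := TCarefulWatcher.Create(nil);
  try
    Target.FreeNotification(Watcher); // ask Target to notify Watcher
    Target.Free;
  finally
    Watcher.Free;
  end;
end.
```

The only thing the watcher reads from the dying component is its ComponentState, which is the point of the flag.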

While my post indicting FreeAndNil as not being your friend may have come across as a blanket statement decrying its wanton use, I was clearly not articulating as well as I should have that blindly using FreeAndNil, without understanding the consequences of its effect on the system as a whole, is likely to bite you. My above example is but one case where you should be very careful about accessing an object in the process of destruction. My point was that using FreeAndNil can sometimes appear to solve the actual problem, when in fact it has merely traded it for another, more insidious, hard to find problem. A problem that doesn’t bite immediately.

Friday, February 5, 2010

A case against FreeAndNil

I really like the whole idea behind Stackoverflow. I regularly read and contribute where I can. However, I’ve seen a somewhat disturbing trend among a lot of the answers for Delphi related questions. Many questions ask (to the effect) “why does this destructor crash when I call it?” Invariably, someone would post an answer with the seemingly magical incantation of “You should use FreeAndNil to destroy all your embedded objects.” Then the one asking the question chooses that answer as the accepted one and posts a comment thanking them for their incredible insight.

The problem with that is that many seem to use FreeAndNil as some magic bullet that will slay that mysterious crash dragon. If using FreeAndNil() in the destructor seems to solve a crash or other memory corruption problems, then you should be digging deeper into the real cause. When I see this, the first question I ask is, why is the instance field being accessed after that instance was destroyed? That typically points to a design problem.

FreeAndNil itself isn’t the culprit here. There are plenty of cases where the use of FreeAndNil is appropriate. Mainly for those cases where one object uses internal objects, ephemerally. One common scenario is where you have a TWinControl component that wraps some external Windows control. Many times some control features can only be enabled/disabled by setting style bits during the creation of the handle. To change a feature like this, you have to destroy and recreate the handle. There may be some information that is stored down on the Windows control side which needs to be preserved. So you grab that information out of the handle prior to destroying and park that data in an object instance field. When the handle is then created again, the object can look at that field and if it is non-nil, it knows there was some cached or pre-loaded data available. This data is then read and pushed back out to the handle. Finally the instance can then be freed by FreeAndNil(). This way, when the destructor for the control runs you can simply use the normal “FCachedData.Free;” pattern since Free implies a nil check.
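The handle-recreation pattern described above can be sketched without any real window. Everything here is an assumption made for illustration: TFakeControl is a hypothetical stand-in for a TWinControl descendant, and FHandleItems merely pretends to be data living on the Windows control side.

```pascal
program CachedDataDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical stand-in for a control wrapper; no real handle is involved.
  TFakeControl = class
  private
    FHandleItems: TStringList; // pretends to be state held by the Windows control
    FCachedData: TStringList;  // non-nil only while the "handle" is gone
  public
    destructor Destroy; override;
    procedure CreateHandle;
    procedure DestroyHandle;
    procedure AddItem(const S: string);
    function ItemCount: Integer;
  end;

destructor TFakeControl.Destroy;
begin
  FHandleItems.Free;
  FCachedData.Free; // Free tolerates nil, so plain Free is enough here
  inherited Destroy;
end;

procedure TFakeControl.CreateHandle;
begin
  FHandleItems := TStringList.Create;
  if FCachedData <> nil then
  begin
    FHandleItems.Assign(FCachedData); // push the parked data back out
    FreeAndNil(FCachedData);          // nil again means "nothing cached"
  end;
end;

procedure TFakeControl.DestroyHandle;
begin
  FCachedData := TStringList.Create;
  FCachedData.Assign(FHandleItems);   // park the data before the handle dies
  FreeAndNil(FHandleItems);
end;

procedure TFakeControl.AddItem(const S: string);
begin
  FHandleItems.Add(S);
end;

function TFakeControl.ItemCount: Integer;
begin
  Result := FHandleItems.Count;
end;

var
  Control: TFakeControl;
  CountAfterRecreate: Integer;
begin
  Control := TFakeControl.Create;
  try
    Control.CreateHandle;
    Control.AddItem('alpha');
    Control.AddItem('beta');
    Control.DestroyHandle; // e.g. a style bit changed
    Control.CreateHandle;
    CountAfterRecreate := Control.ItemCount;
    Writeln('Items after recreating the handle: ', CountAfterRecreate); // 2
  finally
    Control.Free;
  end;
end.
```

Here the nil-ness of FCachedData carries meaning during the object's normal lifetime, which is why FreeAndNil fits in CreateHandle while the destructor gets by with plain Free.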

Of course there is no hard-and-fast rule that says you should not use FreeAndNil() in a destructor, but that little “fix” could be pointing out that some redesigning and refactoring may be in order.

Friday, January 29, 2010

There may be a silver lining after all

After having to deal with all the stack alignment issues surrounding our move to target the Mac OS, I’d started to fear that I would get more and more jaded and cynical about the idiosyncrasies of this new (to many of us) OS. I was pleased to hear from Eli that he’d found something that, at least to a compiler/RTL/system level software type person, renews my faith that someone at Apple on the OS team seems to be “on the ball.”

Apparently, the Mac OS handles dynamic libraries in a very sane and reasonable manner. Even if it is poorly (and that is an understatement) documented, there are at the very least some open-source portions of the OS that allow the actual code to be examined for how it really works (which is totally different than what any of the documentation says). At least in this regard Linux is the one that is way behind Mac OS and Windows.

Tuesday, January 26, 2010

Requiem for the {$STRINGCHECKS xx} directive…

It’s time. It’s time to say goodbye to the extra behind-the-scenes codegen and overhead that was brought to us during the Ansi->Unicode transition. We’ve shipped two versions with this directive on by default. The Ansi world is now behind us. Its only real purpose in life was to help C++Builder customers transition more easily to C++Builder 2009 and 2010. There are some rare cases where an event handler that was declared in a C++ form/datamodule with an AnsiString parameter *could* be called with the AnsiString parameter containing a UnicodeString payload. Since there was no way to detect this at runtime, the only way to guard against this case was to be resilient to it. Agree or not, that was what happened.

Monday, January 25, 2010

Divided and Confused

Odd discovery of the day. Execute the following on a system running a 32-bit version of Windows (NOT a Win64 system!):

program Project1;



    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);

Friday, January 15, 2010

Mac OS Stack alignment – What you may need to know

While I let my little tirade continue to simmer, I figured many folks’ next question will be “Ok, so there may be something here that affects me. What do I need to do?” Let’s first qualify who this will affect. If you fall into the following categories, then read on:

  • I like to use the built-in assembler to produce some wicked-fast optimized functions (propeller beanie worn at all times…)
  • I have to maintain a library of functions that contain bits of assembler
  • I like to take apart my brand new gadgets just to see what makes them tick (Does my new 802.11n router use an ARM or MIPS CPU?)
  • My brain hasn’t been melted in a while and I thought this would be fun
  • I want to feel better about myself because I don’t really have to think about these kinds of things

Let’s start off with a simple example. Here’s an excerpt of code that lives down in the System.pas unit:

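function _GetMem(Size: Integer): Pointer;
asm
        TEST    EAX,EAX
        JLE     @@negativeorzerosize
        CALL    MemoryManager.GetMem
        TEST    EAX,EAX
        JZ      @@getmemerror
        REP     RET // Optimization for branch prediction
@@getmemerror:
        MOV     AL,reOutOfMemory
        JMP     Error
@@negativeorzerosize:
        XOR     EAX,EAX
        DB      $F3 // REP prefix for the RET emitted at end
end;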

Notice the CALL MemoryManager.GetMem instruction. Due to the nature of what that call does, we know that it is very likely that a system call could be made so we’re going to have to ensure the stack is aligned according to the Mac OS ABI. Here's that function again with the requisite changes:

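function _GetMem(Size: Integer): Pointer;
asm
        TEST    EAX,EAX
        JLE     @@negativeorzerosize
{$IFDEF ALIGN_STACK}
        SUB     ESP, 12
{$ENDIF ALIGN_STACK}
        CALL    MemoryManager.GetMem
{$IFDEF ALIGN_STACK}
        ADD     ESP, 12
{$ENDIF ALIGN_STACK}
        TEST    EAX,EAX
        JZ      @@getmemerror
        REP     RET // Optimization for branch prediction
@@getmemerror:
        MOV     AL,reOutOfMemory
        JMP     Error
@@negativeorzerosize:
        XOR     EAX,EAX
        DB      $F3 // REP prefix for the RET emitted at end
end;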

When compiling for the Mac OS, the compiler will define ALIGN_STACK so you know that this compile target requires stack alignment. So how did we come up with the value '12' by which to adjust the stack? If you remember from my previous article, we know that upon entry to the function, the value in ESP should be $xxxxxxxC. Couple that with the fact that up until the actual call, we've done nothing to change the value of ESP, and we know where the stack is in the alignment. Since the stack always grows down in memory (toward lower value addresses), we need to change ESP to $xxxxxxx0 by subtracting $C, which is 12 decimal. Now the call can be made and we'll know that upon entry to MemoryManager.GetMem, ESP, once again, will be $xxxxxxxC.

That was a relatively trivial example since there was only one call out to a function that may make a system call. Consider a case where MemoryManager.GetMem was just a black-box call and you had no clue what it would actually do. You cannot ever be certain that any given call will not lead to a system call, so the stack needs to be aligned to a known point just in case the system is called.

Another point I need to make here is that if the call goes across a shared library boundary, even if the shared library is also written in Delphi, you will be making a system call the first time it is invoked. This is because all function imports, as on Linux, are late-bound. Upon the first call to the external reference, it will go through the dynamic linker/loader, which will resolve the address of the function and back-patch the call site so that the next call goes directly to the imported function.

What happens if the stack is misaligned? This is the most insidious thing about all this. There are only certain points where the stack is actually checked for proper alignment. The just mentioned case where you’re making a cross-library call is probably the most likely place this will be encountered. One of the first things the dynamic linker/loader does is to verify stack alignment. If the stack is not aligned properly, then a mach EXC_BAD_ACCESS exception is thrown (this is different than how exceptions are done in Windows, see Eli’s post related to exception handling). The problem is that the stack alignment could have been thrown off by one function, hundreds of calls back along the call chain! That is really “fun” to track down where it first got misaligned.

Suppose the function above now had a stack frame? What would the align value be in that case? The typical stack frame, assuming no stack-based local variables, would look like this:

function MyAsmFunction: Integer;
asm
        PUSH    EBP
        MOV     EBP,ESP
        { Body of function }
        POP     EBP
end;

In this case the stack pointer (ESP) will contain the value, $xxxxxxx8 which is 4 bytes for the return address and 4 bytes for the saved value of EBP. If no other stack changes are made, surrounding any CALL instruction, assuming you’re not pushing arguments onto the stack which we’ll cover in a moment, there would be a SUB ESP,8 and ADD ESP,8 instead of the previous 12.

Now, this is where it gets complicated, which clearly demonstrates why compilers are pretty good at this sort of thing. What if you wanted to call a function from assembler that expected all the arguments to be passed on the stack? Remember that at the call site (i.e. just prior to the CALL instruction), the stack must be aligned to a 16 byte boundary and contain $xxxxxxx0. In this case you cannot simply push the parameters on the stack and then do the alignment. You must now align the stack before pushing parameters onto it, knowing how the stack will be aligned after all the parameters are pushed. So if you need to push 2 DWORD parameters onto the stack and the current ESP value is $xxxxxxxC, you need to adjust the stack by 4 bytes (SUB ESP,4). ESP will now contain $xxxxxxx8. Then push the two parameters onto the stack, which adjusts ESP to $xxxxxxx0, and we’ve satisfied the alignment criterion.

If the previous example had required 3 DWORDs, then no adjustment of the stack would be needed since after pushing 3 DWORDs (that’s 12 bytes), the stack would have been $xxxxxxx0, and we’re aligned. Likewise, if the above example had required 4 DWORDs to be pushed, then now we’re literally “wasting” 12 extra bytes of stack. Because 4 DWORDs is 16 bytes, that block of data will naturally align, so we have to start pushing the parameters on a 16 byte boundary. That means we’re back to adjusting the stack by the full 12 bytes, pushing 16 bytes onto the stack and then making the call. For a function call taking 16 bytes, we’re actually using 28 bytes of stack space instead of only 16! Add in stack-based local variables and you can see how complicated this can quickly get.
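The arithmetic in these cases can be checked with a small sketch. CallSiteAdjust is a hypothetical helper written for this illustration (it is not part of the RTL or the compiler); it takes the low hex digit of ESP before the adjustment and the number of argument bytes about to be pushed, and returns the SUB ESP amount that leaves ESP at $xxxxxxx0 at the CALL.

```pascal
program StackAlignCalc;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: bytes to subtract from ESP before pushing ArgBytes
// of arguments so that ESP ends in 0 at the CALL instruction.
// ESPLowDigit is the low hex digit of ESP before the adjustment.
function CallSiteAdjust(ESPLowDigit, ArgBytes: Cardinal): Cardinal;
begin
  // Adding 16 first keeps the unsigned subtraction from wrapping below zero.
  Result := (ESPLowDigit + 16 - (ArgBytes and $F)) and $F;
end;

begin
  Writeln('No frame, no stack args:     ', CallSiteAdjust($C, 0));  // 12
  Writeln('EBP frame, no stack args:    ', CallSiteAdjust($8, 0));  // 8
  Writeln('No frame, 2 DWORDs pushed:   ', CallSiteAdjust($C, 8));  // 4
  Writeln('No frame, 3 DWORDs pushed:   ', CallSiteAdjust($C, 12)); // 0
  Writeln('No frame, 4 DWORDs pushed:   ', CallSiteAdjust($C, 16)); // 12
end.
```

The last line is the 28-byte case from the text: 12 bytes of padding plus 16 bytes of arguments.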

Remember, this is also happening behind the scenes within all your Delphi code. The compiler is constantly keeping track of how the stack is being modified as the code is generated. It then uses this information to know how to generate the proper SUB ESP,ADD ESP instructions. This could mean that code that was deeply recursive that worked fine on Windows, would now possibly blow out the stack on the Mac OS! Yes, this is admittedly a rare case since stacks tend to be fairly large (1MB or more), but it is still something to consider. Consider changing your recursive algorithm to iterative instead in order to keep the stack shallower and cleaner.

You should really consider whether or not your hand-coded assembler function needs to be in assembler and if it would work just as well if it were in Pascal. We’re evaluating this very thing, especially for functions that are not used as often or have been in assembly merely due to historical reasons. Like you, we also understand that there is a clear benefit to having a nice hand-optimized bit of assembler. For instance, the Move() function in the System unit was painstakingly optimized by members of the FastCode project. Everyone clearly benefits from the optimization that function provided since it is heavily used not only throughout the RTL itself, but also by many, many users. Note here that the Move() function required no stack alignment changes since it makes no calls outside its own block of code, so it is just as fast and optimized as before. It runs unchanged on all (x86-32bit) platforms.

Thursday, January 14, 2010

It’s my stack frame, I don’t care about your stack frame!

I’m going to start off the new year with a rant, or to put it better, a tirade. When targeting a new platform, OS, or architecture, there will always be gotchas and unforeseen idiosyncrasies about that platform that you now have to account for. Sometimes they are minor little nits that don’t really matter. Other times they can be of the “Holy crap! You have got to be kidding me!” kind. Then there are the “Huh!? What were they thinking!?” kind. For the Mac OS running on 32bit x86 hardware, which is what we’ll be supporting initially while we’re still getting our x64 compiler online, we encountered just that sort of thing. I’m probably going to embarrass myself here. I’ve worked with the x86 CPU architecture for at least 25 years, so I would hope that I’ve got a pretty good handle on it. I also work with a bunch of really smart people that have the same, if not more, experience with the x86 and other RISC/CISC architectures. These are not just folks that have worked with those architectures, but folks who create compilers and tools for them, including back-ends that have to generate machine instructions. I don’t know, but I’d venture a guess that you have to have more than a cursory knowledge of the given CPU architecture to do that.

So what did we find that would prompt me to go on a tirade? It seems that the MacOS ABI* requires that just prior to executing a CALL instruction, we must ensure that the stack is aligned to a 16byte (Quad DWORD) boundary. This means that when control is transferred to a given function, the value in ESP is always going to be $xxxxxxxC. Prior to the call, it must be $xxxxxxx0. Note that Windows or even Linux doesn’t have this requirement. OK? The next question is “Why!?” Let’s examine several potential scenarios or explanations. I, along with my colleagues here at Embarcadero (and even past Borland/CodeGear colleagues that now literally work at Apple) have yet to have this explained to our satisfaction. We even know one that even works at Apple on the Cocoa R&D team in the OS group! Our own Chris Bensen, has even visited these friends for lunch at Apple and posed the question.

By now you’re either thinking I’ve gone around the bend, or what is the big deal? Others are probably thinking, “Well that makes sense because modern CPUs work better if the stack is aligned like that.” Here are some various reasons we’ve both come up with ourselves and explanations we’ve been given. They all tend to be variations on a theme but none have truly been satisfactory. Why burden every function in the system to adhere to this requirement for some (in the grand scheme) lesser used instructions?

“The Mac OS uses a lot of SSE instructions”

Yes, there are SSE instructions that do require that all memory data be aligned on 16 byte boundaries. I know that. I also know that many CPU caches are 16 bytes wide. However, unless you are actually using an SSE instruction (and face it, most functions will probably never actually use SSE instructions), that alignment buys the function nothing. What I do know about alignments is that for a given machine data size (1, 2, 4, 8, 16 bytes), they should always be aligned to their own natural boundary for maximum performance. This also ensures that a memory access doesn’t cross a cache line, which is certainly more expensive.

But why does my function have to make sure your stack is aligned? What I mean is that if a compiler (or even hand coded assembly) needs to have some local variable or parameter aligned on the stack, why doesn’t the target function ensure that? I refer you to the title of this post for my feeling on this one. If you need it aligned, then align it yourself.

“The Mac OS intermixes 64bit and 32bit code all over the place”

I’ve heard this one a lot. Yes, x86-64 does have stricter data alignment requirements, but intermixing of the code? Does it? Really? Not within a given process, AFAIK. When you call into the OS kernel, the CPU mode is switched. Due to the design of 64bit capable CPUs, you cannot really execute 64bit code and 32bit code within the same process. And even if the kernel call did cause a mode switch and used the same stack, I again, refer you to the title of this post. Admittedly, I don’t know all the gory details of how the Mac OS handles these 32bit<->64bit transitions. I would imagine that they would have to go through a call gate since the ring has to change along with the “bitness” mode. This will also typically cause a switch to a different “kernel stack” which would also copy a given number of items from the user’s stack. This is all part of the call descriptor.

“It simplifies the code gen”

I see. So having to inject extra CPU instructions at each call site to ensure that the stack is aligned properly is simpler!? You could argue that the compiler’s code generator has to keep track of the stack depth and offsets anyway, so this is minimal extra overhead. But that’s my point, why burden every function with this level of housekeeping when it is not necessary for the current function to operate correctly?

“Maybe it was because the Mac OS X initially targeted the PowerPC?”

When Mac OS X was first introduced, the whole Macintosh line of computers had already transitioned from the Motorola 680xx line of CPUs to the PowerPC RISC CPU. When Apple decided to completely switch all its hardware over to the Intel x86 and x86-64 architectures, it is possible (and, unless I can find information to the contrary, indeed probable) that the insular nature of the Apple culture directly led to a vast misunderstanding of this aspect of the 32bit x86 architecture, and to a failure to actually look at other very successful Intel x86 operating systems and architectures, such as, oh.. I don’t know… Windows and Linux?

I guess the frustrating thing about all this is that 32bit x86 code generated for the Mac will have extra overhead that is clearly not necessary or even desired for other platforms, such as Windows or Linux. This is like requiring all my neighbors to keep my house clean. Sure, if your compiler is going to do some kind of global or profile guided optimization, you may want to do more stack manipulations throughout the application. But that is a highly specialized and rare case, and AFAIK, the tools on the platform don’t do anything like that (GCC, ObjectiveC).

When we first found out about this requirement among the dearth of documentation on these low-level details (I’m sorry, but Apple’s overall OS documentation pales in comparison to Windows’ or even Linux’s. Mac fans, flame on ;-), I posed a question to Stack Overflow figuring that since there is such a wide range of experts out there, that surely a clearly satisfactory explanation would be available in short order. I was wrong. That question has been up there for over 9 months and I still get up-votes periodically.

Even though there is one answer I selected, it doesn’t seem that the Mac ABI even adheres to the Intel doc! "It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation." It says, “upon function entry” yet this isn’t the case. The Mac ABI requires the stack to be aligned at the call site, and not on function entry! Remember that the CALL instruction will automatically push the return address onto the stack, which for x86-32 is 32bit (DWORD) sized. Why isn’t the stack merely prepared at the call site so that when the return address is pushed, the stack is now aligned? This would mean that ESP would be $xxxxxxx4 at the call site and $xxxxxxx0 upon entry. It is also possible to interpret this statement to mean that the function prologue code is what is responsible for doing this alignment, and not necessarily the caller. This would clearly jibe with the title of this post.

So there you have it, a long rambling diatribe. Why does this even matter if the compiler just handles it? Because we’re having to go through all our optimized, hand coded assembly code and make sure it keeps the stack properly aligned. It also means that all our customers out there that also like to dabble in hand coding assembler will now need to take this into account. With this coupled with the re-emergence of Position Independent Code (PIC), we’re having a jolly old time… Let the ensuing discussion begin… I’m really interested in knowing the actual reasons for this requirement… I mean could the really smart OS and Tools folks at Apple have gotten this all wrong? I really have a hard time believing that because you’d think that someone would have caught it. Yet, seemingly obvious software bugs sneak out all the time (yeah, yeah, we’re just as guilty here, you got me on that one!).

Pre-emptive snarky comment: Well, if you had a compiler that did better optimizations you wouldn’t have these problems! Thank you, Captain Oblivious! Our code generator is a trade-off between decent optimizations and very fast codegen. We only do the biggest-bang-for-the-buck optimizations. We’ve even added some optimizations in recent history, like function inlining.

*ABI – Application Binary Interface. This is different from an API (Application Programming Interface), which only defines the programming model of a group of provided functions. An ABI defines the raw mechanics of how an API is implemented. An API may define a calling convention that is used, whereas the ABI defines exactly how that convention is to actually work. This includes things like how and where are the parameters actually passed to the function. It also defines which CPU registers must be preserved, which ones are considered “working” registers and can be overwritten, and so on.