I’m going to start off the new year with a rant, or to put it better, a tirade. When targeting a new platform, OS, or architecture, there will always be gotchas and unforeseen idiosyncrasies about that platform that you now have to account for. Sometimes they are minor little nits that don’t really matter. Other times they are of the “Holy crap! You have got to be kidding me!” variety. Then there are the “Huh!? What were they thinking!?” kind. For the Mac OS running on 32bit x86 hardware, which is what we’ll be supporting initially while we’re still getting our x64 compiler online, we encountered just that sort of thing. I’m probably going to embarrass myself here. I’ve worked with the x86 CPU architecture for at least 25 years, so I would hope that I’ve got a pretty good handle on it. I also work with a bunch of really smart people that have the same, if not more, experience with the x86 and other RISC/CISC architectures. These are not just folks that have worked with those architectures, but folks who create compilers and tools for them, including back-ends that have to generate machine instructions. I’d venture a guess that doing that requires more than a cursory knowledge of the given CPU architecture.
So what did we find that would prompt me to go on a tirade? It seems that the Mac OS ABI* requires that just prior to executing a CALL instruction, we must ensure that the stack is aligned to a 16-byte (quad-DWORD) boundary. This means that when control is transferred to a given function, the value in ESP is always going to be $xxxxxxxC. Prior to the call, it must be $xxxxxxx0. Note that neither Windows nor Linux has this requirement. OK? The next question is “Why!?” Let’s examine several potential scenarios or explanations. I, along with my colleagues here at Embarcadero (and even past Borland/CodeGear colleagues that now literally work at Apple), have yet to have this explained to our satisfaction. We even know one who works at Apple on the Cocoa R&D team in the OS group! Our own Chris Bensen has even visited these friends for lunch at Apple and posed the question.
By now you’re either thinking I’ve gone around the bend, or wondering what the big deal is. Others are probably thinking, “Well, that makes sense, because modern CPUs work better if the stack is aligned like that.” Here are some of the reasons we’ve come up with ourselves, along with explanations we’ve been given. They all tend to be variations on a theme, but none have been truly satisfactory. Why burden every function in the system with this requirement for the sake of some (in the grand scheme) lesser-used instructions?
“The Mac OS uses a lot of SSE instructions”
Yes, there are SSE instructions that require all memory data to be aligned on 16-byte boundaries. I know that. I also know that many CPU cache lines are 16 bytes wide. However, unless a function actually uses an SSE instruction (and face it, most functions will probably never use one), that alignment buys it nothing. What I do know about alignment is that for a given machine data size (1, 2, 4, 8, or 16 bytes), data should always be aligned to its own natural boundary for maximum performance. This also ensures that a memory access doesn’t cross a cache line, which is certainly more expensive.
But why does my function have to make sure your stack is aligned? What I mean is: if a compiler (or even hand-coded assembly) needs some local variable or parameter aligned on the stack, why doesn’t the target function ensure that itself? I refer you to the title of this post for my feeling on this one. If you need it aligned, then align it yourself.
“The Mac OS intermixes 64bit and 32bit code all over the place”
I’ve heard this one a lot. Yes, x86-64 does have stricter data alignment requirements, but intermixing of the code? Does it? Really? Not within a given process, AFAIK. When you call into the OS kernel, the CPU mode is switched. Due to the design of 64bit-capable CPUs, you cannot really execute 64bit code and 32bit code within the same process. And even if the kernel call did cause a mode switch and used the same stack, I again refer you to the title of this post. Admittedly, I don’t know all the gory details of how the Mac OS handles these 32bit<->64bit transitions. I would imagine that they would have to go through a call gate, since the ring has to change along with the “bitness” mode. This will also typically cause a switch to a different “kernel stack,” which would also copy a given number of items from the user’s stack. This is all specified in the call gate descriptor.
“It simplifies the code gen”
I see. So having to inject extra CPU instructions at each call site to ensure that the stack is aligned properly is simpler!? You could argue that the compiler’s code generator has to keep track of the stack depth and offsets anyway, so this is minimal extra overhead. But that’s my point: why burden every function with this level of housekeeping when it is not necessary for the current function to operate correctly?
“Maybe it was because the Mac OS X initially targeted the PowerPC?”
When Mac OS X was first introduced, the whole Macintosh line of computers had already transitioned from the Motorola 680xx line of CPUs to the PowerPC RISC CPU. When Apple decided to completely switch all its hardware over to the Intel x86 and x86-64 architectures, it is possible (unless I can find information to the contrary), and indeed probable, that the insular nature of the Apple culture directly led to a vast misunderstanding of this aspect of the 32bit x86 architecture. A failure to actually look at other very successful Intel x86 operating systems, such as, oh… I don’t know… Windows and Linux?
I guess the frustrating thing about all this is that 32bit x86 code generated for the Mac will have extra overhead that is clearly not necessary or even desired on other platforms, such as Windows or Linux. This is like requiring all my neighbors to keep my house clean. Sure, if your compiler is going to do some kind of global or profile-guided optimization, you may want to do more stack manipulations throughout the application. But that is a highly specialized and rare case, and AFAIK, the tools on the platform (GCC, Objective-C) don’t do anything like that.
When we first found out about this requirement among the dearth of documentation on these low-level details (I’m sorry, but Apple’s overall OS documentation pales in comparison to Windows’ or even Linux’s. Mac fans, flame on ;-), I posed a question on Stack Overflow, figuring that since there is such a wide range of experts out there, surely a clearly satisfactory explanation would be available in short order. I was wrong. That question has been up there for over 9 months, and I still get up-votes periodically.
Even though there is one answer I selected, it doesn’t seem that the Mac ABI even adheres to the Intel doc! “It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation.” It says “upon function entry,” yet this isn’t the case. The Mac ABI requires the stack to be aligned at the call site, not on function entry! Remember that the CALL instruction will automatically push the return address onto the stack, which for x86-32 is 32 bits (one DWORD) in size. Why isn’t the stack merely prepared at the call site so that when the return address is pushed, the stack becomes aligned? That would mean ESP would be $xxxxxxx4 at the call site and $xxxxxxx0 upon entry. It is also possible to interpret this statement as saying that the function prologue code is responsible for doing this alignment, and not necessarily the caller. That would clearly jibe with the title of this post.
So there you have it, a long, rambling diatribe. Why does this even matter if the compiler just handles it? Because we’re having to go through all our optimized, hand-coded assembly code and make sure it keeps the stack properly aligned. It also means that all our customers out there who like to dabble in hand-coded assembler will now need to take this into account. Coupled with the re-emergence of Position Independent Code (PIC), we’re having a jolly old time… Let the ensuing discussion begin… I’m really interested in knowing the actual reasons for this requirement… I mean, could the really smart OS and tools folks at Apple have gotten this all wrong? I have a hard time believing that, because you’d think that someone would have caught it. Yet seemingly obvious software bugs sneak out all the time (yeah, yeah, we’re just as guilty here, you got me on that one!).
Pre-emptive snarky comment: Well, if you had a compiler that did better optimizations you wouldn’t have these problems! Thank you, Captain Oblivious! Our code generator is a trade-off between decent optimizations and very fast codegen. We only do the biggest-bang-for-the-buck optimizations. We’ve even added some optimizations in recent history, like function inlining.
*ABI – Application Binary Interface. This is different from an API (Application Programming Interface), which only defines the programming model of a group of provided functions. An ABI defines the raw mechanics of how an API is implemented. An API may define which calling convention is used, whereas the ABI defines exactly how that convention actually works. This includes things like how and where the parameters are actually passed to the function. It also defines which CPU registers must be preserved, which ones are considered “working” registers that can be overwritten, and so on.
Seems pretty straightforward to me. Mac OS X is a grab bag of every OS Apple could lay their hands on when it became clear they weren't qualified to write their own OS any longer. They shoehorned it all into a single box and smeared a wedge of legacy app APIs all over the top. How could it possibly EVER be sane? What amazes me is that the OS works at all some days. (apple fans -> you hate me, yes, we get it...)
I'm gonna ignore the whole hand-optimized assembler business as too complicated to cover briefly.
It's a shame the PowerPC processors had so many production problems that Apple got tired and ended up switching to the x86 CISC architecture. I read PowerPC programming manuals back then, and they had a much cleaner instruction set than the arcane, irregular x86 instructions. Better factories and production processes have beaten the best architecture.
Anyway, eagerly waiting for Delphi for the Mac.
Best Regards
So, will there be a VCL for Mac or Linux?
Just to get a better understanding of this:
ReplyDeleteIn what cases could the 16-byte CALL alignment be ignored? Would it be possible to do this alignment only when calling code NOT generated by your own compiler? (Exported functions need to be taken into account there of course). Or am I just too dumb to understand the full reach of your ranting? ;-)
I think it is a good idea to have a look at FreePascal. They have run a proper Mac compiler for a long time. The concepts are brilliant, I think.
Thanks Allen, I like this post and share your feelings with respect to this matter.
ReplyDeleteJust one thing, regarding your comment that an ESP=..0h alignment once _inside_ the function would make sense to help align local data for aligned SIMD access - it seems that this also isn't the case on the x64 systems, where RSP =..0h at the _call_ site, and RSP = ..8h once inside the function.
@Patrick - the alignment could be ignored only if the called function (AND any function called by[..] any function called by it) is known to definitely NOT assume that there is such an alignment (e.g. for aligned SIMD access) AND if the function is known to never ever go through a gatekeeper that enforces the alignment.
I'm sad to hear the OS/X support is causing you so much grief, and that it will only be 32 bit. At the Apple Developer conference I attended in 2008, they showed that something like 76% (a high value at any rate) of their users were running the latest OS release. With the low cost of their updates and, according to Apple, an average performance boost of 20%, I assume the trend has continued with Snow Leopard.
Since 10.6 is 64 bit, Embarcadero might be severely limiting the users of both the tool and any developed products by only supporting 32 bit apps. You might even get a few less tire kickers.
Here's hoping the 64 bit compiler is coming soon....
@Larry,
It is interesting to note that while 10.6 (Snow Leopard) is 64bit capable, it looks like for most Mac hardware you have to jump through some hoops to actually boot the 64bit kernel. I found this interesting article: http://lowendmac.com/musings/09mm/64-bit-snow-leopard.html#64
It seems to say that you may not be getting the "bitness" of the kernel you expected by default. However, it does apparently allow you to run a 64bit process even if the kernel is 32bit. Quite different from Windows in that regard.
@Allen:
I just read your new post on this topic and realised that there is indeed an extra overhead in case of functions without any stack arguments.
> Presumably because otherwise my function may have to jump
> through hoops to align the stack (with sub/and), as
> opposed to you merely having to adjust the size of the
> stack frame that you have to allocate anyway.
Again, I refer you to the title of my post. And what about functions that don't need a stack frame or parameters passed on the stack (i.e., register-based parameters)? This still makes these functions responsible for noodling around with alignment.
> a) If your function performs a call: "push" should not be
> used on Mac OS X to put parameters on the stack, but
> instead a parameter area should be reserved on function
> entry as part of the stack frame setup. The parameters
> can then be stored into this area using regular mov-style instructions, which can be faster on modern x86
> processors because unlike push they do not depend on each
> other in any way. Arguably, Linux and Windows systems
> could therefore also benefit from such a change.
This is one I was aware of, and yes, it is most likely faster. It does tend to make the alignment issue less prominent, but in and of itself it isn't a reason for requiring alignment to be done the way it currently is.
> Again, I refer you to the title of my post.
Well, there's obviously a trade-off/choice that has been made, which has both upsides and downsides.
> And what about functions that don’t need to have a
> stack frame, parameters passed on the stack (ie.
> register based parameters)?
Since the default Mac OS X i386 ABI does not use register parameters, I presume (with the stress on "presume") that this does not occur very often with the standard OS compilers. Even with register parameters, I'm actually not quite sure how often it happens that a non-leaf function has absolutely no stack frame at all (i.e., no saved non-volatile registers other than ebp, no local variables or other stack-allocated temps such as spilled registers, and no stack parameters passed to callees).
> This still makes these
> functions responsible for noodling around with alignment.
True, but I think these functions are a fairly small minority (although I'd have to measure to be sure), probably much smaller than the total number of functions that use SSE and require a 16 byte aligned stack in all Mac OS X frameworks combined (especially if you reduce the window to speed-critical routines).
Also, regarding your comment about the optimised move() procedure from the FastCode project in your other post: you might want to benchmark it against the default Mac OS X memcpy. That routine, just like memset and a couple of others, resides in kernel memory that's mapped into user space and these routines will always be a version that's optimised for the cpu in the current machine (and hence can also transparently improve when new kernels and/or machines are released).
> Also, regarding your comment about the optimised move()
> procedure from the FastCode project in your other post: you
> might want to benchmark it against the default Mac OS X
> memcpy. That routine, just like memset and a couple of
> others, resides in kernel memory that’s mapped into user
> space and these routines will always be a version that’s
> optimised for the cpu in the current machine (and hence can
> also transparently improve when new kernels and/or machines
> are released).
That is an excellent suggestion. I'll mention it to the team. The Mac OS is pretty new to many of us here, and little bits of information like this are mostly gleaned from other developers rather than the documentation.
Small correction: technically, you have to use memmove, not memcpy, since memcpy is not guaranteed to treat overlapping source and destinations correctly. In all versions of Mac OS X until now memmove and memcpy have mapped to the same code and hence both do the right thing under all circumstances, but that could of course change at any time.
ReplyDelete[...] 33 is necessary due to the ABI requirement that the stack must be aligned to a 16 byte boundary (I tend to agree with Allen on the stupidity of this [...]
I've found the link again regarding the use of stubs for PIC prologs to which I referred at the end of my comment 9 above: http://lists.apple.com/archives/perfoptimization-dev/2007/Nov/msg00005.html
great!! I solved the problem about the code thunk on Mac with this article. thx!!
What's an ABI?