Thursday, January 14, 2010

It’s my stack frame, I don’t care about your stack frame!

I’m going to start off the new year with a rant, or to put it better, a tirade. When targeting a new platform, OS, or architecture, there will always be gotchas and unforeseen idiosyncrasies about that platform that you now have to account for. Sometimes they are minor little nits that don’t really matter. Other times they are of the “Holy crap! You have got to be kidding me!” kind. Then there are the “Huh!? What were they thinking!?” kind. For the Mac OS running on 32bit x86 hardware, which is what we’ll be supporting initially while we’re still getting our x64 compiler online, we encountered just that last sort of thing. I’m probably going to embarrass myself here. I’ve worked with the x86 CPU architecture for at least 25 years, so I would hope that I’ve got a pretty good handle on it. I also work with a bunch of really smart people who have the same, if not more, experience with the x86 and other RISC/CISC architectures. These are not just folks who have worked with those architectures, but folks who create compilers and tools for them, including back-ends that have to generate machine instructions. I don’t know, but I’d venture a guess that you have to have more than a cursory knowledge of a given CPU architecture to do that.

So what did we find that would prompt me to go on a tirade? It seems that the Mac OS ABI* requires that just prior to executing a CALL instruction, we must ensure that the stack is aligned to a 16-byte (quad DWORD) boundary. This means that prior to the call, ESP must be $xxxxxxx0, and because CALL pushes a 4-byte return address, the value in ESP when control is transferred to a given function is always going to be $xxxxxxxC. Note that neither Windows nor Linux has this requirement. OK? The next question is “Why!?” I, along with my colleagues here at Embarcadero, have yet to have this explained to our satisfaction, even by past Borland/CodeGear colleagues who now work at Apple, one of them on the Cocoa R&D team in the OS group! Our own Chris Bensen has even visited these friends for lunch at Apple and posed the question. Let’s examine several potential scenarios or explanations.
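To make the arithmetic concrete, here is a minimal Python sketch of what CALL does to ESP on 32bit x86. The function name and the sample stack address are made up for illustration; only the 4-byte return address push comes from the architecture itself.

```python
RETURN_ADDRESS_SIZE = 4  # CALL pushes a 32bit (DWORD) return address on x86-32

def esp_on_entry(esp_at_call_site: int) -> int:
    """Return the value of ESP seen by the callee right after CALL executes."""
    return (esp_at_call_site - RETURN_ADDRESS_SIZE) & 0xFFFFFFFF

# Hypothetical stack pointer that satisfies the Mac rule (low nibble is 0).
esp_at_call = 0xBFFFF7A0
assert esp_at_call % 16 == 0

entry = esp_on_entry(esp_at_call)
print(hex(entry))        # 0xbffff79c
print(hex(entry & 0xF))  # 0xc, i.e. $xxxxxxxC on entry
```

So under this rule the callee never actually starts out with a 16-byte aligned ESP; it starts out 4 bytes below one.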

By now you’re either thinking I’ve gone around the bend, or what is the big deal? Others are probably thinking, “Well that makes sense because modern CPUs work better if the stack is aligned like that.” Here are the various reasons we’ve come up with ourselves, along with the explanations we’ve been given. They all tend to be variations on a theme, but none have truly been satisfactory. Why burden every function in the system with this requirement for the sake of some (in the grand scheme) lesser used instructions?

“The Mac OS uses a lot of SSE instructions”

Yes, there are SSE instructions that do require that their memory operands be aligned on 16-byte boundaries. I know that. I also know that many CPU caches are 16 bytes wide. However, unless you are actually using an SSE instruction (and face it, most functions will probably never actually use SSE instructions), a 16-byte aligned stack buys you nothing. What I do know about alignments is that data of a given machine size (1, 2, 4, 8, 16 bytes) should always be aligned to its own natural boundary for maximum performance. This also ensures that a memory access doesn’t cross a cache line, which is certainly more expensive.
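The natural-alignment point can be checked with a few lines of Python. This is only a sketch: the helper name is hypothetical, and the 64-byte line size is an assumption for illustration (pass 16 or whatever your CPU uses).

```python
def crosses_cache_line(address: int, size: int, line_size: int = 64) -> bool:
    """True if the bytes [address, address + size) span two cache lines."""
    first_line = address // line_size
    last_line = (address + size - 1) // line_size
    return first_line != last_line

# A 4-byte value at a 4-byte boundary stays inside one line...
print(crosses_cache_line(60, 4))  # False
# ...but the same value at a misaligned address can straddle two lines.
print(crosses_cache_line(62, 4))  # True
```

As long as the data size is a power of two no larger than the line size, a naturally aligned access can never straddle a line, which is the whole argument for natural alignment in the first place.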

But why does my function have to make sure your stack is aligned? What I mean is that if a compiler (or even hand coded assembly) needs to have some local variable or parameter aligned on the stack, why doesn’t the target function ensure that? I refer you to the title of this post for my feeling on this one. If you need it aligned, then align it yourself.
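For what it’s worth, “align it yourself” is cheap. A callee that needs 16-byte aligned locals can save the old stack pointer in a frame register and then round ESP down, which on x86 is typically a single AND with -16. The Python below only models that rounding; the function name and addresses are illustrative, not taken from any particular compiler.

```python
def align_stack_down(esp: int, boundary: int = 16) -> int:
    """Model of `and esp, -boundary`: round ESP down to the next boundary.

    The stack grows downward, so rounding down only ever reserves
    extra space; it never steps on the caller's data.
    """
    return esp & ~(boundary - 1) & 0xFFFFFFFF

# Whatever ESP the caller hands over, the callee ends up aligned.
for entry_esp in (0xBFFFF79C, 0xBFFFF7A0, 0xBFFFF7A7):
    aligned = align_stack_down(entry_esp)
    print(hex(entry_esp), "->", hex(aligned), "wasted bytes:", entry_esp - aligned)
```

The cost is at most 15 wasted bytes plus the need to keep the original ESP around so it can be restored on exit, and only the functions that really need the alignment pay it.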

“The Mac OS intermixes 64bit and 32bit code all over the place”

I’ve heard this one a lot. Yes, x86-64 does have stricter data alignment requirements, but intermixing of the code? Does it? Really? Not within a given process, AFAIK. When you call into the OS kernel, the CPU mode is switched. Due to the design of 64bit capable CPUs, you cannot really execute 64bit code and 32bit code within the same process. And even if the kernel call did cause a mode switch and used the same stack, I, again, refer you to the title of this post. Admittedly, I don’t know all the gory details of how the Mac OS handles these 32bit<->64bit transitions. I would imagine that they would have to go through a call gate, since the ring has to change along with the “bitness” mode. This will also typically cause a switch to a different “kernel stack,” which would also copy a given number of items from the user’s stack. This is all part of the call descriptor.

“It simplifies the code gen”

I see. So having to inject extra CPU instructions at each call site to ensure that the stack is aligned properly is simpler!? You could argue that the compiler’s code generator has to keep track of the stack depth and offsets anyway, so this is minimal extra overhead. But that’s my point, why burden every function with this level of housekeeping when it is not necessary for the current function to operate correctly?
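Here is roughly the bookkeeping every call site now has to do, as a small Python sketch. The function and parameter names are hypothetical; the only assumptions are that the caller of the current function obeyed the rule (so ESP was 4 bytes below a 16-byte boundary on entry) and that everything pushed since then is tracked in bytes.

```python
def call_site_padding(bytes_used_since_entry: int, arg_bytes: int,
                      boundary: int = 16, return_address_size: int = 4) -> int:
    """Bytes of padding to reserve before pushing arguments so that
    ESP is a multiple of `boundary` when the CALL instruction executes.

    bytes_used_since_entry: saved registers plus locals below the
    return address of the current function.
    arg_bytes: total size of the arguments for the upcoming call.
    """
    depth = return_address_size + bytes_used_since_entry + arg_bytes
    return (-depth) % boundary

# Saved EBP (4) + 8 bytes of locals, calling a function with one DWORD argument:
print(call_site_padding(12, 4))  # 12
# Saved EBP (4) + 4 bytes of locals, same call: already lines up.
print(call_site_padding(8, 4))   # 0
```

A compiler can fold this into the frame size it already computes, but hand-written assembly has to get the number right at every single call.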

“Maybe it was because the Mac OS X initially targeted the PowerPC?”

When Mac OS X was first introduced, the whole Macintosh line of computers had already transitioned from the Motorola 680x0 line of CPUs to the PowerPC RISC CPU. When Apple decided to completely switch all its hardware over to the Intel x86 and x86-64 architectures, it is possible, and (unless I can find information to the contrary) indeed probable, that the insular nature of the Apple culture directly led to a vast misunderstanding of this aspect of the 32bit x86 architecture. That would mean a failure to actually look at other very successful Intel x86 operating systems, such as, oh.. I don’t know… Windows and Linux?

I guess the frustrating thing about all this is that 32bit x86 code generated for the Mac will have extra overhead that is clearly not necessary or even desired on other platforms, such as Windows or Linux. This is like requiring all my neighbors to keep my house clean. Sure, if your compiler is going to do some kind of global or profile guided optimization, you may want to do more stack manipulations throughout the application. But that is a highly specialized and rare case, and AFAIK, the tools on the platform (GCC, Objective-C) don’t do anything like that.

When we first found out about this requirement amid the dearth of documentation on these low-level details (I’m sorry, but Apple’s overall OS documentation pales in comparison to Windows’ or even Linux’s. Mac fans, flame on ;-), I posed a question to Stack Overflow, figuring that since there is such a wide range of experts out there, surely a clearly satisfactory explanation would be available in short order. I was wrong. That question has been up there for over 9 months and I still get up-votes periodically.

Even though there is one answer I selected, it doesn’t seem that the Mac ABI even adheres to the Intel doc! "It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation." It says, “upon function entry” yet this isn’t the case. The Mac ABI requires the stack to be aligned at the call site, and not on function entry! Remember that the CALL instruction will automatically push the return address onto the stack, which for x86-32 is 32 bits (a DWORD) in size. Why isn’t the stack merely prepared at the call site so that once the return address is pushed, the stack is aligned? This would mean that ESP would be $xxxxxxx4 at the call site and $xxxxxxx0 upon entry. It is also possible to interpret this statement to mean that the function prologue code is what is responsible for doing this alignment, and not necessarily the caller. This would clearly jibe with the title of this post.
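The two readings differ only in which side of the CALL the 16-byte boundary falls on. A short Python sketch (the function names are made up; only the low hex digit of ESP is modeled) puts them side by side:

```python
RETURN_ADDRESS_SIZE = 4
BOUNDARY = 16

def required_call_site_residue(align_on_entry: bool) -> int:
    """ESP mod 16 that the caller must establish just before CALL."""
    if align_on_entry:
        # The pushed return address has to land ESP exactly on a boundary.
        return RETURN_ADDRESS_SIZE % BOUNDARY
    return 0  # the Mac rule: aligned at the call site itself

def entry_residue(call_site_residue: int) -> int:
    """ESP mod 16 seen by the callee after CALL pushes the return address."""
    return (call_site_residue - RETURN_ADDRESS_SIZE) % BOUNDARY

for label, on_entry in (("aligned at call site", False),
                        ("aligned on entry", True)):
    site = required_call_site_residue(on_entry)
    print(f"{label}: call site $xxxxxxx{site:X}, entry $xxxxxxx{entry_residue(site):X}")
# aligned at call site: call site $xxxxxxx0, entry $xxxxxxxC
# aligned on entry: call site $xxxxxxx4, entry $xxxxxxx0
```

Either way the caller does the same amount of work; only the second variant matches the literal wording of the Intel text.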

So there you have it, a long rambling diatribe. Why does this even matter if the compiler just handles it? Because we’re having to go through all our optimized, hand-coded assembly code and make sure it keeps the stack properly aligned. It also means that all our customers out there who like to dabble in hand-coded assembler will now need to take this into account. Couple this with the re-emergence of Position Independent Code (PIC), and we’re having a jolly old time… Let the ensuing discussion begin… I’m really interested in knowing the actual reasons for this requirement… I mean, could the really smart OS and Tools folks at Apple have gotten this all wrong? I really have a hard time believing that, because you’d think that someone would have caught it. Yet, seemingly obvious software bugs sneak out all the time (yeah, yeah, we’re just as guilty here, you got me on that one!).

Pre-emptive snarky comment: Well, if you had a compiler that did better optimizations you wouldn’t have these problems! Thank you, Captain Oblivious! Our code generator is a trade-off between decent optimizations and very fast codegen. We only do the biggest-bang-for-the-buck optimizations. We’ve even added some optimizations in recent history, like function inlining.

*ABI – Application Binary Interface. This is different from an API (Application Programming Interface), which only defines the programming model of a group of provided functions. An ABI defines the raw mechanics of how an API is implemented. An API may define a calling convention that is used, whereas the ABI defines exactly how that convention is to actually work. This includes things like how and where the parameters are actually passed to the function. It also defines which CPU registers must be preserved, which ones are considered “working” registers and can be overwritten, and so on.