Friday, January 15, 2010

Mac OS Stack alignment – What you may need to know

While I let my little tirade continue to simmer, I figured many folks’ next question will be “Ok, so there may be something here that affects me. What do I need to do?” Let’s first qualify who this will affect. If you fall into the following categories, then read on:

  • I like to use the built-in assembler to produce some wicked-fast optimized functions (propeller beanie worn at all times…)
  • I have to maintain a library of functions that contain bits of assembler
  • I like to take apart my brand new gadgets just to see what makes them tick (Does my new 802.11n router use an ARM or MIPS CPU?)
  • My brain hasn’t been melted in a while and I thought this would be fun
  • I want to feel better about myself because I don’t really have to think about these kinds of things

Let’s start off with a simple example. Here’s an excerpt of code that lives down in the System.pas unit:

function _GetMem(Size: Integer): Pointer;
asm
TEST EAX,EAX
JLE @@negativeorzerosize
CALL MemoryManager.GetMem
TEST EAX,EAX
JZ @@getmemerror
REP RET // Optimization for branch prediction
@@getmemerror:
MOV AL,reOutOfMemory
JMP Error
@@negativeorzerosize:
XOR EAX, EAX
DB $F3
end;

Notice the CALL MemoryManager.GetMem instruction. Due to the nature of what that call does, we know that it is very likely that a system call could be made so we’re going to have to ensure the stack is aligned according to the Mac OS ABI. Here's that function again with the requisite changes:


function _GetMem(Size: Integer): Pointer;
asm
TEST EAX,EAX
JLE @@negativeorzerosize
{$IFDEF ALIGN_STACK}
SUB ESP, 12
{$ENDIF ALIGN_STACK}
CALL MemoryManager.GetMem
{$IFDEF ALIGN_STACK}
ADD ESP, 12
{$ENDIF ALIGN_STACK}
TEST EAX,EAX
JZ @@getmemerror
REP RET // Optimization for branch prediction
@@getmemerror:
MOV AL,reOutOfMemory
JMP Error
@@negativeorzerosize:
XOR EAX, EAX
DB $F3
end;

When compiling for the Mac OS, the compiler will define ALIGN_STACK so you know that this compile target requires stack alignment. So how did we come up with the value '12' in which to adjust the stack. If you remember from my previous article, we know that upon entry to the function, the value in ESP should be $xxxxxxxC. Couple that with the fact that up until the actual call, we've done nothing to change the value of ESP, we know where the stack is in the alignment. Since the stack always grows down in memory (toward lower value addresses), we need change ESP to $xxxxxxx0 by subtracting $C, which is 12 decimal. Now the call can be made and we'll know that upon entry to MemoryManager.GetMem, ESP, once again, will be $xxxxxxxC.


That was a relatively trivial example since there was only one call out to an function that may make a system call. Consider a case where MemoryManager.GetMem was just a black-box call and you had no clue what it would actually do. You cannot ever be certain that any given call will not lead to a system call, so the stack needs to be aligned to a known point just in case the system is called.



Another point I need to make here is that if the call goes across a shared library boundary,  even if the shared library is also written in Delphi, you will be making a system call the first time it is invoked. This is because all function imports, like Linux, are late-bound. Upon the first call to the external reference, it will go through the dynamic linker/loader that will resolve the address of the function and back-patch the call site so that the next call goes directly to the imported function.


What happens if the stack is misaligned? This is the most insidious thing about all this. There are only certain points where the stack is actually checked for proper alignment. The just mentioned case where you’re making a cross-library call is probably the most likely place this will be encountered. One of the first things the dynamic linker/loader does is to verify stack alignment. If the stack is not aligned properly, then a mach EXC_BAD_ACCESS exception is thrown (this is different than how exceptions are done in Windows, see Eli’s post related to exception handling). The problem is that the stack alignment could have been thrown off by one function, hundreds of calls back along the call chain! That is really “fun” to track down where it first got misaligned.


Suppose the function above now had a stack frame? What would the align value be in that case? The typical stack frame, assuming no stack-based local variables, would look like this:


function MyAsmFunction: Integer;
asm
PUSH EBP
MOV EBP,ESP
{ Body of function }
POP EBP
end;

In this case the stack pointer (ESP) will contain the value, $xxxxxxx8 which is 4 bytes for the return address and 4 bytes for the saved value of EBP. If no other stack changes are made, surrounding any CALL instruction, assuming you’re not pushing arguments onto the stack which we’ll cover in a moment, there would be a SUB ESP,8 and ADD ESP,8 instead of the previous 12.


Now, this is where it gets complicated, which clearly demonstrates why compilers are pretty good at this sort of thing. What if you wanted to call a function from assembler that expected all the arguments too be passed on the stack? Remember that at the call site (ie. just prior to the CALL instruction), the stack must be aligned to a 16 byte boundary and contain $xxxxxxx0. In this case you cannot simply push the parameters on the stack and then do the alignment. You must now align the stack before pushing parameters onto it knowing how the stack will be aligned after all the parameters are pushed. So if I need to push 2 DWORD parameters onto the stack and the current ESP value is $xxxxxxxC, you need to adjust the stack by 4 bytes (SUB ESP,4). ESP will now contain $xxxxxxx8. Then push the two parameters onto the stack which adjusts ESP to $xxxxxxx0, and we’ve satisfied the alignment criterion.


If the previous example had required 3 DWORDS, then no adjustment of the stack would be needed since after pushing 3 DWORDS(that’s 12 bytes), the stack would have been $xxxxxxx0, and we’re aligned. Likewise, if the above example had required 4 DWORD to be pushed, then now we’re literally “wasting” 12 extra bytes of stack. because 4 DWORDS is 16 bytes, that block of data will naturally align, so we have to start pushing the parameters on a 16 byte boundary. That means we’re back to adjusting the stack by the full 12 bytes, pushing 16 bytes onto the stack and then making the call. For a function call taking 16 bytes, we’re actually using 28 bytes of stack space instead of only 16! Add in stack-based local variables and you can see how complicated this can quickly get.


Remember, this is also happening behind the scenes within all your Delphi code. The compiler is constantly keeping track of how the stack is being modified as the code is generated. It then uses this information to know how to generate the proper SUB ESP,ADD ESP instructions. This could mean that code that was deeply recursive that worked fine on Windows, would now possibly blow out the stack on the Mac OS! Yes, this is admittedly a rare case since stacks tend to be fairly large (1MB or more), but it is still something to consider. Consider changing your recursive algorithm to iterative instead in order to keep the stack shallower and cleaner.


You should really consider whether or not your hand-coded assembler function needs to be in assembler and if it would work just as well if it were in Pascal. We’re evaluating this very thing, especially for functions that are not used as often or have been assembly merely due to historical reasons. Like you, we also understand that there is a clear benefit to having a nice hand-optimized bit of assembler. For instance, the Move() function in the System unit was painstakingly optimized by members of the FastCode project. Everyone clearly benefits from the optimization that function provided since it is heavily used throughout the RTL itself, but also by many, many users. Note here that the Move() function required no stack alignment changes since it makes no calls outside its own block of code, so it is just as fast and optimized as before. It runs unchanged on all (x82-32bit) platforms.