While I let my little tirade continue to simmer, I figured many folks’ next question will be “Ok, so there may be something here that affects me. What do I need to do?” Let’s first qualify who this will affect. If you fall into the following categories, then read on:
- I like to use the built-in assembler to produce some wicked-fast optimized functions (propeller beanie worn at all times…)
- I have to maintain a library of functions that contain bits of assembler
- I like to take apart my brand new gadgets just to see what makes them tick (Does my new 802.11n router use an ARM or MIPS CPU?)
- My brain hasn’t been melted in a while and I thought this would be fun
- I want to feel better about myself because I don’t really have to think about these kinds of things
Let’s start off with a simple example. Here’s an excerpt of code that lives down in the System.pas unit:
function _GetMem(Size: Integer): Pointer;
asm
TEST EAX,EAX
JLE @@negativeorzerosize
CALL MemoryManager.GetMem
TEST EAX,EAX
JZ @@getmemerror
REP RET // Optimization for branch prediction
@@getmemerror:
MOV AL,reOutOfMemory
JMP Error
@@negativeorzerosize:
XOR EAX, EAX
DB $F3
end;
Notice the CALL MemoryManager.GetMem
instruction. Due to the nature of what that call does, we know that it is very likely that a system call could be made so we’re going to have to ensure the stack is aligned according to the Mac OS ABI. Here's that function again with the requisite changes:
function _GetMem(Size: Integer): Pointer;
asm
TEST EAX,EAX
JLE @@negativeorzerosize
{$IFDEF ALIGN_STACK}
SUB ESP, 12
{$ENDIF ALIGN_STACK}
CALL MemoryManager.GetMem
{$IFDEF ALIGN_STACK}
ADD ESP, 12
{$ENDIF ALIGN_STACK}
TEST EAX,EAX
JZ @@getmemerror
REP RET // Optimization for branch prediction
@@getmemerror:
MOV AL,reOutOfMemory
JMP Error
@@negativeorzerosize:
XOR EAX, EAX
DB $F3
end;
When compiling for the Mac OS, the compiler will define ALIGN_STACK so you know that this compile target requires stack alignment. So how did we come up with the value '12' in which to adjust the stack. If you remember from my previous article, we know that upon entry to the function, the value in ESP should be $xxxxxxxC. Couple that with the fact that up until the actual call, we've done nothing to change the value of ESP, we know where the stack is in the alignment. Since the stack always grows down in memory (toward lower value addresses), we need change ESP to $xxxxxxx0 by subtracting $C, which is 12 decimal. Now the call can be made and we'll know that upon entry to MemoryManager.GetMem, ESP, once again, will be $xxxxxxxC.
That was a relatively trivial example since there was only one call out to an function that may make a system call. Consider a case where MemoryManager.GetMem was just a black-box call and you had no clue what it would actually do. You cannot ever be certain that any given call will not lead to a system call, so the stack needs to be aligned to a known point just in case the system is called.
Another point I need to make here is that if the call goes across a shared library boundary, even if the shared library is also written in Delphi, you will be making a system call the first time it is invoked. This is because all function imports, like Linux, are late-bound. Upon the first call to the external reference, it will go through the dynamic linker/loader that will resolve the address of the function and back-patch the call site so that the next call goes directly to the imported function.
What happens if the stack is misaligned? This is the most insidious thing about all this. There are only certain points where the stack is actually checked for proper alignment. The just mentioned case where you’re making a cross-library call is probably the most likely place this will be encountered. One of the first things the dynamic linker/loader does is to verify stack alignment. If the stack is not aligned properly, then a mach EXC_BAD_ACCESS exception is thrown (this is different than how exceptions are done in Windows, see Eli’s post related to exception handling). The problem is that the stack alignment could have been thrown off by one function, hundreds of calls back along the call chain! That is really “fun” to track down where it first got misaligned.
Suppose the function above now had a stack frame? What would the align value be in that case? The typical stack frame, assuming no stack-based local variables, would look like this:
function MyAsmFunction: Integer;
asm
PUSH EBP
MOV EBP,ESP
{ Body of function }
POP EBP
end;
In this case the stack pointer (ESP) will contain the value, $xxxxxxx8 which is 4 bytes for the return address and 4 bytes for the saved value of EBP. If no other stack changes are made, surrounding any CALL instruction, assuming you’re not pushing arguments onto the stack which we’ll cover in a moment, there would be a SUB ESP,8 and ADD ESP,8 instead of the previous 12.
Now, this is where it gets complicated, which clearly demonstrates why compilers are pretty good at this sort of thing. What if you wanted to call a function from assembler that expected all the arguments too be passed on the stack? Remember that at the call site (ie. just prior to the CALL instruction), the stack must be aligned to a 16 byte boundary and contain $xxxxxxx0. In this case you cannot simply push the parameters on the stack and then do the alignment. You must now align the stack before pushing parameters onto it knowing how the stack will be aligned after all the parameters are pushed. So if I need to push 2 DWORD parameters onto the stack and the current ESP value is $xxxxxxxC, you need to adjust the stack by 4 bytes (SUB ESP,4). ESP will now contain $xxxxxxx8. Then push the two parameters onto the stack which adjusts ESP to $xxxxxxx0, and we’ve satisfied the alignment criterion.
If the previous example had required 3 DWORDS, then no adjustment of the stack would be needed since after pushing 3 DWORDS(that’s 12 bytes), the stack would have been $xxxxxxx0, and we’re aligned. Likewise, if the above example had required 4 DWORD to be pushed, then now we’re literally “wasting” 12 extra bytes of stack. because 4 DWORDS is 16 bytes, that block of data will naturally align, so we have to start pushing the parameters on a 16 byte boundary. That means we’re back to adjusting the stack by the full 12 bytes, pushing 16 bytes onto the stack and then making the call. For a function call taking 16 bytes, we’re actually using 28 bytes of stack space instead of only 16! Add in stack-based local variables and you can see how complicated this can quickly get.
Remember, this is also happening behind the scenes within all your Delphi code. The compiler is constantly keeping track of how the stack is being modified as the code is generated. It then uses this information to know how to generate the proper SUB ESP,ADD ESP instructions. This could mean that code that was deeply recursive that worked fine on Windows, would now possibly blow out the stack on the Mac OS! Yes, this is admittedly a rare case since stacks tend to be fairly large (1MB or more), but it is still something to consider. Consider changing your recursive algorithm to iterative instead in order to keep the stack shallower and cleaner.
You should really consider whether or not your hand-coded assembler function needs to be in assembler and if it would work just as well if it were in Pascal. We’re evaluating this very thing, especially for functions that are not used as often or have been assembly merely due to historical reasons. Like you, we also understand that there is a clear benefit to having a nice hand-optimized bit of assembler. For instance, the Move() function in the System unit was painstakingly optimized by members of the FastCode project. Everyone clearly benefits from the optimization that function provided since it is heavily used throughout the RTL itself, but also by many, many users. Note here that the Move() function required no stack alignment changes since it makes no calls outside its own block of code, so it is just as fast and optimized as before. It runs unchanged on all (x82-32bit) platforms.
A thought: Since the compiler knows about all this ewwy gooewy goodness, perhaps the heavy lifting can be left to it.
ReplyDeleteINstead of:
{$IFDEF ALIGN_STACK}
SUB ESP, 12
{$ENDIF ALIGN_STACK}
(and all the ugly hand counting that will ensue with EVERY minor chnage)
Perhaps a compiler keywork that works in ASM context like:
DoMacAlignStack and UndoMacAlignStack
The compiler can figure out what the stack pointer should be, and figure out the right amount to adjust it by. If the compile target does not need stack alignment, no machine code gets output at all and suddenly it becomes easier for EVERYONE to deal with the stack alignment issues, and a change someone doesn't result in breaking the living crap out of the system because someone forgot to change a constant somewhere and retest it.
I realize there are always some edge cases where someone will want to screw with the ESP and will cause this to not work (and they can always fall back on doing it the hardway), but it might just save you a huge pile of work and problems.
Do other compilers do this, for example Free Pascal? Did you check?
ReplyDeleteWhat happens if you don't align? Surely there will be no crash or anything like that?
> What happens if you don’t align? Surely there will be no
ReplyDelete> crash or anything like that?
It depends on whether or not the function you call does an alignment check. As I mentioned, if you make a call across a share object boundary and the stack is misaligned, you *will* get an error, which will translate into an exception. Worse case is that it *will* crash.
> Do other compilers do this, for example Free Pascal?
ReplyDeleteNo, it doesn't. Parsing user-written assembler code and trying to understand what exactly it does so you can always do the right thing is immensely hard to get right (if possible at all, even if you'd implement a complete x86 emulator in your compiler). And "getting it right most of the time" is not really an acceptable approach in a compiler.
@Jonas,
ReplyDeleteExactly right. Aside from simple jump distance optimizations, you should never muck around with the assembly semantics. When doing assembly, I expect it to be WYSIWYG.
An alternative could be to align the address by two assembly instructions: OR 0xF to ESP and then Sub 15, so no matter how misaligned it was, now the stack will be aligned, and this is still WYSIWYG. Obviously you need something to get the previous value (a register?!) so Allens approach is better (less instructions, less registers, etc.) for the optimized assembler code than this "general align stack now" code, but there you have to find the concrete value (4, 8, 12 or nothing?), which might get a nightmare when code is changed. Maybe you have a code with several paths that are only left in a few rare cases to the "outside": As in "Move", you don't have to care about alignment in most of your code, and you can add a few "align now" instructions around the call that will leave your code.
ReplyDelete> It runs unchanged on all (x82-32bit) platforms
ReplyDeleteImho, proper name for platform is x86- 32bit!)) Have fun. [From Siberia with love]
I like the idea of a new keyword:
ReplyDeleteasm
...
AUTOASM(ALIGN); { emit stack alignment code if any }
end;
"...
ReplyDeletePerhaps a compiler keywork that works in ASM context like:
DoMacAlignStack and UndoMacAlignStack
..."
compiler should inject this on every CALL, without we have to worry about. Just define a 'APPLE_MAC' constant on project, and everything should be handler behind the scene.