Monday, October 10, 2011

More x64 assembler fun-facts–new assembler directives

The Windows x64 ABI (Application Binary Interface) presents some new challenges for assembly programming that don’t exist for x86. A couple of the changes that must be taken into account can can be seen as very positive. First of all, there is now one and only one OS specified calling convention. We certainly could have devised our own calling convention like in x86 where it is a register-based convention, however since the system calling convention was already register based, that would have been an unnecessary complication. The other significant change is that the stack must always remain aligned on 16 byte boundaries. This seems a little onerous at first, but I’ll explain how and why it’s necessary along how it can actually make calling other functions from assembly code more efficient and sometimes even faster than x86. For a detailed description of the calling convention, register usage and reservations, etc… please see this. Another thing that I’ll discuss is exceptions and why all of this is necessary.

For an given function there are three parts we’re going to talk about, the prolog, body, and epilog. The prologue and epilogue contain all the setup and tear-down of the function’s “frame”. The prolog is where all the space on the stack is reserved for local variables and, different from how the x86 compiler works, the space for the maximum number of parameter space needed for all the function calls within the body. The epilog does the reverse and releases the reserved stack space just prior to returning to the caller. The body of a function is where the user’s code is placed, either in Pascal, or as we’ll see this is where your assembler code you write will go.

You may be wondering why the prolog is reserving parameter space in addition to the space needed for local variables. Why not just push the parameters on the stack right before calling a function? While there is technically nothing keeping the compiler from placing parameters for a function call on the stack immediately before a call, this will have the effect of making the exception tables larger. As I mentioned above, exceptions in x64 are not implemented the same as in x86, which was a stack-based linked list of records. In x64, exceptions are done using extra data generated by the compiler that describes the stack changes for a given function and where the handlers/finally blocks are located. By only modifying the stack within the prolog and epilog, “unwinding” the stack is easier and more accurate. Another side benefit is that when passing stack parameters to functions, the space is already available so the data merely needs to be “MOV”ed onto the stack without the need for a PUSH. The stack also remains properly aligned, so no extra finagling of the RSP register is necessary.


Delphi for Windows 64bit introduced several new assembler directives or “pseudo-instructions”, .NOFRAME, .PARAMS, .PUSHNV, and .SAVENV. These directives allow you to control how the compiler sets up the context frame and ensures that the proper exception table information is generated.
Some functions never make calls to other functions. These are called “leaf” functions because the don’t do any further “branching” out to other functions, so like a tree, they represent the “leaf” For functions such as this, having a full stack frame may be extra overhead you want eliminate. While the compiler does try and eliminate the stack frame if it can, there are times that it simply cannot automatically figure this out. If you are certain a frame is unnecessary, you can use this directive as a hint to the compiler.
.PARAMS <max params>
This one may be a little confusing because it does not refer to the parameters passed into the current function, rather this directive should be placed near the top of the function (preferably before any actual CPU instructions) with a single ordinal parameter to tell the compiler what the maximum number of parameters will be needed for all the function calls within the body. This will allow the compiler to properly reserve extra, properly aligned, stack space for passing parameters to other functions. This number should reflect the maximum number of parameters for all functions and should include even those parameters that are passed in registers. If you’re going to call a function that takes 6 parameters, then you should use “.PARAMS 6”.

When you use the .PARAMS directive, a pseudo-variable @Params becomes available to simplify passing parameters to other functions. It’s fairly easy to load up a few registers and make a call, but the x64 calling convention also requires that callers reserve space on the stack even for register parameters. The .PARAMS directive ensures this is the case, so you should still use the .PARAMS directive even if you’re going to call a function in which all parameters are passed in registers. You use the @Params pseudo-variable as an array, where the first parameter is at index 0. You generally don’t actually use the first 4 array elements since those must be passed in registers, so you’ll start at parameter index 4. The default element size is the register size of 64bits, so if you want to pass a smaller value, you’ll need a cast or size override such as “DWORD PTR @Params[4]”, or “ @Params[4].Byte”. Using the @Params pseudo-variable will save the programmer from having to manually calculate the offsets based on alignments and local variables. UPDATE: I foobar’ed that one… The @Params[] array is an array of bytes, which allows you to address every byte of the parameters. Each parameter takes up 8 bytes (64bits), so you’ll need to scale accordingly to access each parameter. Casting or size overrides are still necessary. The above bad example should have been: “DWORD PTR @Params[4*8]” or “ @Params[4*8].Byte”. Sorry about that.
According to the x64 calling convention and register usage spec, there are some registers which are considered non-volatile. This means that certain registers are guaranteed to have the same value after a function call as it had before the function call. This doesn’t mean this register is not available for usage,  it just means the called function must ensure it is properly preserved and restored. The best place to preserve the value is on the stack, but that means space should be reserved for it. These directives provide both the function of ensuring the compiler includes space for the register in the generated prolog code and actually places the register’s value in that reserved location. It also ensures that the function epilog properly restores the register before cleaning up the local frame. .PUSHNV works with the 64bit general purpose registers RAX…R15 and .SAVENV works with the 128bit XMM0..XMM15 SSE2 registers. See the above link for a description of which registers are considered non-volatile. Even though you can specify any register, volatile or non-volatile as a parameter to these directives, only those registers which are actually non-volatile will be preserved. For instance, .PUSHNV R11 will assemble just fine, but no changes to the frame will be made. Whereas, .PUSHNV R12 will place a PUSH R12 instruction right after the PUSH RBP instruction in the prolog. The compiler will also continue to ensure that the stack remains aligned. Remember when I talked about why the stack must remain 16byte aligned? One key reason is that many SSE2 instructions which operate on 128bit memory entities require that the memory access be aligned on a 16byte boundary. Because the compiler ensures this is the case, the space reserved by the .SAVENV directive is guaranteed to be 16byte aligned.
Writing assembler code in the new x64 world can be daunting and frustrating due to the very strict requirements on stack alignment and exception meta-data. By using the above directives, you are signaling your intentions to the one thing that is pretty darn good at ensuring all those requirements are met; the compiler. You should always ensure the directives are placed at the top of the assembler function body before any actual CPU instructions. This makes sure the compiler has all the information and everything is already calculated for when it begins to see the actual CPU instructions and needs to know what the offset from RBP where that local variable is located. Also, by ensuring that all stack manipulations happen within the prolog and epilog, the system will be able to properly “unwind” the stack past a properly written assembler function. Without this data, the OS unwind process could become lost and at worst, skip exception handlers, or at worst call the wrong one and lead to further corruption. If the unwind process gets lost enough, the OS may simply kill the process without any warning, similar to what stack overflows do in 32bit (and 64bit).

No comments:

Post a Comment

Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.