On Monday 27 August 2007, tim prince wrote:
> Mihai Donțu wrote:
> > ".p2align 4,,15" I said to myself: "good
> > to know" and did the necessary changes in my "*.S" files.
> > Indeed, what was before a nasty unaligned code, now it's nicely put at a
> > 16byte boundary. However, to my disappointment, this did not make the code
> > run faster :(. "Au contraire", it made it run slower. So why is gcc using it?
> > Or am I missing something?
> >
> > I've tested this on an AMD64 (Turion @ 2.2GHz) machine.
> >
> Did you check your object files, to see whether your linker has observed
> those alignments? Several years into the SSE era, gnu binutils for
> Windows was still configured so as to disable 16-byte alignment. I'm
> told it's still that way on Solaris.
> As the name indicates, the specific version you quote is designed for
> P-II. It won't have as remarkable an effect on other CPUs; I don't even
> know whether anyone has checked this out on Turion. In any case, it
> would normally show a significant gain only for the head of a frequently
> executed loop, and likely only in the case where it avoids an orphan
> partial instruction at the loop head.
> You didn't even say whether you are running in 64-bit mode, where there
> are more possibilities for orphans, such as where the first 2 bytes of
> an LCP instruction form an orphan. Depending on your specific
> combination of circumstances, you might be interested in trying
> variations, such as .p2align 4,,2.

Sorry for the delayed response. I've been extremely busy :(

So: I'm on a 64bit Gentoo GNU/Linux, stable, gcc 4.1.2 with the latest and
greatest binutils :) Since I use Gentoo, you can imagine I'm a speed freak :),
thus I'm using the following rule when building my files:

%.o: %.S
	gcc -c -g -pipe -ansi -std=gnu99 -W -Wall -Winline -Wdisabled-optimization \
	-Wmissing-prototypes -march=athlon64 -fPIC -DPIC -DNDEBUG -DNMMUNIT -D_REENTRANT \
	-D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -I. -O3 -fno-exceptions \
	-fomit-frame-pointer -o $@ $<

The assembler file I'm compiling contains 12 routines (stubs) which make the
'calling-convention-switch' (WIN64->x86_64) (just like NDIS Wrapper does). Yes,
this is the _new_shiny_thing_ these days :)

Now, these stubs (or as I call them: trampolines) induce a fair amount of delay
in the program execution, which is why I turned to '.p2align' (never ONCE did I
make the association between the name and Intel PII).

Why I need all this: I have a tool that loads a DLL, provides the basic needs
for it (a couple of kernel32.dll routines) and gives control to it. I use this
DLL to perform a certain type of analysis on some files (7447 of them to be
exact). This DLL calls HeapAlloc() and HeapFree() (along with some others)
frequently, via code like this (automatically generated at link time or when
GetProcAddress() is called):

    movq <pe_address>, %r10    /* the address of the routine that needs to be called */
    movq <stub_address>, %r11  /* the address of the trampoline */
    jmpq *%r11                 /* jump to the required trampoline */

A trampoline looks like this:

00000000000003f0 <x86_64pc5>:
 3f0:   48 89 7c 24 f8      mov    %rdi,0xfffffffffffffff8(%rsp)
 3f5:   48 89 74 24 f0      mov    %rsi,0xfffffffffffffff0(%rsp)
 3fa:   48 83 ec 10         sub    $0x10,%rsp
 3fe:   48 89 cf            mov    %rcx,%rdi
 401:   48 89 d6            mov    %rdx,%rsi
 404:   4c 89 c2            mov    %r8,%rdx
 407:   4c 89 c9            mov    %r9,%rcx
 40a:   4c 8b 44 24 38      mov    0x38(%rsp),%r8
 40f:   48 31 c0            xor    %rax,%rax
 412:   41 ff d2            callq  *%r10
 415:   48 83 c4 10         add    $0x10,%rsp
 419:   48 8b 7c 24 f8      mov    0xfffffffffffffff8(%rsp),%rdi
 41e:   48 8b 74 24 f0      mov    0xfffffffffffffff0(%rsp),%rsi
 423:   c3                  retq
 424:   66                  data16
 425:   66                  data16
 426:   66                  data16
 427:   90                  nop
 428:   66                  data16
 429:   66                  data16
 42a:   66                  data16
 42b:   90                  nop
 42c:   66                  data16
 42d:   66                  data16
 42e:   66                  data16
 42f:   90                  nop
 430:   /* next trampoline (x86_64pc6) */

Now, without '.p2align' the tool analyses all 7447 files in approx. 4:30 minutes.
With '.p2align 4,,15' it rises to approx. 4:50 minutes (not much, but I'm going
easy, with 790MB of files - when I'm done optimizing, this tool will "dive" into
tens of GB).

I'm looking at what gcc does and it seems to believe that '.p2align 4,,15' is
*the* alignment to use for *all* functions and some jump points (jump points
usually get '.p2align 4,,7').

I've tried '.p2align 4,,2': it is *slightly* faster than '.p2align 4,,15' but
not as fast as without '.p2align'.

Anyway, my question was: why is gcc so fond of '.p2align' (it uses it in *all*
situations), since it does not always generate fast code?

Other than that, gcc does a great job! ;)

-- 
Mihai Donțu
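
P.S. For anyone comparing the variants: as far as I understand the gas
documentation, the third argument of '.p2align' is the maximum number of bytes
the directive may skip, and if reaching the boundary would need more than that,
no padding is emitted at all. Here is a small Python sketch of that rule,
applied to the address right after 'retq' in the listing (0x424). The helper
name p2align_padding is made up for illustration, it is not part of any tool.

```python
def p2align_padding(addr, power, max_skip=None):
    """Bytes of padding '.p2align power,,max_skip' would insert at addr.

    Hypothetical helper for illustration only. It mirrors the rule as
    documented for the GNU assembler: align to 2**power, but skip the
    alignment entirely if it would need more than max_skip bytes.
    """
    boundary = 1 << power
    needed = -addr % boundary  # distance to the next multiple of boundary
    if max_skip is not None and needed > max_skip:
        return 0  # alignment is not done at all
    return needed


if __name__ == "__main__":
    end_of_stub = 0x424  # first byte after 'retq' in the listing
    for max_skip in (15, 7, 2):
        pad = p2align_padding(end_of_stub, 4, max_skip)
        print(".p2align 4,,%d at %#x: %d padding bytes"
              % (max_skip, end_of_stub, pad))
```

If I read the rule right, for this particular stub '.p2align 4,,15' inserts the
12 filler bytes seen between 0x424 and 0x430, while '.p2align 4,,7' and
'.p2align 4,,2' would insert none, because 12 is more than they are allowed to
skip.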