On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> > Hello, sorry that I forgot to add something to the changelog. In fact,
> > I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> >
> > The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> > CFI, we can make the padding space any value, such as 5-bytes, and
> > the layout will be like this:
> >
> > __align:
> >   nop
> >   nop
> >   nop
> >   nop
> >   nop
> > foo: -- __align +5
> >
> > However, the CFI will always make the cfi insn 16-bytes aligned. When
> > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> > like this:
> >
> > __cfi_foo:
> >   nop (11)
> >   mov $0x12345678, %reg
> >   nop (16)
> > foo:
> >
> > and the padding space is 32-bytes actually. So, we can just select
> > FUNCTION_ALIGNMENT_32B instead, which makes the padding
> > space 32-bytes too, and have the following layout:
> >
> > __cfi_foo:
> >   mov $0x12345678, %reg
> >   nop (27)
> > foo:
>
> *blink*, wtf is clang smoking.
>
> I mean, you're right, this is what it is doing, but that is somewhat
> unexpected. Let me go look at clang source, this is insane.

Bah, this is because assemblers are stupid :/

There is no way to tell them to have foo aligned such that there are at
least N bytes free before it.

So what kCFI ends up having to do is align the __cfi symbol to the
function alignment, and then stuff enough nops in to make the real
symbol meet alignment.

And the end result is utter insanity.

I mean, look at this:

  50:   2e e9 00 00 00 00               cs jmp 56 <__traceiter_initcall_level+0x46>
                        52: R_X86_64_PLT32      __x86_return_thunk-0x4
  56:   66 2e 0f 1f 84 00 00 00 00 00   cs nopw 0x0(%rax,%rax,1)

0000000000000060 <__cfi___probestub_initcall_level>:
  60:   90                              nop
  61:   90                              nop
  62:   90                              nop
  63:   90                              nop
  64:   90                              nop
  65:   90                              nop
  66:   90                              nop
  67:   90                              nop
  68:   90                              nop
  69:   90                              nop
  6a:   90                              nop
  6b:   b8 b1 fd 66 f9                  mov    $0xf966fdb1,%eax

0000000000000070 <__probestub_initcall_level>:
  70:   2e e9 00 00 00 00               cs jmp 76 <__probestub_initcall_level+0x6>
                        72: R_X86_64_PLT32      __x86_return_thunk-0x4

That's 21 bytes wasted, for no reason other than that asm doesn't have a
directive to say: get me a place that is M before N alignment.

Because ideally the whole above thing would look like:

  50:   2e e9 00 00 00 00               cs jmp 56 <__traceiter_initcall_level+0x46>
                        52: R_X86_64_PLT32      __x86_return_thunk-0x4
  56:   66 2e 0f 1f 84                  cs nopw (%rax,%rax,1)

000000000000005b <__cfi___probestub_initcall_level>:
  5b:   b8 b1 fd 66 f9                  mov    $0xf966fdb1,%eax

0000000000000060 <__probestub_initcall_level>:
  60:   2e e9 00 00 00 00               cs jmp 76 <__probestub_initcall_level+0x6>
                        72: R_X86_64_PLT32      __x86_return_thunk-0x4
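
For illustration, here is a minimal hand-written GNU as sketch of the first
(real) layout. It is a sketch only, assuming plain assembler directives
rather than whatever clang emits internally; the symbol names and the
0xf966fdb1 hash are reused from the objdump above, and a ret stands in for
the real function body:

        .text
        .p2align 4                      # the __cfi symbol gets the full 16-byte function alignment
__cfi___probestub_initcall_level:
        .skip 11, 0x90                  # 11 filler nops, the wasted bytes
        movl $0xf966fdb1, %eax          # 5-byte kCFI hash insn, ends exactly on the next 16-byte boundary
__probestub_initcall_level:
        ret                             # stand-in for the real body

The second (ideal) layout cannot be written this way: it would need a
directive meaning "align __probestub_initcall_level to 16 bytes while
keeping at least 5 bytes free in front of it", which, as noted above, the
assembler does not provide.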