On Tue, Mar 4, 2025 at 2:16 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote: > > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote: > > > Hello, sorry that I forgot to add something to the changelog. In fact, > > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch. > > > > > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without > > > CFI, we can make the padding space any value, such as 5-bytes, and > > > the layout will be like this: > > > > > > __align: > > > nop > > > nop > > > nop > > > nop > > > nop > > > foo: -- __align +5 > > > > > > However, the CFI will always make the cfi insn 16-bytes aligned. When > > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be > > > like this: > > > > > > __cfi_foo: > > > nop (11) > > > mov $0x12345678, %reg > > > nop (16) > > > foo: > > > > > > and the padding space is 32-bytes actually. So, we can just select > > > FUNCTION_ALIGNMENT_32B instead, which makes the padding > > > space 32-bytes too, and have the following layout: > > > > > > __cfi_foo: > > > mov $0x12345678, %reg > > > nop (27) > > > foo: > > > > *blink*, wtf is clang smoking. > > > > I mean, you're right, this is what it is doing, but that is somewhat > > unexpected. Let me go look at clang source, this is insane. > > Bah, this is because assemblers are stupid :/ > > There is no way to tell them to have foo aligned such that there are at > least N bytes free before it. > > So what kCFI ends up having to do is align the __cfi symbol to the > function alignment, and then stuff enough nops in to make the real > symbol meet alignment. > > And the end result is utter insanity. > > I mean, look at this: > > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 > 56: 66 2e 0f 1f 84 00 00 00 00 00 cs nopw 0x0(%rax,%rax,1) > > 0000000000000060 <__cfi___probestub_initcall_level>: > 60: 90 nop > 61: 90 nop > 62: 90 nop > 63: 90 nop > 64: 90 nop > 65: 90 nop > 66: 90 nop > 67: 90 nop > 68: 90 nop > 69: 90 nop > 6a: 90 nop > 6b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax > > 0000000000000070 <__probestub_initcall_level>: > 70: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4 > > > That's 21 bytes wasted, for no reason other than that asm doesn't have a > directive to say: get me a place that is M before N alignment. > > Because ideally the whole above thing would look like: > > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 > 56: 66 2e 0f 1f 84 cs nopw (%rax,%rax,1) > > 000000000000005b <__cfi___probestub_initcall_level>: > 5b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax > > 0000000000000060 <__probestub_initcall_level>: > 60: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4 Hi, peter. Thank you for the testing, which is quite helpful to understand the whole thing. I was surprised at this too. Without CALL_PADDING, the cfi is nop(11) + mov; with CALL_PADDING, the cfi is mov + nop(11), which is weird, as it seems that we can select CALL_PADDING if CFI_CLANG to make things consistent. And I thought that it is designed to be this for some reasons :/ Hmm......so what should we do now? Accept and bear it, or do something different? I'm good at clang, so the solution that I can think of is how to bear it :/ According to my testing, the text size will increase: ~2.2% if we make FUNCTION_PADDING_BYTES 27 and select FUNCTION_ALIGNMENT_16B. ~3.5% if we make FUNCTION_PADDING_BYTES 27 and select FUNCTION_ALIGNMENT_32B. We don't have to select FUNCTION_ALIGNMENT_32B, so the worst case is to increase ~2.2%. What do you think? Thanks! Menglong Dong > > >