On Fri, 15 Sept 2023 at 04:25, tip-bot2 for Uros Bizjak
<tip-bot2@xxxxxxxxxxxxx> wrote:
>
> Several places in mm/slub.o improve from e.g.:
>
> [...]
>
> to:
>
>     53bc:       48 8d 4a 40             lea    0x40(%rdx),%rcx
>     53c0:       49 8b 1c 07             mov    (%r15,%rax,1),%rbx
>     53c4:       4c 89 f8                mov    %r15,%rax
>     53c7:       48 8d 37                lea    (%rdi),%rsi
>     53ca:       e8 00 00 00 00          call   53cf <...>
>                 53cb: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
>     53cf:       75 bb                   jne    538c <...>

Honestly, if you care deeply about this code sequence, I think you
should also move the "lea" out of the inline asm.

Both

    call this_cpu_cmpxchg16b_emu

and

    cmpxchg16b %gs:(%rsi)

are 5 bytes, and I suspect it's easiest to just always put the address
in %rsi - whether you call the function or not.

It doesn't really make the code generation for the non-call sequence
worse, and it gives the compiler more information (ie instead of
clobbering %rsi, the compiler knows what %rsi contains).

IOW, something like this:

-       asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+       asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
...
-                 "c" (new__.high)                                      \
-               : "memory", "rsi");                                     \
+                 "c" (new__.high),                                     \
+                 "S" (&_var)                                           \
+               : "memory");                                            \

should do it.

Note that I think this is particularly true of the slub code, because
afaik, the slub code will *only* use the slow call-out.

Why? Because if the CPU actually supports the cmpxchg16b instruction,
then the slub code won't even take this path at all - it will do the
__CMPXCHG_DOUBLE path, which does an unconditional locked cmpxchg16b.

Maybe I'm misreading it.

And no, none of this matters.

But since I saw the patch fly by, and slub.o mentioned, I thought I'd
point out how silly this all is. It's optimizing a code-path that is
basically never taken, and when it *is* taken, it can be improved
further, I think.

               Linus
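
[Editor's sketch] For a fuller picture of the change the diff above is pointing
at, here is a minimal sketch of what such a macro could look like with the
address passed through the "S" constraint. This is illustrative only: the macro
name percpu_cmpxchg16b_sketch and the exact operand plumbing are assumptions,
not the real macro in the kernel's percpu headers; only ALTERNATIVE(),
X86_FEATURE_CX16 and this_cpu_cmpxchg16b_emu are real kernel names.

/*
 * Illustrative sketch only - percpu_cmpxchg16b_sketch, old__, new__ and
 * the exact operand list are placeholders, not the actual kernel macro.
 * The point: both alternatives are 5 bytes and both expect the variable
 * address in %rsi, which the compiler now supplies via the "S" input
 * instead of a leaq emitted inside the asm template.
 */
#define percpu_cmpxchg16b_sketch(qual, _var, _oval, _nval)		\
({									\
	union {								\
		u128 full;						\
		struct { u64 low, high; };				\
	} old__ = { .full = (_oval) }, new__ = { .full = (_nval) };	\
									\
	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
			      "cmpxchg16b %%gs:(%%rsi)",		\
			      X86_FEATURE_CX16)				\
		  : "+m" (_var),					\
		    "+a" (old__.low),					\
		    "+d" (old__.high)					\
		  : "b" (new__.low),					\
		    "c" (new__.high),					\
		    "S" (&(_var))	/* address in %rsi, no lea */	\
		  : "memory");						\
									\
	old__.full;							\
})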