Now also with the patch attached.

Uros.

On Sun, Sep 17, 2023 at 8:31 PM Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> On Fri, Sep 15, 2023 at 6:45 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, 15 Sept 2023 at 04:25, tip-bot2 for Uros Bizjak
> > <tip-bot2@xxxxxxxxxxxxx> wrote:
> > >
> > > Several places in mm/slub.o improve from e.g.:
> >
> > [...]
> >
> > > to:
> > >
> > >   53bc:       48 8d 4a 40             lea    0x40(%rdx),%rcx
> > >   53c0:       49 8b 1c 07             mov    (%r15,%rax,1),%rbx
> > >   53c4:       4c 89 f8                mov    %r15,%rax
> > >   53c7:       48 8d 37                lea    (%rdi),%rsi
> > >   53ca:       e8 00 00 00 00          call   53cf <...>
> > >                       53cb: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
> > >   53cf:       75 bb                   jne    538c <...>
> >
> > Honestly, if you care deeply about this code sequence, I think you
> > should also move the "lea" out of the inline asm.
>
> I have to say that the above asm code was shown mostly as an example
> of the improvement, to illustrate how the compare sequence at the end
> of the cmpxchg loop gets eliminated. Being a fairly mechanical change,
> I didn't put much thought into the surrounding code.
>
> > Both
> >
> >         call this_cpu_cmpxchg16b_emu
> >
> > and
> >
> >         cmpxchg16b %gs:(%rsi)
> >
> > are 5 bytes, and I suspect it's easiest to just always put the address
> > in %rsi - whether you call the function or not.
> >
> > It doesn't really make the code generation for the non-call sequence
> > worse, and it gives the compiler more information (ie instead of
> > clobbering %rsi, the compiler knows what %rsi contains).
> >
> > IOW, something like this:
> >
> > -       asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
> > +       asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
> > ...
> > -                     "c" (new__.high) \
> > -                     : "memory", "rsi"); \
> > +                     "c" (new__.high), \
> > +                     "S" (&_var) \
> > +                     : "memory"); \
> >
> > should do it.
>
> Yes, and the above change improves slub.o assembly from (current tip
> tree with try_cmpxchg patch applied):
>
>     53b3:       41 8b 44 24 28          mov    0x28(%r12),%eax
>     53b8:       49 8b 3c 24             mov    (%r12),%rdi
>     53bc:       48 8d 4a 40             lea    0x40(%rdx),%rcx
>     53c0:       49 8b 1c 07             mov    (%r15,%rax,1),%rbx
>     53c4:       4c 89 f8                mov    %r15,%rax
>     53c7:       48 8d 37                lea    (%rdi),%rsi
>     53ca:       e8 00 00 00 00          call   53cf <kmem_cache_alloc+0x9f>
>                         53cb: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
>     53cf:       75 bb                   jne    538c <kmem_cache_alloc+0x5c>
>
> to:
>
>     53b3:       41 8b 44 24 28          mov    0x28(%r12),%eax
>     53b8:       49 8b 34 24             mov    (%r12),%rsi
>     53bc:       48 8d 4a 40             lea    0x40(%rdx),%rcx
>     53c0:       49 8b 1c 07             mov    (%r15,%rax,1),%rbx
>     53c4:       4c 89 f8                mov    %r15,%rax
>     53c7:       e8 00 00 00 00          call   53cc <kmem_cache_alloc+0x9c>
>                         53c8: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
>     53cc:       75 be                   jne    538c <kmem_cache_alloc+0x5c>
>
> where an effective reg-reg move "lea (%rdi), %rsi" at 53c7 gets
> removed. And indeed, GCC figures out that %rsi holds the address of
> the variable and emits:
>
>    5:   65 48 0f c7 0e          cmpxchg16b %gs:(%rsi)
>
> as the alternative replacement.
>
> Now, here comes the best part: we can get rid of the %P modifier. With
> named address spaces (__seg_gs), older GCCs had some problems with %P
> and emitted "%gs:foo" instead of "foo", resulting in a "Warning: segment
> override on `lea' is ineffectual" assembly warning.
> With the proposed change, we use:
>
> --cut here--
> int __seg_gs g;
>
> void foo (void)
> {
>   asm ("%0 %1" :: "m"(g), "S"(&g));
> }
> --cut here--
>
> and get the desired assembly:
>
>         movl    $g, %esi
>         %gs:g(%rip) %rsi
>
> The above is also in line with [1], where it is said that
> "[__seg_gs/__seg_fs] address spaces are not considered to be subspaces
> of the generic (flat) address space." So, cmpxchg16b_emu.S must use
> %gs to apply the segment base offset, which it does.
>
> > Note that I think this is particularly true of the slub code, because
> > afaik, the slub code will *only* use the slow call-out.
> >
> > Why? Because if the CPU actually supports the cmpxchg16b instruction,
> > then the slub code won't even take this path at all - it will do the
> > __CMPXCHG_DOUBLE path, which does an unconditional locked cmpxchg16b.
> >
> > Maybe I'm misreading it. And no, none of this matters. But since I saw
> > the patch fly by, and slub.o mentioned, I thought I'd point out how
> > silly this all is. It's optimizing a code-path that is basically never
> > taken, and when it *is* taken, it can be improved further, I think.
>
> True, but as mentioned above, the slub.o code was used to illustrate
> the effect of the patch. The new locking primitive should be usable in
> a general way and could also be used in other places.
>
> [1] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html#x86-Named-Address-Spaces
>
> Uros.
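As a stand-alone illustration of the constraint trick discussed above (only a sketch, not kernel code: the function, its arguments and the memory-clobber choice here are made up, and the ALTERNATIVE/call branch that makes %rsi matter is omitted), the variable is passed both as a "+m" operand and, by address, through "S", so the compiler knows what %rsi holds instead of treating it as a blind clobber and is free to address the memory operand through it as well:

--cut here--
/* Sketch only: a CPU-local 64-bit try-cmpxchg on a variable in the
   __seg_gs address space.  Compiles with GCC on x86-64; actually
   running it would additionally require a valid %gs base. */
int try_set (long __seg_gs *v, long *old, long new)
{
  int ok;

  /* %0 is the memory operand (GCC prints it with its %gs: override),
     "S" pins its segment-relative address in %rsi, and the Z flag
     comes back through the flag-output constraint.  No lock prefix,
     since the access is CPU-local. */
  asm ("cmpxchg %3, %0"
       : "+m" (*v), "+a" (*old), "=@ccz" (ok)
       : "r" (new), "S" (v)
       : "memory");

  return ok;
}
--cut here--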
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index a87db6140fe2..331a9d4dce82 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -242,14 +242,15 @@ do {                                                    \
 	old__.var = _oval;                                              \
 	new__.var = _nval;                                              \
 	                                                                \
-	asm qual (ALTERNATIVE("leal %P[var], %%esi; call this_cpu_cmpxchg8b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",            \
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
 		  : [var] "+m" (_var),                                  \
 		    "+a" (old__.low),                                   \
 		    "+d" (old__.high)                                   \
 		  : "b" (new__.low),                                    \
-		    "c" (new__.high)                                    \
-		  : "memory", "esi");                                   \
+		    "c" (new__.high),                                   \
+		    "S" (&_var)                                         \
+		  : "memory");                                          \
 	                                                                \
 	old__.var;                                                      \
 })
@@ -271,7 +272,7 @@ do {                                                     \
 	old__.var = *_oval;                                             \
 	new__.var = _nval;                                              \
 	                                                                \
-	asm qual (ALTERNATIVE("leal %P[var], %%esi; call this_cpu_cmpxchg8b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",            \
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
 		  CC_SET(z)                                             \
 		  : CC_OUT(z) (success),                                \
@@ -279,8 +280,9 @@ do {                                                     \
 		    "+a" (old__.low),                                   \
 		    "+d" (old__.high)                                   \
 		  : "b" (new__.low),                                    \
-		    "c" (new__.high)                                    \
-		  : "memory", "esi");                                   \
+		    "c" (new__.high),                                   \
+		    "S" (&_var)                                         \
+		  : "memory");                                          \
 	if (unlikely(!success))                                         \
 		*_oval = old__.var;                                     \
 	likely(success);                                                \
@@ -309,14 +311,15 @@ do {                                                    \
 	old__.var = _oval;                                              \
 	new__.var = _nval;                                              \
 	                                                                \
-	asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",           \
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
 		  : [var] "+m" (_var),                                  \
 		    "+a" (old__.low),                                   \
 		    "+d" (old__.high)                                   \
 		  : "b" (new__.low),                                    \
-		    "c" (new__.high)                                    \
-		  : "memory", "rsi");                                   \
+		    "c" (new__.high),                                   \
+		    "S" (&_var)                                         \
+		  : "memory");                                          \
 	                                                                \
 	old__.var;                                                      \
 })
@@ -338,7 +341,7 @@ do {                                                     \
 	old__.var = *_oval;                                             \
 	new__.var = _nval;                                              \
 	                                                                \
-	asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",           \
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
 		  CC_SET(z)                                             \
 		  : CC_OUT(z) (success),                                \
@@ -346,8 +349,9 @@ do {                                                     \
 		    "+a" (old__.low),                                   \
 		    "+d" (old__.high)                                   \
 		  : "b" (new__.low),                                    \
-		    "c" (new__.high)                                    \
-		  : "memory", "rsi");                                   \
+		    "c" (new__.high),                                   \
+		    "S" (&_var)                                         \
+		  : "memory");                                          \
 	if (unlikely(!success))                                         \
 		*_oval = old__.var;                                     \
 	likely(success);                                                \
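As an aside on how the try variants above get used: because the Z flag is returned directly, the caller's retry loop needs no separate compare at the end. The same loop shape, written as a stand-alone userspace sketch with GCC's generic atomic built-ins rather than the kernel API (so it says nothing about the actual kernel callers; depending on -mcx16 the built-in becomes an inline cmpxchg16b or a libatomic call):

--cut here--
#include <stdbool.h>

__int128 counter;

void add_pair (unsigned long lo, unsigned long hi)
{
  __int128 old = __atomic_load_n (&counter, __ATOMIC_RELAXED);
  __int128 new;

  do
    {
      new = old + ((__int128) hi << 64) + lo;
    }
  /* On failure the built-in refreshes 'old' with the current value,
     just like the "*_oval = old__.var;" update in the macros above,
     so the loop simply recomputes and retries. */
  while (!__atomic_compare_exchange_n (&counter, &old, new,
				       false, __ATOMIC_RELAXED,
				       __ATOMIC_RELAXED));
}
--cut here--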