Re: [tip: x86/asm] x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128}

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Now also with the patch attached.

Uros.

On Sun, Sep 17, 2023 at 8:31 PM Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> On Fri, Sep 15, 2023 at 6:45 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, 15 Sept 2023 at 04:25, tip-bot2 for Uros Bizjak
> > <tip-bot2@xxxxxxxxxxxxx> wrote:
> > >
> > > Several places in mm/slub.o improve from e.g.:
> > >
> > [...]
> > >
> > > to:
> > >
> > >     53bc:       48 8d 4a 40             lea    0x40(%rdx),%rcx
> > >     53c0:       49 8b 1c 07             mov    (%r15,%rax,1),%rbx
> > >     53c4:       4c 89 f8                mov    %r15,%rax
> > >     53c7:       48 8d 37                lea    (%rdi),%rsi
> > >     53ca:       e8 00 00 00 00          call   53cf <...>
> > >                         53cb: R_X86_64_PLT32     this_cpu_cmpxchg16b_emu-0x4
> > >     53cf:       75 bb                   jne    538c <...>
> >
> > Honestly, if y ou care deeply about this code sequence, I think you
> > should also move the "lea" out of the inline asm.
>
> I have to say that the above asm code was shown mostly as an example
> of the improvement, to illustrate how the compare sequence at the end
> of the cmpxchg loop gets eliminated. Being a fairly mechanical change,
> I didn't put much thought in the surrounding code.
>
> > Both
> >
> >     call this_cpu_cmpxchg16b_emu
> >
> > and
> >
> >     cmpxchg16b %gs:(%rsi)
> >
> > are 5 bytes, and I suspect it's easiest to just always put the address
> > in %rsi - whether you call the function or not.
> >
> > It doesn't really make the code generation for the non-call sequence
> > worse, and it gives the compiler more information (ie instead of
> > clobbering %rsi, the compiler knows what %rsi contains).
> >
> > IOW, something like this:
> >
> > -       asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call
> > this_cpu_cmpxchg16b_emu", \
> > +       asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",           \
> > ...
> > -                   "c" (new__.high)                                    \
> > -                 : "memory", "rsi");                                   \
> > +                   "c" (new__.high),                                   \
> > +                   "S" (&_var)                                   \
> > +                 : "memory");                                          \
> >
> > should do it.
>
> Yes, and the above change improves slub.o assembly from (current tip
> tree with try_cmpxchg patch applied):
>
>     53b3:    41 8b 44 24 28           mov    0x28(%r12),%eax
>     53b8:    49 8b 3c 24              mov    (%r12),%rdi
>     53bc:    48 8d 4a 40              lea    0x40(%rdx),%rcx
>     53c0:    49 8b 1c 07              mov    (%r15,%rax,1),%rbx
>     53c4:    4c 89 f8                 mov    %r15,%rax
>     53c7:    48 8d 37                 lea    (%rdi),%rsi
>     53ca:    e8 00 00 00 00           call   53cf <kmem_cache_alloc+0x9f>
>             53cb: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
>     53cf:    75 bb                    jne    538c <kmem_cache_alloc+0x5c>
>
> to:
>
>     53b3:    41 8b 44 24 28           mov    0x28(%r12),%eax
>     53b8:    49 8b 34 24              mov    (%r12),%rsi
>     53bc:    48 8d 4a 40              lea    0x40(%rdx),%rcx
>     53c0:    49 8b 1c 07              mov    (%r15,%rax,1),%rbx
>     53c4:    4c 89 f8                 mov    %r15,%rax
>     53c7:    e8 00 00 00 00           call   53cc <kmem_cache_alloc+0x9c>
>             53c8: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
>     53cc:    75 be                    jne    538c <kmem_cache_alloc+0x5c>
>
> where an effective reg-reg move "lea (%rdi), %rsi" at 537c gets
> removed. And indeed, GCC figures out that %rsi holds the address of
> the variable and emits:
>
>    5:    65 48 0f c7 0e           cmpxchg16b %gs:(%rsi)
>
> alternative replacement.
>
> Now, here comes the best part: We can get rid of the %P modifier. With
> named address spaces (__seg_gs), older GCCs had some problems with %P
> and emitted "%gs:foo" instead of foo, resulting in "Warning: segment
> override on `lea' is ineffectual" assembly warning. With the proposed
> change, we use:
>
> --cut here--
> int __seg_gs g;
>
> void foo (void)
> {
>   asm ("%0 %1" :: "m"(g), "S"(&g));
> }
> --cut here--
>
> and get the desired assembly:
>
>        movl    $g, %esi
>        %gs:g(%rip) %rsi
>
> The above is also in line with [1], where it is said that
> "[__seg_gs/__seg_fs] address spaces are not considered to be subspaces
> of the generic (flat) address space." So, cmpxchg16b_emu.S must use
> %gs to apply segment base offset, which it does.
>
> > Note that I think this is particularly true of the slub code, because
> > afaik, the slub code will *only* use the slow call-out.
> >
> > Why? Because if the CPU actually supports the cmpxchgb16 instruction,
> > then the slub code won't even take this path at all - it will do the
> > __CMPXCHG_DOUBLE path, which does an unconditional locked cmpxchg16b.
> >
> > Maybe I'm misreading it. And no, none of this matters. But since I saw
> > the patch fly by, and slub.o mentioned, I thought I'd point out how
> > silly this all is. It's optimizing a code-path that is basically never
> > taken, and when it *is* taken, it can be improved further, I think.
>
> True, but as mentioned above, the slub.o code was used to illustrate
> the effect of the patch. The new locking primitive should be usable in
> a general way and could be also used in other places.
>
> [1] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html#x86-Named-Address-Spaces
>
> Uros.
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index a87db6140fe2..331a9d4dce82 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -242,14 +242,15 @@ do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("leal %P[var], %%esi; call this_cpu_cmpxchg8b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
 		  : [var] "+m" (_var),					\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
-		    "c" (new__.high)					\
-		  : "memory", "esi");					\
+		    "c" (new__.high),					\
+		    "S" (&_var)						\
+		  : "memory");						\
 									\
 	old__.var;							\
 })
@@ -271,7 +272,7 @@ do {									\
 	old__.var = *_oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("leal %P[var], %%esi; call this_cpu_cmpxchg8b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
 		  CC_SET(z)						\
 		  : CC_OUT(z) (success),				\
@@ -279,8 +280,9 @@ do {									\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
-		    "c" (new__.high)					\
-		  : "memory", "esi");					\
+		    "c" (new__.high),					\
+		    "S" (&_var)						\
+		  : "memory");						\
 	if (unlikely(!success))						\
 		*_oval = old__.var;					\
 	likely(success);						\
@@ -309,14 +311,15 @@ do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
 		  : [var] "+m" (_var),					\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
-		    "c" (new__.high)					\
-		  : "memory", "rsi");					\
+		    "c" (new__.high),					\
+		    "S" (&_var)						\
+		  : "memory");						\
 									\
 	old__.var;							\
 })
@@ -338,7 +341,7 @@ do {									\
 	old__.var = *_oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
 		  CC_SET(z)						\
 		  : CC_OUT(z) (success),				\
@@ -346,8 +349,9 @@ do {									\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
-		    "c" (new__.high)					\
-		  : "memory", "rsi");					\
+		    "c" (new__.high),					\
+		    "S" (&_var)						\
+		  : "memory");						\
 	if (unlikely(!success))						\
 		*_oval = old__.var;					\
 	likely(success);						\

[Index of Archives]     [Linux Stable Commits]     [Linux Stable Kernel]     [Linux Kernel]     [Linux USB Devel]     [Linux Video &Media]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]

  Powered by Linux