On Tue, 28 Feb 2023 20:39:30 +0000
Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:

> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
> > this change saves a compare after cmpxchg (and related move instruction in
> > front of cmpxchg).
>
> In my codegen, I am not seeing mov instruction before the cmp removed, how
> can that be? The rax has to be populated with a mov before cmpxchg right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Where as cmpxchg gives: mov, cmpxchg, mov, jne
>
> So yeah you got rid of compare, but I am not seeing reduction in moves.
> Either way, I think it is an improvement due to dropping cmp so:

Did you get the above backwards?

Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
Uros sent to me for the ring buffer, the code went from:

0000000000000070 <ring_buffer_record_off>:
  70:   48 8d 4f 08             lea    0x8(%rdi),%rcx
  74:   8b 57 08                mov    0x8(%rdi),%edx
  77:   89 d6                   mov    %edx,%esi
  79:   89 d0                   mov    %edx,%eax
  7b:   81 ce 00 00 10 00       or     $0x100000,%esi
  81:   f0 0f b1 31             lock cmpxchg %esi,(%rcx)
  85:   39 d0                   cmp    %edx,%eax
  87:   75 eb                   jne    74 <ring_buffer_record_off+0x4>
  89:   e9 00 00 00 00          jmp    8e <ring_buffer_record_off+0x1e>
                        8a: R_X86_64_PLT32      __x86_return_thunk-0x4
  8e:   66 90                   xchg   %ax,%ax

To

00000000000001a0 <ring_buffer_record_off>:
 1a0:   8b 47 08                mov    0x8(%rdi),%eax
 1a3:   48 8d 4f 08             lea    0x8(%rdi),%rcx
 1a7:   89 c2                   mov    %eax,%edx
 1a9:   81 ca 00 00 10 00       or     $0x100000,%edx
 1af:   f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
 1b4:   75 05                   jne    1bb <ring_buffer_record_off+0x1b>
 1b6:   e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
 1bb:   89 c2                   mov    %eax,%edx
 1bd:   81 ca 00 00 10 00       or     $0x100000,%edx
 1c3:   f0 0f b1 11             lock cmpxchg %edx,(%rcx)
 1c7:   75 f2                   jne    1bb <ring_buffer_record_off+0x1b>
 1c9:   e9 00 00 00 00          jmp    1ce <ring_buffer_record_off+0x2e>
                        1ca: R_X86_64_PLT32     __x86_return_thunk-0x4
 1ce:   66 90                   xchg   %ax,%ax

It does add a bit more code, but the fast path (where the cmpxchg succeeds)
seems better. That would be:

00000000000001a0 <ring_buffer_record_off>:
 1a0:   8b 47 08                mov    0x8(%rdi),%eax
 1a3:   48 8d 4f 08             lea    0x8(%rdi),%rcx
 1a7:   89 c2                   mov    %eax,%edx
 1a9:   81 ca 00 00 10 00       or     $0x100000,%edx
 1af:   f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
 1b4:   75 05                   jne    1bb <ring_buffer_record_off+0x1b>
 1b6:   e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4

where there are only two moves and no cmp, whereas the former has three moves
and a cmp in the fast path.

-- Steve
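For reference, the two loop shapes being compared can be sketched in
user-space C11 atomics. This is only an illustration, not the kernel code:
cmpxchg_sim() models the kernel's cmpxchg() return-old-value semantics, and
RECORD_OFF_BIT is a stand-in for the real flag bit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the 0x100000 flag bit in the listings above. */
#define RECORD_OFF_BIT 0x100000u

/*
 * Kernel-style cmpxchg(): returns the value that was in memory, so the
 * caller must compare the return value against 'old' itself -- that is
 * the extra cmp instruction in the first listing.
 */
static unsigned int cmpxchg_sim(atomic_uint *ptr, unsigned int old,
				unsigned int new)
{
	atomic_compare_exchange_strong(ptr, &old, new);
	return old;		/* value observed in memory */
}

/* Old pattern: reload, cmpxchg, then compare the returned value. */
static void record_off_cmpxchg_style(atomic_uint *flags)
{
	unsigned int val;

	do {
		val = atomic_load(flags);	/* reload on every retry */
	} while (cmpxchg_sim(flags, val, val | RECORD_OFF_BIT) != val);
}

/*
 * try_cmpxchg pattern: on failure the current memory value is written
 * back into 'val', so the loop needs no reload and no separate compare;
 * the branch can test ZF straight after cmpxchg, as in the second listing.
 */
static void record_off_try_style(atomic_uint *flags)
{
	unsigned int val = atomic_load(flags);

	while (!atomic_compare_exchange_weak(flags, &val,
					     val | RECORD_OFF_BIT))
		;	/* 'val' already holds the fresh value; just retry */
}
```

Both loops end with the flag bit set; the difference is only in the work
done per retry and in the success path.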