Re: [tip: locking/core] locking/atomic/x86: Introduce arch_try_cmpxchg64_local()

Uros Bizjak <ubizjak@xxxxxxxxx> · Wed, 17 Apr 2024 18:24:21 +0200

On Mon, Apr 15, 2024 at 10:12 PM Borislav Petkov <bp@xxxxxxxxx> wrote:

> > locking/atomic/x86: Introduce arch_try_cmpxchg64_local()
> >
> > Introduce arch_try_cmpxchg64_local() for 64-bit and 32-bit targets
> > to improve code using cmpxchg64_local().  On 64-bit targets, the
> > generated assembly improves from:
> >
> >     3e28:     31 c0                   xor    %eax,%eax
> >     3e2a:     4d 0f b1 7d 00          cmpxchg %r15,0x0(%r13)
> >     3e2f:     48 85 c0                test   %rax,%rax
> >     3e32:     0f 85 9f 00 00 00       jne    3ed7 <...>
> >
> > to:
> >
> >     3e28:     31 c0                   xor    %eax,%eax
> >     3e2a:     4d 0f b1 7d 00          cmpxchg %r15,0x0(%r13)
> >     3e2f:     0f 85 9f 00 00 00       jne    3ed4 <...>
> >
> > where a TEST instruction after CMPXCHG is saved.  The improvements
> > for 32-bit targets are even more noticeable, because double-word
> > compare after CMPXCHG8B gets eliminated.
>
> Ok, maybe I'm missing the point here or maybe the commit message doesn't
> explain but how does this big diffstat justify one less insn?

We are dealing with locking primitives, probably the hottest part of
the kernel. For 64-bits, the patch is effectively a couple of lines,
reusing and extending existing macros, the line count for a modern
32-bit target is also a couple of lines, but there the saved insn
count is much higher, around 10 instructions.

What bothers you is the line count of cmpxchg8b emulation, necessary
to handle !CONFIG_X86_CMPXCHG64 targets. They are still alive, and
cmpxchg8b emulation code was recently improved (not by me!) to
correctly handle the ZF flag. The added code just builds on that.

> And no, 32-bit doesn't matter.

Really? Was this decision reached by the community consensus?

The linux kernel has many uses, and using it for servers by a big
company, you are the voice of, is only one part of the whole picture.
I'm sure that 32-bit is quite relevant for embedded users and more
than relevant to a student or an enthusiast in some remote part of the
world. As a maintainer, you should also take care of the communities
that are somehow neglected, where unilateral decisions like the one
above can have unwanted consequences.

> This looks like too crazy micro-optimization to me to be worth the 40
> insertions.

If the line count is the problem, I can easily parametrize new and
existing big macro descriptions in a follow-up patch. However, I was
advised to not mix everything together in one patch, but rest assured,
the creation and testing of the follow-up patch would take me less
time than writing the message you are reading.

> But I could be wrong and I'd gladly read explanations for why I am.

I would understand if someone asked you to write some new
functionality for a seemingly obsolete target. Everybody would
understand your rejection. But this improvement was written by me in
my time, and demonstrated some benefit to the existing code. The patch
is almost trivial and it took maybe 5 minutes of the maintainer's time
to approve it. It brings no future maintenance burden, but it perhaps
improves someone's life a tiny bit.

Last, but not least, I'm bringing some ideas from the compiler
development community, where the attitude to redundant instructions is
totally different. It could take weeks of effort and considerable
rewrite of compiler functionality just to remove one instruction ;)
Micro-optimizations add up!

Thanks for reading this,
Uros.