Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

Jonathan Wakely <jwakely.gcc@xxxxxxxxx> · Tue, 4 Jul 2023 11:23:46 +0100

On Tue, 4 Jul 2023 at 10:47, Peter Zijlstra wrote:
>
> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
>
> >   int x = 0;
> >   int y = 0;
> >   int r0, r1;
> >
> >   int dummy;
> >
> >   void t0(void)
> >   {
> >           __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
> >
> >           __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
> >           __atomic_thread_fence(__ATOMIC_SEQ_CST);
> >
> >           r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
> >   }
> >
> >   void t1(void)
> >   {
> >           __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
> >           __atomic_thread_fence(__ATOMIC_SEQ_CST);
> >           r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
> >   }
> >
> >   // BUG_ON(r0 == 0 && r1 == 0)
> >
> > On x86-64 (gcc 13.1 -O2) we get:
> >
> >   t0():
> >           movl    $1, x(%rip)
> >           movl    $1, %eax
> >           xchgl   dummy(%rip), %eax
> >           lock orq $0, (%rsp)       ;; Redundant with previous exchange.
> >           movl    y(%rip), %eax
> >           movl    %eax, r0(%rip)
> >           ret
> >   t1():
> >           movl    $1, y(%rip)
> >           lock orq $0, (%rsp)
> >           movl    x(%rip), %eax
> >           movl    %eax, r1(%rip)
> >           ret
>
> So I would expect the compilers to do better here. It should know those
> __atomic_thread_fence() thingies are superfluous and simply not emit
> them. This could even be done as a peephole pass later, where it sees
> consecutive atomic ops and the second being a no-op.

Right, I don't see why we need a whole set of new built-ins that say
"this fence isn't needed if the adjacent atomic op already implies a
fence". If the adjacent atomic op already implies a fence for a given
ISA, then the compiler should already be able to elide the explicit
fence.

So just write your code with the explicit fence, and rely on the
compiler to optimize it properly. Admittedly, today's compilers don't
do that optimization well, but they also don't support your proposed
built-ins, so you're going to have to wait for compilers to make
improvements either way.

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
discusses that compilers could (and should) optimize around atomics
better.

>
> > On x86-64 (clang 16 -O2) we get:
> >
> >   t0():
> >           movl    $1, x(%rip)
> >           movl    $1, %eax
> >           xchgl   %eax, dummy(%rip)
> >           mfence                    ;; Redundant with previous exchange.
>
> And that's just terrible :/ Nobody should be using MFENCE for this. And
> using MFENCE after a LOCK prefixes instruction (implicit in this case)
> is just fail, because I don't think C++ atomics cover MMIO and other
> such 'lovely' things.
>
> >           movl    y(%rip), %eax
> >           movl    %eax, r0(%rip)
> >           retq
> >   t1():
> >           movl    $1, y(%rip)
> >           mfence
> >           movl    x(%rip), %eax
> >           movl    %eax, r1(%rip)
> >           retq
>