On Tue, 4 Jul 2023 at 10:47, Peter Zijlstra wrote: > > On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote: > > > int x = 0; > > int y = 0; > > int r0, r1; > > > > int dummy; > > > > void t0(void) > > { > > __atomic_store_n(&x, 1, __ATOMIC_RELAXED); > > > > __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST); > > __atomic_thread_fence(__ATOMIC_SEQ_CST); > > > > r0 = __atomic_load_n(&y, __ATOMIC_RELAXED); > > } > > > > void t1(void) > > { > > __atomic_store_n(&y, 1, __ATOMIC_RELAXED); > > __atomic_thread_fence(__ATOMIC_SEQ_CST); > > r1 = __atomic_load_n(&x, __ATOMIC_RELAXED); > > } > > > > // BUG_ON(r0 == 0 && r1 == 0) > > > > On x86-64 (gcc 13.1 -O2) we get: > > > > t0(): > > movl $1, x(%rip) > > movl $1, %eax > > xchgl dummy(%rip), %eax > > lock orq $0, (%rsp) ;; Redundant with previous exchange. > > movl y(%rip), %eax > > movl %eax, r0(%rip) > > ret > > t1(): > > movl $1, y(%rip) > > lock orq $0, (%rsp) > > movl x(%rip), %eax > > movl %eax, r1(%rip) > > ret > > So I would expect the compilers to do better here. It should know those > __atomic_thread_fence() thingies are superfluous and simply not emit > them. This could even be done as a peephole pass later, where it sees > consecutive atomic ops and the second being a no-op. Right, I don't see why we need a whole set of new built-ins that say "this fence isn't needed if the adjacent atomic op already implies a fence". If the adjacent atomic op already implies a fence for a given ISA, then the compiler should already be able to elide the explicit fence. So just write your code with the explicit fence, and rely on the compiler to optimize it properly. Admittedly, today's compilers don't do that optimization well, but they also don't support your proposed built-ins, so you're going to have to wait for compilers to make improvements either way. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html discusses that compilers could (and should) optimize around atomics better. > > > On x86-64 (clang 16 -O2) we get: > > > > t0(): > > movl $1, x(%rip) > > movl $1, %eax > > xchgl %eax, dummy(%rip) > > mfence ;; Redundant with previous exchange. > > And that's just terrible :/ Nobody should be using MFENCE for this. And > using MFENCE after a LOCK prefixes instruction (implicit in this case) > is just fail, because I don't think C++ atomics cover MMIO and other > such 'lovely' things. > > > movl y(%rip), %eax > > movl %eax, r0(%rip) > > retq > > t1(): > > movl $1, y(%rip) > > mfence > > movl x(%rip), %eax > > movl %eax, r1(%rip) > > retq >