Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Tue, 04 Jul 2023, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
[...]
>> On x86-64 (gcc 13.1 -O2) we get:
>> 
>>   t0():
>>           movl    $1, x(%rip)
>>           movl    $1, %eax
>>           xchgl   dummy(%rip), %eax
>>           lock orq $0, (%rsp)       ;; Redundant with previous exchange.
>>           movl    y(%rip), %eax
>>           movl    %eax, r0(%rip)
>>           ret
>>   t1():
>>           movl    $1, y(%rip)
>>           lock orq $0, (%rsp)
>>           movl    x(%rip), %eax
>>           movl    %eax, r1(%rip)
>>           ret
>
> So I would expect the compilers to do better here. It should know those
> __atomic_thread_fence() thingies are superfluous and simply not emit
> them. This could even be done as a peephole pass later, where it sees
> consecutive atomic ops and the second being a no-op.

Indeed, a peephole optimization could work for this Dekker example, if
the compiler adds a pattern for it.  However, AFAIK, a peephole cannot
be applied when the two fences are in different basic blocks, for
example when a fence is only emitted on a compare_exchange success.
This limitation implies that the optimization cannot be done across
functions/modules (shared libraries).  For example, it would be
interesting to be able to promote an acquire fence of a
pthread_mutex_lock() to a full fence on weakly ordered architectures
while preventing a redundant fence on strongly ordered architectures.
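
To make the basic-block case concrete, here is a minimal sketch.  The
names lock_word, other_flag and try_lock_then_read() are hypothetical
and only serve the illustration.  The fence sits on the success path
of the compare-exchange, so it is not adjacent to the locked
instruction that a peephole pass would have to match against:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int lock_word;	/* hypothetical: 0 = free, 1 = taken */
static int other_flag;	/* hypothetical: set by another thread */

/* Try to take the lock; on success, read other_flag after a full fence. */
bool try_lock_then_read(int *seen)
{
	int expected = 0;

	assert(seen != NULL);

	/* On x86-64 this compare-exchange is a lock-prefixed cmpxchg. */
	if (!__atomic_compare_exchange_n(&lock_word, &expected, 1, false,
					 __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
		return false;	/* Failure path: no fence wanted here. */

	/*
	 * Success path, reached through a conditional branch.  The fence
	 * is in a different basic block than the compare-exchange, so a
	 * pass that only looks at consecutive instructions may not see
	 * that the locked instruction already acts as a full barrier.
	 */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	*seen = __atomic_load_n(&other_flag, __ATOMIC_RELAXED);
	return true;
}
```

Whether a given compiler version drops the fence here is not something
we checked for this sketch; it only shows the shape of the code.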

We know that at least Clang has such peephole optimizations for some
architecture backends.  However, they do not seem to recognize
lock-prefixed instructions as fences.  AFAIK, GCC does not have that
kind of optimization.

We are also aware that some research has been done on this topic [0].
The idea is to use PRE (partial redundancy elimination) for the
elimination of redundant fences.  This would work across multiple basic
blocks, although the paper focuses on intra-procedural elimination.
However, it seems that the latest work on that [1] has never been
completed [2].

Our proposed approach provides a means for the user to express -- and
document -- the wanted semantics in the source code.  This allows the
compiler to only emit the wanted fences, without relying on
architecture-specific backend optimizations.  In other words, this
applies even to unoptimized binaries.
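
As a rough illustration of the idea only, and not the interface
proposed in this thread, the intent can be approximated today with a
hypothetical fence_after_rmw() macro chosen at preprocessing time.  The
macro name and t0_sketch() are assumptions made for this sketch; x, y
and dummy mirror the variables in the assembly quoted above:

```c
/*
 * Hypothetical helper, NOT the API proposed in this thread.
 * "Full fence after a read-modify-write": on x86 the locked RMW is
 * already a full barrier, so only a compiler barrier is kept there.
 */
#if defined(__x86_64__) || defined(__i386__)
# define fence_after_rmw()	__atomic_signal_fence(__ATOMIC_SEQ_CST)
#else
# define fence_after_rmw()	__atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

static int x, y, dummy;

/* One side of the Dekker pattern: store x, RMW, full fence, load y. */
int t0_sketch(void)
{
	__atomic_store_n(&x, 1, __ATOMIC_RELAXED);
	(void)__atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
	fence_after_rmw();	/* No fence instruction on x86. */
	return __atomic_load_n(&y, __ATOMIC_RELAXED);
}
```

Because the choice is made in the source, it does not depend on the
optimization level.  The drawback of a macro like this one is that each
project has to carry its own per-architecture knowledge.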

[...]

	Thanks,
        Olivier

  [0] https://dl.acm.org/doi/10.1145/3033019.3033021

  [1] https://discourse.llvm.org/t/fence-elimination-pass-proposal/33679

  [2] https://reviews.llvm.org/D5758
-- 
Olivier Dion
EfficiOS Inc.
https://www.efficios.com


