Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

Jonas Oberhauser <jonas.oberhauser@xxxxxxxxxxxxxxx> · Fri, 7 Jul 2023 12:40:07 +0200

Hi all,

Am 7/3/2023 um 9:20 PM schrieb Olivier Dion:
Hi all,

This is a request for comments on extending the atomic builtins API to
help avoiding redundant memory barriers.  Indeed, there are
discrepancies between the Linux kernel consistency memory model (LKMM)
and the C11/C++11 memory consistency model [0].  For example,
fully-ordered atomic operations like xchg and cmpxchg success in LKMM
have implicit memory barriers before/after the operations [1-2], while
atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11
do not have any ordering guarantees of an atomic thread fence
__ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].

The issues run quite a bit deeper than this. The two models have two 
completely different perspectives that are quite much incompatible.
I think all you can really do is bridge the gap at the level of the 
generated assembly.
I.e., don't bridge the gap between LKMM and the C11 MCM. Bridge the gap 
between the assembly code generated by C11 atomics and the one generated 
by LKMM. But I'm not sure that's really the task here.

[...] For example, to make Read-Modify-Write (RMW) operations match
the Linux kernel "full barrier before/after" semantics, the liburcu's
uatomic API has to emit both a SEQ_CST RMW operation and a subsequent
thread fence SEQ_CST, which leads to duplicated barriers in some cases.

Does it have to though? Can't you just do e.g. an release RMW operation 
followed by an after_atomic  fence?
And for loads, a SEQ_CST fence followed by an acquire load? Analogously 
(but: mirrored) for stores.

   // Always emit thread fence.
   __atomic_thread_fence_{before,after}_load(int load_memorder,
                                             int fence_memorder)

   // NOP for store_memorder == SEQ_CST.
   // Otherwise, emit thread fence.
   __atomic_thread_fence_{before,after}_store(int store_memorder,
                                              int fence_memorder)

    // NOP for clear_memorder == SEQ_CST.
    // Otherwise, emit thread fence.
   __atomic_thread_fence_{before,after}_clear(int clear_memorder,
                                              int fence_memorder)

    // Always NOP.
   __atomic_thread_fence_{before,after}_rmw(int rmw_memorder,
                                            int fence_memorder)

I currently don't feel comfortable adding such extensions to LKMM (or a 
compiler API for that matter).

You mentioned that the goal is to check some code written using LKMM 
primitives with TSAN due to some formal requirements. What exactly do 
these requirements entail? Do you need to check the code exactly as it 
will be executed (modulo the TSAN instrumentation)? Is it an option to 
map to normal builtins with suboptimal performance just for the 
verification purpose, but then run the slightly more optimized original 
code later?

Specifically for TSAN's ordering requirements, you may need to make 
LKMM's RMWs into acq+rel with an extra mb, even if all that extra 
ordering isn't necessary at the assembler level.

Also note that no matter what you do, due to the two different 
perspectives, TSAN's hb relation may introduce false positive data races 
w.r.t. LKMM.  For example, if the happens-before ordering is guaranteed 
through pb starting with coe/fre.

Without thinking too hard, it seems to me no matter what fences and 
barriers you introduce, TSAN will not see this kind of ordering and 
consider the situation a data race.

(And at least in our own verification methodology for rcu/smr, ordering 
through fre appears when the rscs reads something that is later 
overwritten by the writer. Not sure off the top of my head if this fre 
ordering is what prevents the data race though, or there's some 
additional ordering that TSAN may also detect.

As a side note, according to LKMM we would also have data races in our 
critical sections, but I believe use of rcu_dereference would fix that, 
so you may not experience such data races in your code).

best wishes,
jonas