Hello!

This is an attempted summary of the discussions on the memory-ordering
properties of the upcoming BPF load-acquire instruction.  Please reply
to the group calling out any errors, omissions, or other commentary.

TL;DR: I am sticking with my position that the BPF load-acquire
instruction should have the weaker RCpc semantics (like ldapr, not ldar).
That said, it is entirely reasonable for ARM64 JITs to use the stronger
RCsc ldar instruction to implement the BPF load-acquire instruction.

Background and details:

o	The BPF load-acquire instruction might have RCsc ordering (like
	the ARM64 ldar instruction) or RCpc ordering (like the ARM64
	ldapr instruction).  One key difference between ldar and ldapr
	is that ldar is ordered against prior stlr instructions, but
	ldapr is not.

	Note well that there is only the one ARM64 store-release
	instruction, stlr.  This instruction pairs equally well with
	ldar and ldapr.

o	The stronger semantics for the ldar instruction were added on
	the advice of Herb Sutter of Microsoft.  The weaker semantics
	for ldapr were added on the advice of other Microsoft employees
	who actually write performance-critical concurrent code, but
	for mid-range CPUs that do some reordering but not so much
	speculation.  (High-end CPUs that do serious reordering and
	also serious speculation don't usually care much about the
	difference in ordering semantics, outside of benchmarks
	specially crafted to demonstrate the difference.)

	(Perhaps of historical interest, this mirrors the advice for
	C's and C++'s non-SC atomics.  Herb passionately advocated for
	atomics to be only SC, but an even more passionate group
	elsewhere within Microsoft reached out privately to register
	support for non-SC atomics in the strongest terms possible.
	So C and C++ had non-SC atomics from the get-go.)

o	The compilers do not guarantee RCsc, only RCpc.
	Attempts to provide stronger (and thus perhaps more expensive)
	RCsc semantics for the BPF load-acquire instruction can
	therefore be defeated by perfectly legal and reasonable
	compiler memory-reference-reordering optimizations.

o	The ARM64 ldar instruction was available first.  This means
	that any ARM64 JIT that emits ldapr for BPF load-acquire
	instructions must be prepared to emit ldar on older ARM64
	hardware that does not support ldapr.

o	If BPF is to JIT efficiently to PowerPC, BPF's load-acquire
	instruction must be implementable as ld;lwsync.  This has
	RCpc memory-ordering semantics similar to those of ARM64's
	ldapr instruction.

	In contrast, an RCsc load-acquire instruction (like ARM64
	ldar) would require sync;ld;lwsync.  The difference is that
	the sync instruction has global scope (its action covers the
	full system), while lwsync can be handled within the confines
	of the CPU's local store buffer.  The sync instruction is thus
	considerably more expensive than the lwsync instruction.

o	The fact that the Linux kernel runs reliably on PowerPC when
	using "ld;lwsync" for smp_load_acquire() provides evidence
	that ARM64 could safely use ldapr for smp_load_acquire() in
	common code.  However:

	o	The fact that older hardware does not support ldapr
		and the fact that distros strongly prefer a single
		Linux-kernel image per architecture mean that use of
		ldapr for smp_load_acquire() would likely require yet
		more boot-time binary rewriting, and might restrict
		migration of guest OSes from one ARM64 hardware
		system to another.

	o	There might well be uses of smp_load_acquire() in
		ARM64 architecture-specific code that need to emit
		the stronger ldar instruction.

	o	There are way more ARM64 systems than PowerPC systems.
		It is entirely possible that PowerPC is just getting
		lucky with its use of "ld;lwsync".  However, this
		applies to BPF programs just as surely as it does to
		core Linux-kernel code.  The risk of failures due to
		use of the weaker RCpc instruction sequence is lower
		on PowerPC than on ARM64.

o	All this leads me to stick with my position called out in the
	TL;DR above, namely that the BPF load-acquire instruction
	should have the weaker RCpc semantics (like ldapr, not ldar).
	That said, it is entirely reasonable for ARM64 JITs to use the
	stronger RCsc ldar instruction to implement the BPF
	load-acquire instruction.

Thoughts?

						Thanx, Paul
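
For concreteness, the ldar-versus-ldapr difference described above (a
load-acquire following a store-release) can be sketched as a
store-buffering test.  This is only an illustration, written under the
assumption that C11 release/acquire atomics are an acceptable stand-in
for the BPF instructions.  The names struct sb, sb_thread0(),
sb_thread1(), and sb_run_once() are made up for this sketch and do not
come from the discussion or from any kernel or BPF API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared state for one run of the store-buffering (SB) shape. */
struct sb {
	atomic_int x;	/* written by thread 0, read by thread 1 */
	atomic_int y;	/* written by thread 1, read by thread 0 */
	int r0;		/* value thread 0 loaded from y */
	int r1;		/* value thread 1 loaded from x */
};

/* Thread 0: store-release to x, then load-acquire from y. */
static void *sb_thread0(void *arg)
{
	struct sb *s = arg;

	atomic_store_explicit(&s->x, 1, memory_order_release);
	s->r0 = atomic_load_explicit(&s->y, memory_order_acquire);
	return NULL;
}

/* Thread 1: the mirror image, store-release to y, load-acquire from x. */
static void *sb_thread1(void *arg)
{
	struct sb *s = arg;

	atomic_store_explicit(&s->y, 1, memory_order_release);
	s->r1 = atomic_load_explicit(&s->x, memory_order_acquire);
	return NULL;
}

/*
 * Run the SB shape once and return r0 + r1 (0, 1, or 2).
 *
 * A return value of 0 means that each load-acquire was satisfied
 * before the other thread's store-release became visible.  C11
 * release/acquire permits that outcome (RCpc-like).  An RCsc
 * load-acquire, which is ordered after a prior store-release, would
 * forbid it at the hardware level.
 */
int sb_run_once(void)
{
	struct sb s;
	pthread_t t0, t1;
	int ret;

	atomic_init(&s.x, 0);
	atomic_init(&s.y, 0);
	s.r0 = -1;
	s.r1 = -1;

	ret = pthread_create(&t0, NULL, sb_thread0, &s);
	assert(ret == 0);
	ret = pthread_create(&t1, NULL, sb_thread1, &s);
	assert(ret == 0);

	/* pthread_join() makes the plain stores to r0 and r1 visible here. */
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return s.r0 + s.r1;
}
```

Calling sb_run_once() in a loop and counting the zero returns gives a
crude view of whether a given compiler and CPU ever exhibit the weaker
outcome.  Thread-startup skew tends to dominate, so the absence of
zeros proves nothing, and a proper litmus-test tool would be the better
choice for real measurements.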