Hi Paul,

I was chatting with Dave Marchevsky about the BPF memory model and had some follow-up questions you might be able to answer.

I've been using the built-in RMW operations for a lot of lockless programming, both for concurrent BPF-BPF and especially for userspace-BPF (the latter has become a lot more interesting with the sched_ext work from Meta). It would of course be nice to sometimes lower the synchronization overhead to a hardware barrier or a compiler barrier, to allow for general-use acquire/release semantics (rather than needing to fall back to a lock RMW instruction).

I saw your presentation from 2021 on this topic here:

https://lpc.events/event/11/contributions/941/attachments/859/1667/bpf-memory-model.2020.09.22a.pdf

Has there been any further interest in supporting additional kernel-style atomics in BPF that you know of?

On a different BPF note, one thing I wasn't sure about is whether the CPU can reorder loads and stores across the BPF program call boundary. For example, could the load of "z" in the BPF program below be reordered before the store to "x" in the kernel? I'm sure that no compiler barrier is ever necessary here, since the BPF program is compiled separately from the kernel, but I'm not sure whether a hardware barrier is necessary.

<kernel>
x = 3;
call_bpf();

<bpf>
int y = z;

Best,
Josh
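
P.S. To make the acquire/release part of the question concrete, here is a rough userspace analogue in C11. It is not BPF code, and the names (payload, ready, publish, try_consume) are purely illustrative assumptions. It only sketches the pattern I would like to express without a lock RMW instruction: the producer publishes with a store-release, and the consumer checks with a load-acquire before reading the data.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state. In the real use case this would live in
 * memory shared between a BPF program and userspace. */
int payload;          /* plain data, published via `ready` */
atomic_bool ready;    /* flag that carries the release/acquire ordering */

/* Producer side: write the data, then publish it with a store-release,
 * so the write to `payload` is ordered before the flag becomes visible. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: a load-acquire of the flag orders the later read of
 * `payload` after it. Returns false if nothing has been published yet. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

In practice publish() and try_consume() would run on different CPUs. A consumer that sees ready == true is then guaranteed to also see the value written to payload, with no atomic RMW involved on either side.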