On Fri, Sep 8, 2023 at 3:07 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Fri, Sep 08, 2023 at 01:26:11PM -0700, Josh Don wrote:
> > I'm writing BPF programs for scheduling (ie. sched_ext), so these are
> > getting invoked in hot paths and invoked concurrently across multiple
> > cpus (for example, pick_next_task, enqueue_task, etc.). The kernel is
> > responsible for relaying ground truth, userspace makes O(ms)
> > scheduling decisions, and BPF makes O(us) scheduling decisions.
> > BPF-BPF concurrency is possible with spinlocks and RMW, BPF-userspace
> > can currently only really use RMW. My line of questioning is more
> > forward looking, as I'm preemptively thinking of how to ensure
> > kernel-like scheduling performance, since BPF spinlock or RMW is
> > sometimes overkill :) I would think that barrier() and smp_mb() would
> > probably be the minimum viable set (at least for x86) that people
> > would find useful, but maybe others can chime in.
>
> My personal favorite set is store_release/load_acquire(). I have a hard time
> thinking up cases which can't be covered by them and they're basically free
> on x86.

First of all, thanks Josh for highlighting this topic and gently nudging
Paul to continue his work :)

It's absolutely essential for BPF to have a well defined memory model.
It's necessary for fast sched-ext bpf progs and for HW offloads too.
As a minimum we need to document it in Documentation/bpf/standardization/.

It's much more challenging than it looks. Unlike with traditional ISAs,
we cannot say that memory consistency is similar to x86 or arm64 or riscv.
bpf memory consistency cannot pick the lowest common denominator either.
The bpf memory model is most likely going to be pretty close to the kernel
memory model instead of HW or C.

In parallel we can start adding new concurrency primitives.
Sounds like smp_load_acquire()/smp_store_release() should be the first pair.
Here it's also more challenging than in the kernel.
We cannot define bpf_smp_load_acquire() as a macro. It needs to be a new
flavor of the BPF_LDX instruction that JITs will convert into a proper
sequence of insns. On x86-64 it will remain a normal load, while on arm64
it will be LDAR instead of LDR, and so on.

Some of the barriers we can implement as kfuncs since they're slow anyway.
Some other barriers would need to be new instructions too.

The design would need to take into account multiple architectures,
gcc/llvm considerations, verifier complexity, and, of course, involve the
bpf IETF standardization working group.
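
To make the acquire/release pairing concrete, here is a rough sketch in
plain C11 atomics. It is not BPF code and not a proposed API: struct mailbox,
mailbox_publish() and mailbox_try_consume() are hypothetical names made up
for illustration. It only shows the ordering guarantee that a load-acquire
instruction would have to provide: a reader that observes the flag set by
the store-release also observes the payload written before it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mailbox shared between one producer and one consumer. */
struct mailbox {
	int payload;		/* plain data, written before the flag */
	atomic_bool ready;	/* flag published with release semantics */
};

/* Producer side: write the data, then publish it with a store-release. */
static void mailbox_publish(struct mailbox *mb, int value)
{
	mb->payload = value;
	/* Orders the payload store before the flag store. */
	atomic_store_explicit(&mb->ready, true, memory_order_release);
}

/* Consumer side: a load-acquire of the flag orders the payload read after it. */
static bool mailbox_try_consume(struct mailbox *mb, int *out)
{
	if (!atomic_load_explicit(&mb->ready, memory_order_acquire))
		return false;	/* nothing published yet, *out is untouched */
	*out = mb->payload;
	return true;
}
```

With current compilers these two accesses are typically lowered to ordinary
MOVs on x86-64 and to LDAR/STLR on arm64, which is the same per-architecture
split a JIT would have to make for a new BPF_LDX flavor.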