Re: BPF memory model

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Sat, 9 Sep 2023 05:47:03 -0700

On Fri, Sep 08, 2023 at 04:16:39PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 8, 2023 at 3:07 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
> >
> > Hello,
> >
> > On Fri, Sep 08, 2023 at 01:26:11PM -0700, Josh Don wrote:
> > > I'm writing BPF programs for scheduling (i.e., sched_ext), so these are
> > > getting invoked in hot paths and invoked concurrently across multiple
> > > cpus (for example, pick_next_task, enqueue_task, etc.). The kernel is
> > > responsible for relaying ground truth, userspace makes O(ms)
> > > scheduling decisions, and BPF makes O(us) scheduling decisions.
> > > BPF-BPF concurrency is possible with spinlocks and RMW, BPF-userspace
> > > can currently only really use RMW. My line of questioning is more
> > > forward looking, as I'm preemptively thinking of how to ensure
> > > kernel-like scheduling performance, since BPF spinlock or RMW is
> > > sometimes overkill :) I would think that barrier() and smp_mb() would
> > > probably be the minimum viable set (at least for x86) that people
> > > would find useful, but maybe others can chime in.
> >
> > My personal favorite set is store_release/load_acquire(). I have a hard time
> > thinking up cases which can't be covered by them and they're basically free
> > on x86.
> 
> First of all, thanks Josh for highlighting this topic and
> gently nudging Paul to continue his work :)

I hereby consider myself nudged.  ;-)

> It's absolutely essential for BPF to have a well defined memory model.
> 
> It's necessary for fast sched-ext bpf progs and for HW offloads too.
> As a minimum we need to document it in Documentation/bpf/standardization/.

Ah, I see that in current mainline.

> It's much more challenging than it looks.
> Unlike with traditional ISAs, we cannot say that memory consistency is
> similar to x86 or arm64 or riscv.
> bpf memory consistency cannot pick the lowest common denominator either.
> The bpf memory model is most likely going to be pretty close to the kernel
> memory model instead of HW or C.
> In parallel we can start adding new concurrency primitives.

My first thought would be to look at instruction-set.rst in that
directory, and project LKMM onto the concurrency primitives that
are currently defined there.  The advantage of this is "just enough
LKMM" at any given time, but it would also mean that memory-model.rst
(or whatever eventual bikeshedded name) would need maintenance as new
concurrency primitives are added.  Which seems like the correct
approach, as opposed to attempting to define memory model concepts
for non-existent concurrency primitives.

Presumably, I also need to run this through the BPF standardization
process.

Or did you have something else in mind?

							Thanx, Paul

> Sounds like smp_load_acquire()/smp_store_release() should be the first pair.
> Here it's also more challenging than in the kernel.
> We cannot define bpf_smp_load_acquire() as a macro.
> It needs to be a new flavor of BPF_LDX instruction that JITs
> will convert into a proper sequence of insns.
> On x86-64 it will remain a normal load,
> while on arm64 it will be LDAR instead of LDR and so on.
> 
> Some of the barriers we can implement as kfuncs since they're slow anyway.
> Some other barriers would need to be new instructions too.
> The design would need to take into account multiple architectures,
> gcc/llvm considerations, verifier complexity, and,
> of course, include the bpf IETF standardization working group.
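
For illustration, the load-acquire/store-release pairing discussed above
can be sketched in userspace C with C11 atomics and pthreads.  This is
not a BPF API: no bpf_smp_load_acquire() or new BPF_LDX flavor exists in
this thread beyond the proposal, so memory_order_acquire and
memory_order_release stand in for the kernel's smp_load_acquire() and
smp_store_release().  The names payload, ready, producer(), consumer()
and run_message_passing() are hypothetical, chosen only for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared state for a message-passing sketch. */
static int payload;		/* plain data, published through the flag */
static atomic_int ready;	/* 0 = not yet published, 1 = published */

/* Writer: fill in the data, then publish it with a store-release. */
static void *producer(void *arg)
{
	(void)arg;
	payload = 42;
	/* Stand-in for smp_store_release(&ready, 1). */
	atomic_store_explicit(&ready, 1, memory_order_release);
	return NULL;
}

/* Reader: wait for the flag with a load-acquire, then read the data. */
static void *consumer(void *arg)
{
	int *out = arg;

	/* Stand-in for smp_load_acquire(&ready). */
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;	/* spin until the producer publishes */

	/* The acquire load that saw 1 orders this read after the store. */
	*out = payload;
	return NULL;
}

/* Runs one producer and one consumer; returns what the consumer saw. */
int run_message_passing(void)
{
	pthread_t p, c;
	int seen = -1;
	int rc;

	payload = 0;
	atomic_store_explicit(&ready, 0, memory_order_relaxed);

	rc = pthread_create(&c, NULL, consumer, &seen);
	assert(rc == 0);
	rc = pthread_create(&p, NULL, producer, NULL);
	assert(rc == 0);

	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return seen;
}
```

Calling run_message_passing() returns 42: once the consumer's acquire
load observes ready == 1, the earlier plain store to payload is
guaranteed to be visible.  On x86-64 a compiler typically emits ordinary
loads and stores for both accesses, while on arm64 it typically emits
LDAR and STLR, which is the per-architecture split a JIT would have to
reproduce for a load-acquire instruction.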