Re: BPF memory model

Josh Don <joshdon@xxxxxxxxxx> · Fri, 8 Sep 2023 13:26:11 -0700

On Fri, Sep 8, 2023 at 1:43 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Thu, Sep 07, 2023 at 03:00:56PM -0700, Josh Don wrote:
> > Has there been any further interest in supporting additional
> > kernel-style atomics in BPF that you know of?
>
> This is one of the first that I have heard of.  ;-)
>
> But what BPF programs are you running that are seeing excessive
> synchronization overhead?  That will tell us which operations to
> start with.  (Or maybe it is time to just add the full Linux-kernel
> atomic-operations kitchen sink, but that would not normally be the way
> to bet.)

I'm writing BPF programs for scheduling (i.e., sched_ext), so these are
invoked in hot paths and concurrently across multiple CPUs (for
example, pick_next_task, enqueue_task, etc.). The kernel is
responsible for relaying ground truth, userspace makes O(ms)
scheduling decisions, and BPF makes O(us) scheduling decisions.
BPF-BPF concurrency is possible with spinlocks and RMW; BPF-userspace
can currently only really use RMW. My line of questioning is more
forward-looking, as I'm preemptively thinking of how to ensure
kernel-like scheduling performance, since a BPF spinlock or RMW is
sometimes overkill :) I would think that barrier() and smp_mb() would
probably be the minimum viable set (at least for x86) that people
would find useful, but maybe others can chime in.
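
As a rough illustration of the BPF-userspace case, here is a sketch in
plain userspace C11 (not BPF; the slot layout and the names publish()
and try_consume() are made up for this example). A flag-plus-payload
handoff only needs release/acquire ordering, which on x86 compiles to
no fence instruction, just a compiler barrier, so a spinlock or an RMW
would be more than the pattern requires:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-slot mailbox shared by a producer and a consumer. */
struct msg_slot {
	int payload;      /* plain data, protected by the ordering below */
	atomic_int ready; /* 0 = empty, 1 = payload is valid */
};

struct msg_slot slot;

/* Producer side: write the payload, then make it visible via the flag. */
void publish(int value)
{
	slot.payload = value;
	/* Release fence: the payload store cannot move after the flag store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&slot.ready, 1, memory_order_relaxed);
}

/* Consumer side: returns true and fills *out if a payload was published. */
bool try_consume(int *out)
{
	if (!atomic_load_explicit(&slot.ready, memory_order_relaxed))
		return false;
	/* Acquire fence: the payload load cannot move before the flag load. */
	atomic_thread_fence(memory_order_acquire);
	*out = slot.payload;
	return true;
}
```

On weakly ordered architectures the same fences would emit real
barrier instructions, which is why the "at least for x86" caveat
matters.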

> > And on a different BPF note, one thing I wasn't sure about was the
> > ability of the cpu to reorder loads and stores across the BPF program
> > call boundary. For example, could the load of "z" in the BPF program
> > below be reordered before the store to x in the kernel? I'm sure that
> > no compiler barrier is ever necessary here since the BPF program is
> > compiled separately from the kernel, but I'm not sure whether a
> > hardware barrier is necessary.
> > <kernel>
> > x = 3
> > call_bpf();
> >   <bpf>
> >   int y = z;
>
> Given that a major goal of BPF is the ability to add low-overhead
> programs to code on fastpaths, I would not expect any implicit barriers
> in that case.  Consider for example counting the number of calls to a
> "hot" function in the Linux kernel, in which case adding full ordering
> would incur unacceptable performance degradation.  I would instead
> expect that the BPF program would need to add explicit barriers or
> ordered RMW operations.

Yep, that was my expectation as well. On the plus side, this gives the
flexibility of adding barriers only where they are really needed.
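
For concreteness, the x/z example with the barrier made explicit might
look like the sketch below. This is again ordinary userspace C11
standing in for the kernel and the BPF program; kernel_caller(),
bpf_prog(), and other_cpu() are hypothetical names, other_cpu() is an
assumed second party that the original example did not spell out, and
the seq_cst fence plays the role of smp_mb():

```c
#include <stdatomic.h>

atomic_int x, z;

/* Stand-in for the BPF program: order the caller's store to x before
 * our load of z with an explicit full barrier. */
int bpf_prog(void)
{
	atomic_thread_fence(memory_order_seq_cst); /* smp_mb() analogue */
	return atomic_load_explicit(&z, memory_order_relaxed);
}

/* Stand-in for the kernel call site: store x, then invoke the program.
 * The call itself implies no hardware ordering. */
int kernel_caller(void)
{
	atomic_store_explicit(&x, 3, memory_order_relaxed);
	return bpf_prog();
}

/* Assumed concurrent observer on another CPU: store z, full barrier,
 * then load x. */
int other_cpu(void)
{
	atomic_store_explicit(&z, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&x, memory_order_relaxed);
}
```

With both fences in place, kernel_caller() and other_cpu() cannot both
return 0 when run concurrently. Drop either fence and that outcome
becomes possible, even on x86, since this is the store-then-load case
that x86 is allowed to reorder.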

Best,
Josh