On Mon, Jul 29, 2024 at 06:28:16PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
> >
> > Hi list!
> >
> > As we are looking at running sched_ext-style BPF scheduling on
> > architectures with a more relaxed memory model (e.g. ARM), we would like
> > to:
> >
> >   1. have fine-grained control over memory ordering in BPF (instead of
> >      defaulting to a full barrier), for performance reasons
> >   2. pay closer attention to whether memory barriers are being used
> >      correctly in BPF
> >
> > To that end, our main goal here is to support more types of memory
> > barriers in BPF.  While Paul E. McKenney et al. are working on the
> > formalized BPF memory model [1], Paul agreed that it makes sense to
> > support some basic types first.  Additionally, we noticed an issue with
> > the __sync_*fetch*() compiler built-ins related to memory ordering, which
> > is described in detail below.
> >
> > I. We need more types of BPF memory barriers
> > --------------------------------------------
> >
> > Currently, when it comes to BPF memory barriers, our choices are
> > effectively limited to:
> >
> >   * compiler barrier: 'asm volatile ("" ::: "memory");'
> >   * full memory barriers implied by compiler built-ins like
> >     __sync_val_compare_and_swap()
> >
> > We need more.  During offline discussion with Paul, we agreed we can
> > start with:
> >
> >   * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
> >   * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
>
> we would need inline asm equivalent too. Similar to kernel
> smp_load_acquire() macro.
>
> > Theoretically, the BPF JIT compiler could also reorder instructions just
> > like Clang or GCC, though it might not currently do so.  If we ever
> > developed a more optimizing BPF JIT compiler, it would also be nice to
> > have an optimization barrier for it.
> > However, Alexei Starovoitov has expressed that defining a BPF
> > instruction with 'asm volatile ("" ::: "memory");' semantics might be
> > tricky.
>
> It can be a standalone insn that is a compiler barrier only, but that
> feels like a waste of an instruction.  So depending on how we end up
> encoding various real barriers, there may be a bit to spend in such a
> barrier insn that is only a compiler barrier.  In this case, an
> optimizing-JIT barrier.

When reading BPF instructions back into a compiler backend, would it make
sense to convert an acquire-load instruction back to
__atomic_load_n(... memorder=__ATOMIC_ACQUIRE)?  Or the internal
representation thereof?

							Thanx, Paul

> > II. Implicit barriers can get confusing
> > ---------------------------------------
> >
> > We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins
> > do not always imply a full barrier for BPF on ARM.  For example, when
> > using LLVM, the frequently-used __sync_fetch_and_add() can imply either
> > "relaxed" (no barrier) or "acquire and release" (full barrier) semantics,
> > depending on whether its return value is used:
> >
> > Case (a): return value is used
> >
> >   SEC("...")
> >   int64_t foo;
> >
> >   int64_t func(...) {
> >           return __sync_fetch_and_add(&foo, 1);
> >   }
> >
> > For case (a), Clang gave us:
> >
> >   3: db 01 00 00 01 00 00 00  r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
> >
> >     opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
> >     imm (0x00000001): BPF_ADD | BPF_FETCH
> >
> > Case (b): return value is ignored
> >
> >   SEC("...")
> >   int64_t foo;
> >
> >   int64_t func(...) {
> >           __sync_fetch_and_add(&foo, 1);
> >
> >           return foo;
> >   }
> >
> > For case (b), Clang gave us:
> >
> >   3: db 12 00 00 00 00 00 00  lock *(u64 *)(r2 + 0x0) += r1
> >
> >     opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
> >     imm (0x00000000): BPF_ADD
> >
> > LLVM decided to drop BPF_FETCH, since the return value of
> > __sync_fetch_and_add() is being ignored [2].
> > Now, if we take a look at emit_lse_atomic() in the BPF JIT compiler code
> > for ARM64 (suppose that LSE atomic instructions are being used):
> >
> >   case BPF_ADD:
> >           emit(A64_STADD(isdw, reg, src), ctx);
> >           break;
> >   <...>
> >   case BPF_ADD | BPF_FETCH:
> >           emit(A64_LDADDAL(isdw, src, reg, src), ctx);
> >           break;
> >
> > STADD is an alias for LDADD.  According to [3]:
> >
> >   * LDADDAL for case (a) has "acquire" plus "release" semantics
> >   * LDADD for case (b) "has neither acquire nor release semantics"
> >
> > This is pretty non-intuitive; a compiler built-in should not have
> > inconsistent implications on memory ordering, and it is better not to
> > require all BPF programmers to memorize this.
> >
> > GCC seems a bit ambiguous [4] on whether the __sync_*fetch*() built-ins
> > should always imply a full barrier.  GCC considers these __sync_*()
> > built-ins "legacy", and introduced a new set of __atomic_*() built-ins
> > ("Memory Model Aware Atomic Operations") [5] to replace them.  These
> > __atomic_*() built-ins are designed to be a lot more explicit about
> > memory ordering, for example:
> >
> >   type __atomic_fetch_add (type *ptr, type val, int memorder)
> >
> > This requires the programmer to specify a memory order type (relaxed,
> > acquire, release...) via the "memorder" parameter.  Currently in LLVM,
> > for BPF, those __atomic_*fetch*() built-ins seem to be aliases to their
> > __sync_*fetch*() counterparts (the "memorder" parameter seems
> > effectively ignored), and are not fully supported.
>
> This sounds like a compiler bug.
>
> Yonghong, Jose,
> do you know what compilers do for other backends?
> Is it allowed to convert sync_fetch_add into sync_add when the fetch part
> is unused?
>
> > III. Next steps
> > ---------------
> >
> > Roughly, the scope of this work includes:
> >
> >   * decide how to extend the BPF ISA (add new instructions and/or extend
> >     current ones)
>
> ldx/stx insns support MEM and MEMSX modifiers.
> Adding MEM_ACQ_REL feels like a natural fit. Better name?
>
> For barriers we would need a new insn. Not sure which class would fit
> the best.  Maybe BPF_LD ?
>
> Another alternative for barriers is to use nocsr kfuncs.
> Then we have the freedom to make mistakes and fix them later.
> One kfunc per barrier would do.
> JITs would inline them into appropriate insns.
> In bpf progs they will be used just like in the kernel code smp_mb(),
> smp_rmb(), etc.
>
> I don't think compilers have to emit barriers from C code, so my
> preference is kfuncs atm.
>
> >   * teach LLVM and GCC to generate the new/extended instructions
> >   * teach the BPF verifier to understand them
> >   * teach the BPF JIT compiler to compile them
> >   * update BPF memory model and tooling
> >   * update IETF specification
> >
> > Additionally, for the issue described in the previous section, we need
> > to:
> >
> >   * check if GCC has the same behavior
> >   * at least clearly document the implied effects on BPF memory ordering
> >     of the current __sync_*fetch*() built-ins (especially for
> >     architectures like ARM), as described
> >   * fully support the new __atomic_*fetch*() built-ins for BPF to
> >     replace the __sync_*fetch*() ones
> >
> > Any suggestions or corrections would be most welcome!
> >
> > Thanks,
> > Peilin Ye
> >
> >
> > [1] Instruction-Level BPF Memory Model
> >     https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
> >
> > [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support
> >     atomic instructions").  Search for "LLVM will check the return
> >     value" in the commit message.
> >
> > [3] Arm Architecture Reference Manual for A-profile architecture (ARM
> >     DDI 0487K.a, ID032224), C6.2.149, page 2006
> >
> > [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> >     6.58 Legacy __sync Built-in Functions for Atomic Memory Access
> >     "In most cases, these built-in functions are considered a full
> >     barrier."
> >
> > [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> >     6.59 Built-in Functions for Memory Model Aware Atomic Operations