Hi list!

As we are looking at running sched_ext-style BPF scheduling on
architectures with a more relaxed memory model (e.g. ARM), we would like
to:

 1. have fine-grained control over memory ordering in BPF (instead of
    defaulting to a full barrier), for performance reasons

 2. pay closer attention to whether memory barriers are being used
    correctly in BPF

To that end, our main goal here is to support more types of memory
barriers in BPF.  While Paul E. McKenney et al. are working on the
formalized BPF memory model [1], Paul agreed that it makes sense to
support some basic types first.  Additionally, we noticed an issue with
the __sync_*fetch*() compiler built-ins related to memory ordering,
which will be described in detail below.

I. We need more types of BPF memory barriers
--------------------------------------------

Currently, when it comes to BPF memory barriers, our choices are
effectively limited to:

 * compiler barrier: 'asm volatile ("" ::: "memory");'

 * full memory barriers implied by compiler built-ins like
   __sync_val_compare_and_swap()

We need more.  During offline discussion with Paul, we agreed that we
can start from:

 * load-acquire:  __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);

 * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);

Theoretically, the BPF JIT compiler could also reorder instructions,
just like Clang or GCC, though it might not currently do so.  If we ever
developed a more optimizing BPF JIT compiler, it would also be nice to
have an optimization barrier for it.  However, Alexei Starovoitov has
expressed that defining a BPF instruction with
'asm volatile ("" ::: "memory");' semantics might be tricky.

II. Implicit barriers can get confusing
---------------------------------------

We noticed that, somewhat surprisingly, the __sync_*fetch*() built-ins
do not always imply a full barrier for BPF on ARM.
For example, when using LLVM, the frequently-used __sync_fetch_and_add()
can imply either "relaxed" (no barrier) or "acquire and release" (full
barrier) semantics, depending on whether its return value is used:

Case (a): return value is used

  SEC("...")
  int64_t foo;

  int64_t func(...)
  {
          return __sync_fetch_and_add(&foo, 1);
  }

For case (a), Clang gave us:

  3: db 01 00 00 01 00 00 00  r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)

  opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
  imm (0x00000001): BPF_ADD | BPF_FETCH

Case (b): return value is ignored

  SEC("...")
  int64_t foo;

  int64_t func(...)
  {
          __sync_fetch_and_add(&foo, 1);
          return foo;
  }

For case (b), Clang gave us:

  3: db 12 00 00 00 00 00 00  lock *(u64 *)(r2 + 0x0) += r1

  opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
  imm (0x00000000): BPF_ADD

LLVM decided to drop BPF_FETCH, since the return value of
__sync_fetch_and_add() is being ignored [2].

Now, if we take a look at emit_lse_atomic() in the BPF JIT compiler code
for ARM64 (assuming that LSE atomic instructions are being used):

  case BPF_ADD:
          emit(A64_STADD(isdw, reg, src), ctx);
          break;
  <...>
  case BPF_ADD | BPF_FETCH:
          emit(A64_LDADDAL(isdw, src, reg, src), ctx);
          break;

STADD is an alias for LDADD.  According to [3]:

 * LDADDAL, used for case (a), has "acquire" plus "release" semantics

 * LDADD, used for case (b), "has neither acquire nor release semantics"

This is pretty unintuitive: a compiler built-in should not have
inconsistent implications on memory ordering, and it is better not to
require all BPF programmers to memorize this.

GCC seems a bit ambiguous [4] on whether the __sync_*fetch*() built-ins
should always imply a full barrier.  GCC considers these __sync_*()
built-ins "legacy", and has introduced a new set of __atomic_*()
built-ins ("Memory Model Aware Atomic Operations") [5] to replace them.
These __atomic_*() built-ins are designed to be much more explicit about
memory ordering.  For example:

  type __atomic_fetch_add (type *ptr, type val, int memorder)

This requires the programmer to specify a memory order type (relaxed,
acquire, release...) via the "memorder" parameter.

Currently in LLVM, for BPF, those __atomic_*fetch*() built-ins appear to
be aliases of their __sync_*fetch*() counterparts (the "memorder"
parameter seems effectively ignored), and are not fully supported.

III. Next steps
---------------

Roughly, the scope of this work includes:

 * decide how to extend the BPF ISA (add new instructions and/or extend
   current ones)
 * teach LLVM and GCC to generate the new/extended instructions
 * teach the BPF verifier to understand them
 * teach the BPF JIT compiler to compile them
 * update the BPF memory model and tooling
 * update the IETF specification

Additionally, for the issue described in the previous section, we need
to:

 * check whether GCC has the same behavior
 * at the very least, clearly document the implied effects of the
   current __sync_*fetch*() built-ins on BPF memory ordering (especially
   for architectures like ARM), as described above
 * fully support the new __atomic_*fetch*() built-ins for BPF, to
   replace the __sync_*fetch*() ones

Any suggestions or corrections would be most welcome!

Thanks,
Peilin Ye

[1] Instruction-Level BPF Memory Model
    https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
[2] For more information, see LLVM commit 286daafd6512 ("[BPF] support
    atomic instructions").  Search for "LLVM will check the return
    value" in the commit message.
[3] Arm Architecture Reference Manual for A-profile architecture
    (ARM DDI 0487K.a, ID032224), C6.2.149, page 2006
[4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
    6.58 Legacy __sync Built-in Functions for Atomic Memory Access
    "In most cases, these built-in functions are considered a full
    barrier."
[5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
    6.59 Built-in Functions for Memory Model Aware Atomic Operations