Hi all!

This RFC patchset adds kernel support for BPF load-acquire and
store-release instructions (for background, please see [1]).  Currently
only arm64 is supported for the RFC.  The corresponding LLVM changes can
be found at:

  https://github.com/llvm/llvm-project/pull/108636

As discussed on GitHub [2], define both load-acquire and store-release
as BPF_STX | BPF_ATOMIC instructions.  The following new flags are
introduced:

  BPF_ATOMIC_LOAD   0x10
  BPF_ATOMIC_STORE  0x20

  BPF_RELAXED   0x0
  BPF_ACQUIRE   0x1
  BPF_RELEASE   0x2
  BPF_ACQ_REL   0x3
  BPF_SEQ_CST   0x4

  BPF_LOAD_ACQ  (BPF_ATOMIC_LOAD | BPF_ACQUIRE)
  BPF_STORE_REL (BPF_ATOMIC_STORE | BPF_RELEASE)

Bits 4-7 of 'imm' encode the new atomic operations (load and store), and
bits 0-3 specify the memory order.  A load-acquire is a BPF_STX |
BPF_ATOMIC instruction with 'imm' set to BPF_LOAD_ACQ (0x11).
Similarly, a store-release is a BPF_STX | BPF_ATOMIC instruction with
'imm' set to BPF_STORE_REL (0x22).

For bits 4-7 of 'imm', we need to avoid conflicts with existing BPF_STX
| BPF_ATOMIC instructions.  Currently the following values (a subset of
BPFArithOp<>) are in use:

  def BPF_ADD     : BPFArithOp<0x0>;
  def BPF_OR      : BPFArithOp<0x4>;
  def BPF_AND     : BPFArithOp<0x5>;
  def BPF_XOR     : BPFArithOp<0xa>;
  def BPF_XCHG    : BPFArithOp<0xe>;
  def BPF_CMPXCHG : BPFArithOp<0xf>;

0x1 and 0x2 were chosen for the new instructions because:

  * BPFArithOp<0x1> is BPF_SUB, and compilers already handle atomic
    subtraction by generating a BPF NEG followed by a BPF ADD
    instruction, so BPF_SUB will never be needed for BPF_ATOMIC.
  * BPFArithOp<0x2> is BPF_MUL, and we do not have a plan for adding
    BPF atomic multiplication instructions.

We therefore think that choosing 0x1 and 0x2 avoids conflicts with
BPFArithOp<> in the future.  (Previously, 0xb was chosen, because we
will never need BPF_MOV (BPFArithOp<0xb>) for BPF_ATOMIC.)  Please
suggest if you think different values should be used.

Based on the discussion in [3], for load-acquire the arm64 JIT compiler
generates LDAR (RCsc) instead of LDAPR (RCpc).
Will Deacon also suggested LDAR over LDAPR in an off-list conversation,
for the following reasons:

  a. Not all CPUs support LDAPR, as also pointed out in Paul E.
     McKenney's email (search for "older ARM64 hardware" in [3]).
  b. The extra ordering provided by RCsc is important in some use
     cases, e.g. locks.
  c. The arm64 ISA does not provide e.g. other atomic memory operations
     in RCpc.  In other words, it is not worth losing the extra
     ordering that LDAR provides if we would still be using RCsc for
     all other cases.

Unlike existing atomic operations, which only support BPF_W (32-bit)
and BPF_DW (64-bit) size modifiers, load-acquires and store-releases
also support BPF_B (8-bit) and BPF_H (16-bit).  An 8- or 16-bit
load-acquire zero-extends the value before writing it to a 32-bit
register, just like LDARH and friends.

Examples of using the new instructions (assuming little-endian):

  long foo(long *ptr) {
      return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
  }

Using clang -mcpu=v4, foo() can be compiled to:

  db 10 00 00 11 00 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))
  95 00 00 00 00 00 00 00  exit

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000011): BPF_LOAD_ACQ

For arm64, an LDAR instruction would be generated by the JIT compiler
for the above, e.g.:

  ldar x7, [x0]

Similarly, consider this 16-bit store-release:

  void bar(short *ptr, short val) {
      __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
  }

bar() can be compiled to (again, using clang -mcpu=v4):

  cb 21 00 00 22 00 00 00  store_release((u16 *)(r1 + 0x0), w2)
  95 00 00 00 00 00 00 00  exit

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000022): BPF_ATOMIC_STORE | BPF_RELEASE

An STLRH will be generated for it, e.g.:

  stlrh w1, [x0]

The complete mapping for arm64 is:

  load-acquire (BPF_LOAD_ACQ):
     8-bit  LDARB
    16-bit  LDARH
    32-bit  LDAR (32-bit)
    64-bit  LDAR (64-bit)

  store-release (BPF_STORE_REL):
     8-bit  STLRB
    16-bit  STLRH
    32-bit  STLR (32-bit)
    64-bit  STLR (64-bit)

Using the new instructions in arena is supported.  Inline assembly is
also supported.
For example, to emit a load-acquire from inline assembly:

  asm volatile("%0 = load_acquire((u64 *)(%1 + 0x0))"
               : "=r"(ret)
               : "r"(ptr)
               : "memory");

A new pre-defined macro, __BPF_FEATURE_LOAD_ACQ_STORE_REL, can be used
to detect whether clang supports BPF load-acquire and store-release.

Please refer to the individual kernel patches (and LLVM commits) for
details.  Any suggestions or corrections would be much appreciated!

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@xxxxxxxxxx/
[2] https://github.com/llvm/llvm-project/pull/108636#issuecomment-2389403477
[3] https://lore.kernel.org/bpf/75d1352e-c05e-4fdf-96bf-b1c3daaf41f0@paulmck-laptop/

Thanks,
Peilin Ye

Peilin Ye (4):
  bpf/verifier: Factor out check_load()
  bpf: Introduce load-acquire and store-release instructions
  selftests/bpf: Delete duplicate verifier/atomic_invalid tests
  selftests/bpf: Add selftests for load-acquire and store-release
    instructions

 arch/arm64/include/asm/insn.h                 |  8 ++
 arch/arm64/lib/insn.c                         | 34 +++++++
 arch/arm64/net/bpf_jit.h                      | 20 +++++
 arch/arm64/net/bpf_jit_comp.c                 | 85 +++++++++++++++++-
 include/linux/filter.h                        |  2 +
 include/uapi/linux/bpf.h                      | 13 +++
 kernel/bpf/core.c                             | 41 ++++++++-
 kernel/bpf/disasm.c                           | 14 +++
 kernel/bpf/verifier.c                         | 88 ++++++++++++-------
 tools/include/uapi/linux/bpf.h                | 13 +++
 .../selftests/bpf/prog_tests/arena_atomics.c  | 61 ++++++++++++-
 .../selftests/bpf/prog_tests/atomics.c        | 57 +++++++++++-
 .../selftests/bpf/progs/arena_atomics.c       | 62 ++++++++++++-
 tools/testing/selftests/bpf/progs/atomics.c   | 62 ++++++++++++-
 .../selftests/bpf/verifier/atomic_invalid.c   | 28 +++---
 .../selftests/bpf/verifier/atomic_load.c      | 71 +++++++++++++++
 .../selftests/bpf/verifier/atomic_store.c     | 70 +++++++++++++++
 17 files changed, 672 insertions(+), 57 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_load.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_store.c

-- 
2.47.1.613.gc27f4b7a9f-goog