On Mon, Jul 29, 2024 at 06:28:16PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
> >
> > Hi list!
> >
> > As we are looking at running sched_ext-style BPF scheduling on
> > architectures with a more relaxed memory model (e.g. ARM), we would like
> > to:
> >
> >   1. have fine-grained control over memory ordering in BPF (instead of
> >      defaulting to a full barrier), for performance reasons
> >   2. pay closer attention to whether memory barriers are being used
> >      correctly in BPF
> >
> > To that end, our main goal here is to support more types of memory
> > barriers in BPF.  While Paul E. McKenney et al. are working on the
> > formalized BPF memory model [1], Paul agreed that it makes sense to
> > support some basic types first.  Additionally, we noticed an issue with
> > the __sync_*fetch*() compiler built-ins related to memory ordering, which
> > is described in detail below.
> >
> > I. We need more types of BPF memory barriers
> > --------------------------------------------
> >
> > Currently, when it comes to BPF memory barriers, our choices are
> > effectively limited to:
> >
> >   * compiler barrier: 'asm volatile ("" ::: "memory");'
> >   * full memory barriers implied by compiler built-ins like
> >     __sync_val_compare_and_swap()
> >
> > We need more.  During offline discussion with Paul, we agreed we can
> > start with:
> >
> >   * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
> >   * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
>
> we would need inline asm equivalent too. Similar to kernel
> smp_load_acquire() macro.
>
> > Theoretically, the BPF JIT compiler could also reorder instructions just
> > like Clang or GCC, though it might not currently do so.  If we ever
> > developed a more optimizing BPF JIT compiler, it would also be nice to
> > have an optimization barrier for it.
> > However, Alexei Starovoitov has expressed that defining a BPF
> > instruction with 'asm volatile ("" ::: "memory");' semantics might be
> > tricky.
>
> It can be a standalone insn that is a compiler barrier only, but that
> feels like a waste of an instruction.  So depending on how we end up
> encoding various real barriers, there may be a bit to spend in such a
> barrier insn that is only a compiler barrier.  In this case, an
> optimizing-JIT barrier.

When reading BPF instructions back into a compiler backend, would it make
sense to convert an acquire-load instruction back to
__atomic_load_n(... memorder=__ATOMIC_ACQUIRE)?  Or the internal
representation thereof?

							Thanx, Paul

> > II. Implicit barriers can get confusing
> > ---------------------------------------
> >
> > We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins
> > do not always imply a full barrier for BPF on ARM.  For example, when
> > using LLVM, the frequently-used __sync_fetch_and_add() can imply either
> > "relaxed" (no barrier) or "acquire and release" (full barrier) semantics,
> > depending on whether its return value is used:
> >
> > Case (a): return value is used
> >
> >   SEC("...")
> >   int64_t foo;
> >
> >   int64_t func(...) {
> >           return __sync_fetch_and_add(&foo, 1);
> >   }
> >
> > For case (a), Clang gave us:
> >
> >   3: db 01 00 00 01 00 00 00  r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
> >
> >     opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
> >     imm (0x00000001): BPF_ADD | BPF_FETCH
> >
> > Case (b): return value is ignored
> >
> >   SEC("...")
> >   int64_t foo;
> >
> >   int64_t func(...) {
> >           __sync_fetch_and_add(&foo, 1);
> >
> >           return foo;
> >   }
> >
> > For case (b), Clang gave us:
> >
> >   3: db 12 00 00 00 00 00 00  lock *(u64 *)(r2 + 0x0) += r1
> >
> >     opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
> >     imm (0x00000000): BPF_ADD
> >
> > LLVM decided to drop BPF_FETCH, since the return value of
> > __sync_fetch_and_add() is being ignored [2].
> > Now, if we take a look at emit_lse_atomic() in the BPF JIT compiler code
> > for ARM64 (suppose that LSE atomic instructions are being used):
> >
> >   case BPF_ADD:
> >           emit(A64_STADD(isdw, reg, src), ctx);
> >           break;
> >   <...>
> >   case BPF_ADD | BPF_FETCH:
> >           emit(A64_LDADDAL(isdw, src, reg, src), ctx);
> >           break;
> >
> > STADD is an alias for LDADD.  According to [3]:
> >
> >   * LDADDAL for case (a) has "acquire" plus "release" semantics
> >   * LDADD for case (b) "has neither acquire nor release semantics"
> >
> > This is pretty non-intuitive; a compiler built-in should not have
> > inconsistent implications on memory ordering, and it is better not to
> > require all BPF programmers to memorize this.
> >
> > GCC seems a bit ambiguous [4] on whether the __sync_*fetch*() built-ins
> > should always imply a full barrier.  GCC considers these __sync_*()
> > built-ins "legacy", and introduced a new set of __atomic_*() built-ins
> > ("Memory Model Aware Atomic Operations") [5] to replace them.  These
> > __atomic_*() built-ins are designed to be a lot more explicit about
> > memory ordering, for example:
> >
> >   type __atomic_fetch_add (type *ptr, type val, int memorder)
> >
> > This requires the programmer to specify a memory order type (relaxed,
> > acquire, release...) via the "memorder" parameter.  Currently in LLVM,
> > for BPF, those __atomic_*fetch*() built-ins seem to be aliases to their
> > __sync_*fetch*() counterparts (the "memorder" parameter seems
> > effectively ignored), and are not fully supported.
>
> This sounds like a compiler bug.
>
> Yonghong, Jose,
> do you know what compilers do for other backends?
> Is it allowed to convert sync_fetch_add into sync_add when the fetch part
> is unused?
>
> > III. Next steps
> > ---------------
> >
> > Roughly, the scope of this work includes:
> >
> >   * decide how to extend the BPF ISA (add new instructions and/or extend
> >     current ones)
>
> ldx/stx insns support MEM and MEMSX modifiers.
> Adding MEM_ACQ_REL feels like a natural fit. Better name?
>
> For barriers we would need a new insn. Not sure which class would fit
> the best.  Maybe BPF_LD ?
>
> Another alternative for barriers is to use nocsr kfuncs.
> Then we have the freedom to make mistakes and fix them later.
> One kfunc per barrier would do.
> JITs would inline them into appropriate insns.
> In bpf progs they will be used just like in the kernel code smp_mb(),
> smp_rmb(), etc.
>
> I don't think compilers have to emit barriers from C code, so my
> preference is kfuncs atm.
>
> >   * teach LLVM and GCC to generate the new/extended instructions
> >   * teach the BPF verifier to understand them
> >   * teach the BPF JIT compiler to compile them
> >   * update BPF memory model and tooling
> >   * update IETF specification
> >
> > Additionally, for the issue described in the previous section, we need
> > to:
> >
> >   * check if GCC has the same behavior
> >   * at least clearly document the implied effects on BPF memory ordering
> >     of the current __sync_*fetch*() built-ins (especially for
> >     architectures like ARM), as described
> >   * fully support the new __atomic_*fetch*() built-ins for BPF to
> >     replace the __sync_*fetch*() ones
> >
> > Any suggestions or corrections would be most welcome!
> >
> > Thanks,
> > Peilin Ye
> >
> >
> > [1] Instruction-Level BPF Memory Model
> >     https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
> >
> > [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support
> >     atomic instructions").  Search for "LLVM will check the return
> >     value" in the commit message.
> >
> > [3] Arm Architecture Reference Manual for A-profile architecture (ARM
> >     DDI 0487K.a, ID032224), C6.2.149, page 2006
> >
> > [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> >     6.58 Legacy __sync Built-in Functions for Atomic Memory Access
> >     "In most cases, these built-in functions are considered a full
> >     barrier."
> >
> > [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> >     6.59 Built-in Functions for Memory Model Aware Atomic Operations