On 8/1/24 7:20 AM, Jose E. Marchesi wrote:
On 7/29/24 6:28 PM, Alexei Starovoitov wrote:

On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:

Hi list!

As we are looking at running sched_ext-style BPF scheduling on architectures with a more relaxed memory model (i.e., ARM), we would like to:

  1. have fine-grained control over memory ordering in BPF (instead of
     defaulting to a full barrier), for performance reasons
  2. pay closer attention to whether memory barriers are being used
     correctly in BPF

To that end, our main goal here is to support more types of memory barriers in BPF. While Paul E. McKenney et al. are working on a formalized BPF memory model [1], Paul agreed that it makes sense to support some basic types first. Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins related to memory ordering, which will be described in detail below.

I. We need more types of BPF memory barriers
--------------------------------------------

Currently, when it comes to BPF memory barriers, our choices are effectively limited to:

  * compiler barrier: 'asm volatile ("" ::: "memory");'
  * full memory barriers implied by compiler built-ins like
    __sync_val_compare_and_swap()

We need more. During offline discussion with Paul, we agreed we can start from:

  * load-acquire:  __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
  * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);

We would need an inline asm equivalent too, similar to the kernel's smp_load_acquire() macro.

Theoretically, the BPF JIT compiler could also reorder instructions just like Clang or GCC, though it might not currently do so. If we ever developed a more optimizing BPF JIT compiler, it would also be nice to have an optimization barrier for it. However, Alexei Starovoitov has expressed that defining a BPF instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.

It could be a standalone insn that is a compiler barrier only, but that feels like a waste of an instruction. So, depending on how we end up encoding the various real barriers, there may be a bit to spend in such a barrier insn that is only a compiler barrier; in this case, an optimizing-JIT barrier.
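As a concrete (hypothetical) illustration of the kind of pattern this is for, here is a minimal message-passing sketch using the proposed built-ins. The variable and function names are made up for the example, and how these built-ins would actually be lowered to BPF instructions is exactly what is being discussed:

  long data;
  long ready;

  /* Producer: publish data, then set the flag with release ordering, so
   * the data store cannot be reordered after the flag store. */
  void produce(void)
  {
          data = 42;
          __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
  }

  /* Consumer: read the flag with acquire ordering, so the data load
   * cannot be reordered before the flag load. No full barrier needed. */
  long consume(void)
  {
          if (__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
                  return data;
          return -1;
  }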
II. Implicit barriers can get confusing
---------------------------------------

We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not always imply a full barrier for BPF on ARM. For example, when using LLVM, the frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier) or "acquire and release" (full barrier) semantics, depending on whether its return value is used:

Case (a): return value is used

  SEC("...")
  int64_t foo;
  int64_t func(...)
  {
          return __sync_fetch_and_add(&foo, 1);
  }

For case (a), Clang gave us:

  3: db 01 00 00 01 00 00 00  r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)

     opcode (0xdb):     BPF_STX | BPF_ATOMIC | BPF_DW
     imm (0x00000001):  BPF_ADD | BPF_FETCH

Case (b): return value is ignored

  SEC("...")
  int64_t foo;
  int64_t func(...)
  {
          __sync_fetch_and_add(&foo, 1);
          return foo;
  }

For case (b), Clang gave us:

  3: db 12 00 00 00 00 00 00  lock *(u64 *)(r2 + 0x0) += r1

     opcode (0xdb):     BPF_STX | BPF_ATOMIC | BPF_DW
     imm (0x00000000):  BPF_ADD

LLVM decided to drop BPF_FETCH, since the return value of __sync_fetch_and_add() is being ignored [2].

Now, if we take a look at emit_lse_atomic() in the BPF JIT compiler code for ARM64 (assuming LSE atomic instructions are being used):

  case BPF_ADD:
          emit(A64_STADD(isdw, reg, src), ctx);
          break;
  <...>
  case BPF_ADD | BPF_FETCH:
          emit(A64_LDADDAL(isdw, src, reg, src), ctx);
          break;

STADD is an alias for LDADD. According to [3]:

  * LDADDAL for case (a) has "acquire" plus "release" semantics
  * LDADD for case (b) "has neither acquire nor release semantics"

This is pretty non-intuitive; a compiler built-in should not have inconsistent implications for memory ordering, and it is better not to require all BPF programmers to memorize this.

GCC seems a bit ambiguous [4] on whether the __sync_*fetch*() built-ins should always imply a full barrier. GCC considers these __sync_*() built-ins "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model Aware Atomic Operations") [5] to replace them. These __atomic_*() built-ins are designed to be a lot more explicit about memory ordering, for example:

  type __atomic_fetch_add (type *ptr, type val, int memorder)

This requires the programmer to specify a memory order type (relaxed, acquire, release, ...) via the "memorder" parameter. Currently in LLVM, for BPF, those __atomic_*fetch*() built-ins seem to be aliases of their __sync_*fetch*() counterparts (the "memorder" parameter seems effectively ignored), and are not fully supported.
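For comparison, a minimal sketch of what explicit ordering would look like with these built-ins once they are fully supported for BPF; the function names are illustrative only, and, as noted above, today the memorder argument is effectively ignored by the BPF backends:

  long counter;

  /* Relaxed: an atomic RMW, but with no ordering implied. */
  long add_relaxed(void)
  {
          return __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
  }

  /* Acquire plus release: the ordering that case (a) above currently
   * gets implicitly, but requested explicitly. */
  long add_acq_rel(void)
  {
          return __atomic_fetch_add(&counter, 1, __ATOMIC_ACQ_REL);
  }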
This sounds like a compiler bug. Yonghong, Jose, do you know what compilers do for other backends? Is it allowed to convert sync_fetch_add into sync_add when the fetch part is unused?

This behavior was introduced by the following llvm commit:

  https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744

Specifically the following commit message:

=======
Similar to xadd, atomic xadd, xor and xxor (atomic_<op>) instructions are added for atomic operations which do not have return values. LLVM will check the return value for __sync_fetch_and_{add,and,or,xor}. If the return value is used, instructions atomic_fetch_<op> will be used. Otherwise, atomic_<op> instructions will be used.
=======

Basically, if the return value is not used, __sync_fetch_and_add() will use the xadd insn. The decision was made at that time to maintain backward compatibility. For example, in bcc

  https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444

we have

  #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))

Should we always use atomic_fetch_*(), regardless of whether the return value is used or not? Probably; it should still work. Not sure what gcc does for this case.

GCC behaves similarly. For program A:

  long foo;
  long func () { return __sync_fetch_and_add(&foo, 1); }

bpf-unknown-none-gcc -O2 compiles to:

  0000000000000000 <func>:
     0: 18 00 00 00 00 00 00 00  r0=0 ll
     8: 00 00 00 00 00 00 00 00
    10: b7 01 00 00 01 00 00 00  r1=1
    18: db 10 00 00 01 00 00 00  r1=atomic_fetch_add((u64*)(r0+0),r1)
    20: bf 10 00 00 00 00 00 00  r0=r1
    28: 95 00 00 00 00 00 00 00  exit

And for program B:

  long foo;
  long func () { __sync_fetch_and_add(&foo, 1); return foo; }

bpf-unknown-none-gcc -O2 compiles to:

  0000000000000000 <func>:
     0: 18 00 00 00 00 00 00 00  r0=0 ll
     8: 00 00 00 00 00 00 00 00
    10: b7 01 00 00 01 00 00 00  r1=1
    18: db 10 00 00 00 00 00 00  lock *(u64*)(r0+0)+=r1
    20: 79 00 00 00 00 00 00 00  r0=*(u64*)(r0+0)
    28: 95 00 00 00 00 00 00 00  exit

Internally:

- When compiling program A, GCC decides to emit an `atomic_fetch_addDI' insn, documented as:

    'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
    'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
    'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'

    These patterns emit code for an atomic operation on memory with memory
    model semantics, and return the original value.  Operand 0 is an output
    operand which contains the value of the memory location before the
    operation was performed.  Operand 1 is the memory on which the atomic
    operation is performed.  Operand 2 is the second operand to the binary
    operator.  Operand 3 is the memory model to be used by the operation.

  The BPF backend defines atomic_fetch_add for DI modes (long) to expand to this BPF instruction:

    %w0 = atomic_fetch_add((<smop> *)%1, %w0)

- When compiling program B, GCC decides to emit an `atomic_addDI' insn, documented as:

    'atomic_addMODE', 'atomic_subMODE'
    'atomic_orMODE', 'atomic_andMODE'
    'atomic_xorMODE', 'atomic_nandMODE'

    These patterns emit code for an atomic operation on memory with memory
    model semantics.  Operand 0 is the memory on which the atomic operation
    is performed.  Operand 1 is the second operand to the binary operator.
    Operand 2 is the memory model to be used by the operation.

  The BPF backend defines atomic_add for DI modes (long) to expand to this BPF instruction:

    lock *(<smop> *)%w0 += %w1

This is done for all targets. On x86-64, for example, case A compiles to:

  0000000000000000 <func>:
     0: b8 01 00 00 00           mov    $0x1,%eax
     5: f0 48 0f c1 05 00 00     lock xadd %rax,0x0(%rip)   # e <func+0xe>
     c: 00 00
     e: c3                       retq

And case B compiles to:

  0000000000000000 <func>:
     0: f0 48 83 05 00 00 00     lock addq $0x1,0x0(%rip)   # 9 <func+0x9>
     7: 00 01
     9: 48 8b 05 00 00 00 00     mov    0x0(%rip),%rax      # 10 <func+0x10>
    10: c3                       retq

Why wouldn't the compiler be allowed to optimize from atomic_fetch_add to atomic_add in this case?

Ok, I see. The generic compiler optimization is ok. It is the backend that is buggy, because it emits BPF instruction sequences with different memory ordering semantics for atomic_OP and atomic_fetch_OP.

The only difference between the fetching and non-fetching builtins is that in one case the original value is returned, in the other the new value. Other than that, they should be equivalent.

For ARM64, GCC generates this for case A:

  0000000000000000 <func>:
     0: 90000001   adrp    x1, 0 <func>
     4: d2800020   mov     x0, #0x1              // #1
     8: 91000021   add     x1, x1, #0x0
     c: f8e00020   ldaddal x0, x0, [x1]
    10: d65f03c0   ret

And this for case B:

  0000000000000000 <func>:
     0: 90000000   adrp    x0, 0 <func>
     4: d2800022   mov     x2, #0x1              // #1
     8: 91000001   add     x1, x0, #0x0
     c: f8e20021   ldaddal x2, x1, [x1]
    10: f9400000   ldr     x0, [x0]
    14: d65f03c0   ret

i.e. GCC emits LDADDAL for both the atomic_add and atomic_fetch_add internal insns. As on x86-64, both sequences have the same memory ordering semantics.

Alright, we are changing GCC to always emit the fetch versions of the sequences for all the supported atomic operations: add, and, or, xor. After the change, the `lock' versions of the instructions will not be generated by the compiler at all, except from inline asm. Will send a headsup when done.
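To make the outcome concrete, here is a small sketch (function names made up) of the two usage patterns from cases (a) and (b). With the GCC change described above, and assuming the LLVM side follows suit, both forms would be expected to lower to the fetch encoding (BPF_ADD | BPF_FETCH), which the arm64 JIT emits as LDADDAL (acquire plus release), regardless of whether the return value is used:

  long counter;

  /* Return value used: already lowers to atomic_fetch_add / BPF_FETCH. */
  long add_and_use(void)
  {
          return __sync_fetch_and_add(&counter, 1);
  }

  /* Return value discarded: previously lowered to the non-fetch form
   * (plain LDADD on arm64 LSE, i.e. no ordering); with the change it
   * would use the fetch form as well. */
  void add_and_discard(void)
  {
          (void)__sync_fetch_and_add(&counter, 1);
  }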
Thanks! https://github.com/llvm/llvm-project/pull/101428 is the change on the llvm side.

[...]