> On 8/1/24 7:20 AM, Jose E. Marchesi wrote:
>>>> On 7/29/24 6:28 PM, Alexei Starovoitov wrote:
>>>>> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
>>>>>> Hi list!
>>>>>>
>>>>>> As we are looking at running sched_ext-style BPF scheduling on architectures
>>>>>> with a more relaxed memory model (i.e. ARM), we would like to:
>>>>>>
>>>>>>   1. have fine-grained control over memory ordering in BPF (instead of
>>>>>>      defaulting to a full barrier), for performance reasons
>>>>>>   2. pay closer attention to whether memory barriers are being used
>>>>>>      correctly in BPF
>>>>>>
>>>>>> To that end, our main goal here is to support more types of memory barriers in
>>>>>> BPF. While Paul E. McKenney et al. are working on the formalized BPF memory
>>>>>> model [1], Paul agreed that it makes sense to support some basic types first.
>>>>>> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
>>>>>> related to memory ordering, which will be described in detail below.
>>>>>>
>>>>>> I. We need more types of BPF memory barriers
>>>>>> --------------------------------------------
>>>>>>
>>>>>> Currently, when it comes to BPF memory barriers, our choices are effectively
>>>>>> limited to:
>>>>>>
>>>>>>   * compiler barrier: 'asm volatile ("" ::: "memory");'
>>>>>>   * full memory barriers implied by compiler built-ins like
>>>>>>     __sync_val_compare_and_swap()
>>>>>>
>>>>>> We need more. During offline discussion with Paul, we agreed we can start
>>>>>> from:
>>>>>>
>>>>>>   * load-acquire:  __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>>>>>>   * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
>>>>>
>>>>> We would need an inline asm equivalent too, similar to the kernel's
>>>>> smp_load_acquire() macro.
>>>>>
>>>>>> Theoretically, the BPF JIT compiler could also reorder instructions just like
>>>>>> Clang or GCC, though it might not currently do so. If we ever developed a more
>>>>>> optimizing BPF JIT compiler, it would also be nice to have an optimization
>>>>>> barrier for it. However, Alexei Starovoitov has expressed that defining a BPF
>>>>>> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
>>>>>
>>>>> It can be a standalone insn that is a compiler barrier only, but that feels
>>>>> like a waste of an instruction. So depending on how we end up encoding the
>>>>> various real barriers, there may be a bit to spend in such a barrier insn
>>>>> that is only a compiler barrier; in this case, an optimizing-JIT barrier.
>>>>>
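
For concreteness, here is a minimal source-level sketch of the message-passing
pattern that the proposed load-acquire/store-release instructions are meant to
support. The bpf_smp_load_acquire()/bpf_smp_store_release() wrapper names are
hypothetical (chosen to mirror the kernel macros mentioned above), and the BPF
instructions they would map onto did not yet exist at the time of this
exchange:

  #include <stdint.h>

  /* Hypothetical wrappers, modeled on the kernel's smp_load_acquire() and
   * smp_store_release(); these are not existing BPF macros. */
  #define bpf_smp_load_acquire(p)      __atomic_load_n((p), __ATOMIC_ACQUIRE)
  #define bpf_smp_store_release(p, v)  __atomic_store_n((p), (v), __ATOMIC_RELEASE)

  int64_t data, flag;

  void producer(void)
  {
          data = 42;                        /* plain store                  */
          bpf_smp_store_release(&flag, 1);  /* publish with release order   */
  }

  int64_t consumer(void)
  {
          if (bpf_smp_load_acquire(&flag))  /* observe with acquire order   */
                  return data;              /* guaranteed to see data == 42 */
          return -1;
  }
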
>>>>>> II. Implicit barriers can get confusing
>>>>>> ---------------------------------------
>>>>>>
>>>>>> We noticed that, somewhat surprisingly, the __sync_*fetch*() built-ins do not
>>>>>> always imply a full barrier for BPF on ARM. For example, when using LLVM, the
>>>>>> frequently-used __sync_fetch_and_add() can imply either "relaxed" (no barrier)
>>>>>> or "acquire and release" (full barrier) semantics, depending on whether its
>>>>>> return value is used:
>>>>>>
>>>>>> Case (a): return value is used
>>>>>>
>>>>>>   SEC("...")
>>>>>>   int64_t foo;
>>>>>>
>>>>>>   int64_t func(...) {
>>>>>>           return __sync_fetch_and_add(&foo, 1);
>>>>>>   }
>>>>>>
>>>>>> For case (a), Clang gave us:
>>>>>>
>>>>>>   3: db 01 00 00 01 00 00 00  r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>>>>>>
>>>>>>   opcode (0xdb):    BPF_STX | BPF_ATOMIC | BPF_DW
>>>>>>   imm (0x00000001): BPF_ADD | BPF_FETCH
>>>>>>
>>>>>> Case (b): return value is ignored
>>>>>>
>>>>>>   SEC("...")
>>>>>>   int64_t foo;
>>>>>>
>>>>>>   int64_t func(...) {
>>>>>>           __sync_fetch_and_add(&foo, 1);
>>>>>>
>>>>>>           return foo;
>>>>>>   }
>>>>>>
>>>>>> For case (b), Clang gave us:
>>>>>>
>>>>>>   3: db 12 00 00 00 00 00 00  lock *(u64 *)(r2 + 0x0) += r1
>>>>>>
>>>>>>   opcode (0xdb):    BPF_STX | BPF_ATOMIC | BPF_DW
>>>>>>   imm (0x00000000): BPF_ADD
>>>>>>
>>>>>> LLVM decided to drop BPF_FETCH, since the return value of
>>>>>> __sync_fetch_and_add() is being ignored [2]. Now, if we take a look at
>>>>>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
>>>>>> atomic instructions are being used):
>>>>>>
>>>>>>   case BPF_ADD:
>>>>>>           emit(A64_STADD(isdw, reg, src), ctx);
>>>>>>           break;
>>>>>>   <...>
>>>>>>   case BPF_ADD | BPF_FETCH:
>>>>>>           emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>>>>>>           break;
>>>>>>
>>>>>> STADD is an alias for LDADD. According to [3]:
>>>>>>
>>>>>>   * LDADDAL, used for case (a), has "acquire" plus "release" semantics
>>>>>>   * LDADD, used for case (b), "has neither acquire nor release semantics"
>>>>>>
>>>>>> This is pretty non-intuitive; a compiler built-in should not have inconsistent
>>>>>> implications for memory ordering, and it is better not to require all BPF
>>>>>> programmers to memorize this.
>>>>>>
>>>>>> GCC seems a bit ambiguous [4] on whether the __sync_*fetch*() built-ins should
>>>>>> always imply a full barrier. GCC considers these __sync_*() built-ins "legacy",
>>>>>> and introduced a new set of __atomic_*() built-ins ("Memory Model Aware Atomic
>>>>>> Operations") [5] to replace them. These __atomic_*() built-ins are designed to
>>>>>> be much more explicit about memory ordering, for example:
>>>>>>
>>>>>>   type __atomic_fetch_add (type *ptr, type val, int memorder)
>>>>>>
>>>>>> This requires the programmer to specify a memory order (relaxed, acquire,
>>>>>> release...) via the "memorder" parameter. Currently in LLVM, for BPF, those
>>>>>> __atomic_*fetch*() built-ins seem to be aliases for their __sync_*fetch*()
>>>>>> counterparts (the "memorder" parameter seems effectively ignored), and are not
>>>>>> fully supported.
>>>>>
>>>>> This sounds like a compiler bug.
>>>>>
>>>>> Yonghong, Jose,
>>>>> do you know what compilers do for other backends?
>>>>> Is it allowed to convert sync_fetch_add into sync_add when the fetch part is
>>>>> unused?
>>>>
>>>> This behavior was introduced by the following LLVM commit:
>>>> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>>>
>>>> Specifically, the following commit message:
>>>>
>>>>   =======
>>>>   Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
>>>>   instructions are added for atomic operations which do not
>>>>   have return values. LLVM will check the return value for
>>>>   __sync_fetch_and_{add,and,or,xor}.
>>>>   If the return value is used, instructions atomic_fetch_<op>
>>>>   will be used. Otherwise, atomic_<op> instructions will be used.
>>>>   ======
>>>>
>>>> Basically, if the return value is not used, __sync_fetch_and_add() will use
>>>> the xadd insn. That decision was made at the time to maintain backward
>>>> compatibility. For one example, in bcc
>>>> https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
>>>> we have
>>>>
>>>>   #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>>>>
>>>> Should we use atomic_fetch_*() always, regardless of whether the return
>>>> value is used or not? Probably, it should still work. Not sure what gcc
>>>> does for this case.
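
To make the contrast concrete, a small sketch of the two source-level spellings
discussed above (illustrative only; as noted, LLVM's BPF backend effectively
ignored the memorder argument at the time of this exchange):

  #include <stdint.h>

  int64_t counter;

  /* Legacy built-in, as used by bcc's lock_xadd(): documented as a full
   * barrier, but on BPF the generated instruction (and, on arm64, its
   * ordering) has depended on whether the return value is used. */
  void inc_legacy(void)
  {
          (void)__sync_fetch_and_add(&counter, 1);
  }

  /* "Memory model aware" built-in: the ordering is spelled out by the
   * programmer instead of being implied.  __ATOMIC_RELAXED asks for no
   * ordering; __ATOMIC_SEQ_CST would ask for a full barrier. */
  int64_t inc_explicit(void)
  {
          return __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
  }
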
>>> GCC behaves similarly.
>>>
>>> For program A:
>>>
>>>   long foo;
>>>   long func () {
>>>     return __sync_fetch_and_add(&foo, 1);
>>>   }
>>>
>>> bpf-unknown-none-gcc -O2 compiles to:
>>>
>>>   0000000000000000 <func>:
>>>      0:  18 00 00 00 00 00 00 00  r0=0 ll
>>>      8:  00 00 00 00 00 00 00 00
>>>     10:  b7 01 00 00 01 00 00 00  r1=1
>>>     18:  db 10 00 00 01 00 00 00  r1=atomic_fetch_add((u64*)(r0+0),r1)
>>>     20:  bf 10 00 00 00 00 00 00  r0=r1
>>>     28:  95 00 00 00 00 00 00 00  exit
>>>
>>> And for program B:
>>>
>>>   long foo;
>>>   long func () {
>>>     __sync_fetch_and_add(&foo, 1);
>>>     return foo;
>>>   }
>>>
>>> bpf-unknown-none-gcc -O2 compiles to:
>>>
>>>   0000000000000000 <func>:
>>>      0:  18 00 00 00 00 00 00 00  r0=0 ll
>>>      8:  00 00 00 00 00 00 00 00
>>>     10:  b7 01 00 00 01 00 00 00  r1=1
>>>     18:  db 10 00 00 00 00 00 00  lock *(u64*)(r0+0)+=r1
>>>     20:  79 00 00 00 00 00 00 00  r0=*(u64*)(r0+0)
>>>     28:  95 00 00 00 00 00 00 00  exit
>>>
>>> Internally:
>>>
>>> - When compiling program A, GCC decides to emit an
>>>   `atomic_fetch_addDI' insn, documented as:
>>>
>>>     'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
>>>     'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
>>>     'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
>>>
>>>     These patterns emit code for an atomic operation on memory with
>>>     memory model semantics, and return the original value. Operand 0
>>>     is an output operand which contains the value of the memory
>>>     location before the operation was performed. Operand 1 is the
>>>     memory on which the atomic operation is performed. Operand 2 is
>>>     the second operand to the binary operator. Operand 3 is the memory
>>>     model to be used by the operation.
>>>
>>>   The BPF backend defines atomic_fetch_add for DI modes (long) to expand
>>>   to this BPF instruction:
>>>
>>>     %w0 = atomic_fetch_add((<smop> *)%1, %w0)
>>>
>>> - When compiling program B, GCC decides to emit an `atomic_addDI'
>>>   insn, documented as:
>>>
>>>     'atomic_addMODE', 'atomic_subMODE'
>>>     'atomic_orMODE', 'atomic_andMODE'
>>>     'atomic_xorMODE', 'atomic_nandMODE'
>>>
>>>     These patterns emit code for an atomic operation on memory with
>>>     memory model semantics. Operand 0 is the memory on which the
>>>     atomic operation is performed. Operand 1 is the second operand to
>>>     the binary operator. Operand 2 is the memory model to be used by
>>>     the operation.
>>>
>>>   The BPF backend defines atomic_add for DI modes (long) to expand
>>>   to this BPF instruction:
>>>
>>>     lock *(<smop> *)%w0 += %w1
>>>
>>> This is done for all targets. On x86-64, for example, case A compiles
>>> to:
>>>
>>>   0000000000000000 <func>:
>>>      0:  b8 01 00 00 00          mov    $0x1,%eax
>>>      5:  f0 48 0f c1 05 00 00    lock xadd %rax,0x0(%rip)   # e <func+0xe>
>>>      c:  00 00
>>>      e:  c3                      retq
>>>
>>> And case B compiles to:
>>>
>>>   0000000000000000 <func>:
>>>      0:  f0 48 83 05 00 00 00    lock addq $0x1,0x0(%rip)   # 9 <func+0x9>
>>>      7:  00 01
>>>      9:  48 8b 05 00 00 00 00    mov    0x0(%rip),%rax      # 10 <func+0x10>
>>>     10:  c3                      retq
>>>
>>> Why wouldn't the compiler be allowed to optimize from atomic_fetch_add
>>> to atomic_add in this case?
>>
>> Ok, I see. The generic compiler optimization is ok. It is the backend
>> that is buggy, because it emits BPF instruction sequences with different
>> memory ordering semantics for atomic_OP and atomic_fetch_OP.
>>
>> The only difference between the fetching and non-fetching builtins is that
>> in one case the original value is returned, in the other the new value.
>> Other than that they should be equivalent.
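
To illustrate that equivalence at the source level (a sketch, not taken from
the programs above): when the result is discarded, a fetching and a
non-fetching atomic add are indistinguishable to the program, which is why the
middle end is free to pick either form; the problem was only that the BPF
backends attached different memory ordering to the two forms.

  long v;

  /* __sync_fetch_and_add() returns the old value and __sync_add_and_fetch()
   * the new one, but the atomic increment itself is identical.  With the
   * result cast to void, nothing observable distinguishes them, so the
   * compiler may lower either one to a non-fetching atomic add. */
  void inc1(void) { (void)__sync_fetch_and_add(&v, 1); }
  void inc2(void) { (void)__sync_add_and_fetch(&v, 1); }

  /* Only when the result is used does the old (or new) value have to be
   * brought back from memory, forcing the fetching form. */
  long old_value(void) { return __sync_fetch_and_add(&v, 1); }
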
>>
>> For ARM64, GCC generates this for case A:
>>
>>   0000000000000000 <func>:
>>      0:  90000001   adrp    x1, 0 <func>
>>      4:  d2800020   mov     x0, #0x1        // #1
>>      8:  91000021   add     x1, x1, #0x0
>>      c:  f8e00020   ldaddal x0, x0, [x1]
>>     10:  d65f03c0   ret
>>
>> And this for case B:
>>
>>   0000000000000000 <func>:
>>      0:  90000000   adrp    x0, 0 <func>
>>      4:  d2800022   mov     x2, #0x1        // #1
>>      8:  91000001   add     x1, x0, #0x0
>>      c:  f8e20021   ldaddal x2, x1, [x1]
>>     10:  f9400000   ldr     x0, [x0]
>>     14:  d65f03c0   ret
>>
>> i.e. GCC emits LDADDAL for both the atomic_add and atomic_fetch_add internal
>> insns. As on x86-64, both sequences have the same memory ordering semantics.
>>
>> Alright, we are changing GCC to always emit the fetch versions of the
>> sequences for all the supported atomic operations: add, and, or, xor. After
>> the change, the `lock' versions of the instructions will not be generated by
>> the compiler at all; they can only come from inline asm.
>>
>> Will send a heads-up when done.
>
> Thanks! https://github.com/llvm/llvm-project/pull/101428
> is the change on the LLVM side.

This is now fixed in GCC upstream as well:
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659454.html

Salud!