> On 8/5/24 10:53 AM, Alexei Starovoitov wrote:
>> On Fri, Aug 2, 2024 at 7:59 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>>> Peilin Ye reported an issue ([1]) where for __sync_fetch_and_add(...) without
>>> return value like
>>>   __sync_fetch_and_add(&foo, 1);
>>> the llvm BPF backend generates a locked insn, e.g.
>>>   lock *(u32 *)(r1 + 0) += r2
>>>
>>> If __sync_fetch_and_add(...) returns a value like
>>>   res = __sync_fetch_and_add(&foo, 1);
>>> the llvm BPF backend generates
>>>   r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
>>>
>>> But 'lock *(u32 *)(r1 + 0) += r2' caused a problem in the jit
>>> since the proper barrier is not inserted based on __sync_fetch_and_add() semantics.
>>>
>>> The above discrepancy is due to commit [2], which tries to maintain backward
>>> compatibility since, before commit [2], __sync_fetch_and_add(...) generated a
>>> lock insn in the BPF backend.
>>>
>>> Based on the discussion in [1], now is the time to fix the above discrepancy so we can
>>> have proper barrier support in the jit. llvm patch [3] made sure that __sync_fetch_and_add(...)
>>> always generates atomic_fetch_add(...) insns. Now 'lock *(u32 *)(r1 + 0) += r2' can only
>>> be generated by inline asm. The same holds for __sync_fetch_and_and(), __sync_fetch_and_or()
>>> and __sync_fetch_and_xor().
>>>
>>> But the change in [3] caused an arena_atomics selftest failure:
>>>
>>>   test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
>>>   libbpf: prog 'and': BPF program load failed: Permission denied
>>>   libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
>>>   arg#0 reference type('UNKNOWN ') size cannot be determined: -22
>>>   0: R1=ctx() R10=fp0
>>>   ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
>>>   0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
>>>   2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
>>>   3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
>>>   4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
>>>   5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
>>>   ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
>>>   6: (18) r1 = 0x100000000060           ; R1_w=scalar()
>>>   8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
>>>   9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
>>>   11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
>>>   BPF_ATOMIC stores into R1 arena is not allowed
>>>   processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
>>>   -- END PROG LOAD LOG --
>>>   libbpf: prog 'and': failed to load: -13
>>>   libbpf: failed to load object 'arena_atomics'
>>>   libbpf: failed to load BPF skeleton 'arena_atomics': -13
>>>   test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
>>>   #3 arena_atomics:FAIL
>>>
>>> The reason for the failure is [4], which does not allow atomic{64,}_fetch_{and,or,xor}()
>>> on arena addresses. Without llvm patch [3], the compiler will generate a 'lock ...'
>>> insn and everything will work fine.
>>>
>>> This patch fixes the problem by using inline asm. Instead of the
>>> __sync_fetch_and_{and,or,xor}() functions, inline asm with a 'lock' insn is used,
>>> and it will work with or without [3].
>>> Note that the three bpf programs ('and', 'or' and 'xor') are guarded with __BPF_FEATURE_ADDR_SPACE_CAST
>>> as well, to ensure a compilation failure for llvm <= 18. Note that for llvm <= 18, where
>>> addr_space_cast is not supported, all arena_atomics subtests are skipped with the below message:
>>>   test_arena_atomics:SKIP:no ENABLE_ATOMICS_TESTS or no addr_space_cast support in clang
>>>   #3 arena_atomics:SKIP
>>>
>>> [1] https://lore.kernel.org/bpf/ZqqiQQWRnz7H93Hc@xxxxxxxxxx/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
>>> [2] https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>> [3] https://github.com/llvm/llvm-project/pull/101428
>>> [4] d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
>>>
>>> Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
>>> ---
>>>  .../selftests/bpf/progs/arena_atomics.c      | 63 ++++++++++++++++---
>>>  1 file changed, 54 insertions(+), 9 deletions(-)
>>>
>>> Changelog:
>>>   v1 -> v2:
>>>     - Add __BPF_FEATURE_ADDR_SPACE_CAST to guard newly added asm code for llvm >= 19
>>>
>>> diff --git a/tools/testing/selftests/bpf/progs/arena_atomics.c b/tools/testing/selftests/bpf/progs/arena_atomics.c
>>> index bb0acd79d28a..dea54557fc00 100644
>>> --- a/tools/testing/selftests/bpf/progs/arena_atomics.c
>>> +++ b/tools/testing/selftests/bpf/progs/arena_atomics.c
>>> @@ -5,6 +5,7 @@
>>>  #include <bpf/bpf_tracing.h>
>>>  #include <stdbool.h>
>>>  #include "bpf_arena_common.h"
>>> +#include "bpf_misc.h"
>>>
>>>  struct {
>>>         __uint(type, BPF_MAP_TYPE_ARENA);
>>> @@ -85,10 +86,24 @@ int and(const void *ctx)
>>>  {
>>>         if (pid != (bpf_get_current_pid_tgid() >> 32))
>>>                 return 0;
>>> -#ifdef ENABLE_ATOMICS_TESTS
>>> +#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
>>>
>>> -       __sync_fetch_and_and(&and64_value, 0x011ull << 32);
>>> -       __sync_fetch_and_and(&and32_value, 0x011);
>>> +       asm volatile(
>>> +               "r1 = addr_space_cast(%[and64_value], 0, 1);"
>>> +               "lock *(u64 *)(r1 + 0) &= %[val]"
>>> +               :
>>> +               : __imm_ptr(and64_value),
>>> +                 [val]"r"(0x011ull << 32)
>>> +               : "r1"
>>> +       );
>>> +       asm volatile(
>>> +               "r1 = addr_space_cast(%[and32_value], 0, 1);"
>>> +               "lock *(u32 *)(r1 + 0) &= %[val]"
>>> +               :
>>> +               : __imm_ptr(and32_value),
>>> +                 [val]"w"(0x011)
>>> +               : "r1"
>>> +       );
>>
>> Instead of inline asm there is a better way to do the same in C.
>> https://godbolt.org/z/71PYx1oqE
>>
>> void foo(int a, _Atomic int *b)
>> {
>>     *b += a;
>> }
>>
>> generates:
>> lock *(u32 *)(r2 + 0) += r1
>
> If you use the latest llvm-project with
>   https://github.com/llvm/llvm-project/pull/101428
> included, the above code will generate something like
>
>   $ clang --target=bpf -O2 -c t.c && llvm-objdump -d t.o
>   t.o:    file format elf64-bpf
>   Disassembly of section .text:
>   0000000000000000 <foo>:
>        0:       c3 12 00 00 01 00 00 00 r1 = atomic_fetch_add((u32 *)(r2 + 0x0), r1)
>        1:       95 00 00 00 00 00 00 00 exit
>
> With -mcpu=v3 the same code can be generated.

Same for current GCC.
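
For comparison, a small made-up example (bar()/baz() below are not from the
patch) of the two source-level forms under discussion; with a new enough LLVM
(or GCC) both should be turned into the same atomicrmw IR and hence the same
atomic_fetch_add() BPF instruction:

  /* Legacy __sync form: used to emit 'lock *(u32 *)(rX + 0) += rY' when the
   * result was unused, now emits atomic_fetch_add().
   */
  void bar(unsigned *p)
  {
          __sync_fetch_and_add(p, 1);
  }

  /* _Atomic form suggested above: also an atomicrmw add in IR, so it emits
   * the same atomic_fetch_add() instruction.
   */
  void baz(_Atomic unsigned *p)
  {
          *p += 1;
  }
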
>>
>> but
>>     *b &= a;
>>
>> crashes llvm :( with
>>
>> <source>:3:5: error: unsupported atomic operation, please use 64 bit version
>>     3 |     *b &= a;
>
> It failed with the following llvm error message:
>
>   t.c:1:6: error: unsupported atomic operation, please use 64 bit version
>       1 | void foo(int a, _Atomic int *b)
>         |      ^
>   fatal error: error in backend: Cannot select: t8: i64,ch = AtomicLoadAnd<(load store seq_cst (s32) on %ir.b)> t0, t4, t2
>     t4: i64,ch = CopyFromReg t0, Register:i64 %1
>       t3: i64 = Register %1
>     t2: i64,ch = CopyFromReg t0, Register:i64 %0
>       t1: i64 = Register %0
>   In function: foo
>
>>
>> but works with -mcpu=v3
>
> Yes, it does work with -mcpu=v3:

In GCC, if you specify -mcpu=v2 and use atomic built-ins, you get workaround
code in the form of libcalls (like calls to __atomic_fetch_and_4), which are
of course of no use in BPF atm.

>   $ clang --target=bpf -O2 -mcpu=v3 -c t.c && llvm-objdump -d --mcpu=v3 t.o
>
>   t.o:    file format elf64-bpf
>   Disassembly of section .text:
>   0000000000000000 <foo>:
>        0:       c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
>        1:       95 00 00 00 00 00 00 00 exit
>
> NOTE: I need -mcpu=v3 for llvm-objdump to print the asm code 'atomic_fetch_and' properly.
> Will double check this.

The GNU binutils objdump will try to recognize instructions in the latest
supported cpu version, unless an explicit option is passed to specify a
particular ISA version:

  $ bpf-unknown-none-objdump -M pseudoc -d foo.o

  foo.o:     file format elf64-bpfle

  Disassembly of section .text:

  0000000000000000 <foo>:
     0:   bf 11 20 00 00 00 00 00         r1 = (s32) r1
     8:   c3 12 00 00 51 00 00 00         w1 = atomic_fetch_and((u32 *)(r2+0), w1)
    10:   95 00 00 00 00 00 00 00         exit

  $ bpf-unknown-none-objdump -M v2,pseudoc -d foo.o

  foo.o:     file format elf64-bpfle

  Disassembly of section .text:

  0000000000000000 <foo>:
     0:   bf 11 20 00 00 00 00 00         r1 = r1
     8:   c3 12 00 00 51 00 00 00         <unknown>
    10:   95 00 00 00 00 00 00 00         exit

> For the code:
>
>   void foo(int a, _Atomic int *b)
>   {
>       *b &= a;
>   }
>
> The initial IR generated by the clang frontend is:
>
>   define dso_local void @foo(i32 noundef %a, ptr noundef %b) #0 {
>   entry:
>     %a.addr = alloca i32, align 4
>     %b.addr = alloca ptr, align 8
>     store i32 %a, ptr %a.addr, align 4, !tbaa !3
>     store ptr %b, ptr %b.addr, align 8, !tbaa !7
>     %0 = load i32, ptr %a.addr, align 4, !tbaa !3
>     %1 = load ptr, ptr %b.addr, align 8, !tbaa !7
>     %2 = atomicrmw and ptr %1, i32 %0 seq_cst, align 4
>     %3 = and i32 %2, %0
>     ret void
>   }
>
> Note the atomicrmw in the above. Eventually it is optimized to
>
>   define dso_local void @foo(i32 noundef %a, ptr noundef %b) #0 {
>   entry:
>     %0 = atomicrmw and ptr %b, i32 %a seq_cst, align 4
>     ret void
>   }
>
> The 'atomicrmw' is the same IR as generated by __sync_fetch_and_*()
> and eventually will generate atomic_fetch_*() bpf insns.
>
> I discussed this with Andrii, and another option is to specify relaxed
> consistency, so llvm internals could translate it into a locked insn.
> For example,
>
>   $ cat t1.c
>   #include <stdatomic.h>
>
>   void f(_Atomic int *i) {
>       __c11_atomic_fetch_and(i, 1, memory_order_relaxed);
>   }

I think this makes sense. Currently we removed the GCC insns
atomic_{add,or,xor,and} altogether, so the compiler is forced to implement
them in terms of atomic_fetch_{add,or,xor,and} regardless of the memory
ordering policy specified to the __sync_fetch_and_OP. So even if you specify
relaxed memory ordering, the resulting ordering is whatever is implied by the
fetching operation.
> # to have gnu/stubs-32.h in the current directory to make it compile
> [yhs@devvm1513.prn0 ~/tmp6]$ ls gnu
> stubs-32.h
> [yhs@devvm1513.prn0 ~/tmp6]$ clang --target=bpf -O2 -I. -c -mcpu=v3 t1.c
>
> The initial IR:
>
>   define dso_local void @f(ptr noundef %i) #0 {
>   entry:
>     %i.addr = alloca ptr, align 8
>     %.atomictmp = alloca i32, align 4
>     %atomic-temp = alloca i32, align 4
>     store ptr %i, ptr %i.addr, align 8, !tbaa !3
>     %0 = load ptr, ptr %i.addr, align 8, !tbaa !3
>     store i32 1, ptr %.atomictmp, align 4, !tbaa !7
>     %1 = load i32, ptr %.atomictmp, align 4
>     %2 = atomicrmw and ptr %0, i32 %1 monotonic, align 4
>     store i32 %2, ptr %atomic-temp, align 4
>     %3 = load i32, ptr %atomic-temp, align 4, !tbaa !7
>     ret void
>   }
>
> The IR right before machine code generation:
>
>   define dso_local void @f(ptr nocapture noundef %i) local_unnamed_addr #0 {
>   entry:
>     %0 = atomicrmw and ptr %i, i32 1 monotonic, align 4
>     ret void
>   }
>
> Maybe we could special-case the above to generate a locked insn if
>   - the operator is atomicrmw,
>   - the consistency is monotonic (relaxed), and
>   - the return value is not used.
>
> This way we will not violate the original program semantics.
> Does this sound like a reasonable approach?

Whether monotonic consistency is desired (ordered writes) can probably be
deduced from the memory_order_* flag of the built-ins, but I don't know what
atomicrmw is... what is it in non-llvm terms?

>>
>> So let's make this test mcpu=v3 only and use normal C ?
>>
>> pw-bot: cr
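
For reference, a rough, hypothetical sketch of what such a "normal C" variant
of the and() program could look like; this is not the actual patch, the
globals below are plain stand-ins for the arena values (the arena
address-space handling of the real test is ignored), and per the discussion
above the 32-bit &= form needs -mcpu=v3:

  #include <stdatomic.h>

  /* Stand-ins for the arena globals used by arena_atomics.c. */
  _Atomic unsigned long long and64_value;
  _Atomic unsigned int and32_value;

  int and(const void *ctx)
  {
          /* seq_cst compound assignments: lowered to atomic_fetch_and()
           * instructions (w-register form for the 32-bit case).
           */
          and64_value &= (0x011ull << 32);
          and32_value &= 0x011;

          /* Relaxed ordering with the result discarded: the pattern that the
           * special-casing proposed above could lower to a plain
           * 'lock *(u32 *)(rX + 0) &= rY' insn.
           */
          atomic_fetch_and_explicit(&and32_value, 0x011, memory_order_relaxed);
          return 0;
  }
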