Hi Jose,

On Thu, Aug 01, 2024 at 04:20:30PM +0200, Jose E. Marchesi wrote:
> > GCC behaves similarly.
> >
> > For program A:
> >
> >   long foo;
> >
> >   long func () {
> >     return __sync_fetch_and_add(&foo, 1);
> >   }
> >
> > bpf-unknown-none-gcc -O2 compiles to:
> >
> >   0000000000000000 <func>:
> >      0: 18 00 00 00 00 00 00 00  r0=0 ll
> >      8: 00 00 00 00 00 00 00 00
> >     10: b7 01 00 00 01 00 00 00  r1=1
> >     18: db 10 00 00 01 00 00 00  r1=atomic_fetch_add((u64*)(r0+0),r1)
> >     20: bf 10 00 00 00 00 00 00  r0=r1
> >     28: 95 00 00 00 00 00 00 00  exit
> >
> > And for program B:
> >
> >   long foo;
> >
> >   long func () {
> >     __sync_fetch_and_add(&foo, 1);
> >     return foo;
> >   }
> >
> > bpf-unknown-none-gcc -O2 compiles to:
> >
> >   0000000000000000 <func>:
> >      0: 18 00 00 00 00 00 00 00  r0=0 ll
> >      8: 00 00 00 00 00 00 00 00
> >     10: b7 01 00 00 01 00 00 00  r1=1
> >     18: db 10 00 00 00 00 00 00  lock *(u64*)(r0+0)+=r1
> >     20: 79 00 00 00 00 00 00 00  r0=*(u64*)(r0+0)
> >     28: 95 00 00 00 00 00 00 00  exit
> >
> > Internally:
> >
> > - When compiling program A, GCC decides to emit an
> >   `atomic_fetch_addDI' insn, documented as:
> >
> >     'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
> >     'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
> >     'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
> >
> >     These patterns emit code for an atomic operation on memory with
> >     memory model semantics, and return the original value.  Operand 0
> >     is an output operand which contains the value of the memory
> >     location before the operation was performed.  Operand 1 is the
> >     memory on which the atomic operation is performed.  Operand 2 is
> >     the second operand to the binary operator.  Operand 3 is the
> >     memory model to be used by the operation.
> >
> >   The BPF backend defines atomic_fetch_add for DI modes (long) to
> >   expand to this BPF instruction:
> >
> >     %w0 = atomic_fetch_add((<smop> *)%1, %w0)
> >
> > - When compiling program B, GCC decides to emit an `atomic_addDI'
> >   insn, documented as:
> >
> >     'atomic_addMODE', 'atomic_subMODE'
> >     'atomic_orMODE', 'atomic_andMODE'
> >     'atomic_xorMODE', 'atomic_nandMODE'
> >
> >     These patterns emit code for an atomic operation on memory with
> >     memory model semantics.  Operand 0 is the memory on which the
> >     atomic operation is performed.  Operand 1 is the second operand
> >     to the binary operator.  Operand 2 is the memory model to be
> >     used by the operation.
> >
> >   The BPF backend defines atomic_add for DI modes (long) to expand
> >   to this BPF instruction:
> >
> >     lock *(<smop> *)%w0 += %w1
> >
> > This is done for all targets.  In x86-64, for example, case A
> > compiles to:
> >
> >   0000000000000000 <func>:
> >      0: b8 01 00 00 00        mov    $0x1,%eax
> >      5: f0 48 0f c1 05 00 00  lock xadd %rax,0x0(%rip)   # e <func+0xe>
> >      c: 00 00
> >      e: c3                    retq
> >
> > And case B compiles to:
> >
> >   0000000000000000 <func>:
> >      0: f0 48 83 05 00 00 00  lock addq $0x1,0x0(%rip)   # 9 <func+0x9>
> >      7: 00 01
> >      9: 48 8b 05 00 00 00 00  mov    0x0(%rip),%rax      # 10 <func+0x10>
> >     10: c3                    retq
> >
> > Why wouldn't the compiler be allowed to optimize from
> > atomic_fetch_add to atomic_add in this case?
>
> Ok, I see.  The generic compiler optimization is ok.  It is the
> backend that is buggy, because it emits BPF instruction sequences with
> different memory ordering semantics for atomic_OP and atomic_fetch_OP.
>
> The only difference between fetching and non-fetching builtins is that
> in one case the original value is returned, and in the other the new
> value.  Other than that, they should be equivalent.
>
> For ARM64, GCC generates for case A:
>
>   0000000000000000 <func>:
>      0: 90000001  adrp    x1, 0 <func>
>      4: d2800020  mov     x0, #0x1    // #1
>      8: 91000021  add     x1, x1, #0x0
>      c: f8e00020  ldaddal x0, x0, [x1]
>     10: d65f03c0  ret
>
> And this for case B:
>
>   0000000000000000 <func>:
>      0: 90000000  adrp    x0, 0 <func>
>      4: d2800022  mov     x2, #0x1    // #1
>      8: 91000001  add     x1, x0, #0x0
>      c: f8e20021  ldaddal x2, x1, [x1]
>     10: f9400000  ldr     x0, [x0]
>     14: d65f03c0  ret
>
> i.e. GCC emits LDADDAL for both the atomic_add and atomic_fetch_add
> internal insns.  As in x86-64, both sequences have the same memory
> ordering semantics.
>
> Alright, we are changing GCC to always emit the fetch versions of the
> sequences for all the supported atomic operations: add, and, or, xor.
> After the change, the `lock' versions of the instructions will not be
> generated by the compiler at all outside of inline asm.
>
> Will send a heads-up when done.

Thanks for taking care of this!

Peilin Ye