Hi Jose,

On Thu, Aug 01, 2024 at 04:20:30PM +0200, Jose E. Marchesi wrote:
> > GCC behaves similarly.
> >
> > For program A:
> >
> >   long foo;
> >
> >   long func () {
> >     return __sync_fetch_and_add(&foo, 1);
> >   }
> >
> > bpf-unknown-none-gcc -O2 compiles to:
> >
> >   0000000000000000 <func>:
> >      0: 18 00 00 00 00 00 00 00  r0=0 ll
> >      8: 00 00 00 00 00 00 00 00
> >     10: b7 01 00 00 01 00 00 00  r1=1
> >     18: db 10 00 00 01 00 00 00  r1=atomic_fetch_add((u64*)(r0+0),r1)
> >     20: bf 10 00 00 00 00 00 00  r0=r1
> >     28: 95 00 00 00 00 00 00 00  exit
> >
> > And for program B:
> >
> >   long foo;
> >
> >   long func () {
> >     __sync_fetch_and_add(&foo, 1);
> >     return foo;
> >   }
> >
> > bpf-unknown-none-gcc -O2 compiles to:
> >
> >   0000000000000000 <func>:
> >      0: 18 00 00 00 00 00 00 00  r0=0 ll
> >      8: 00 00 00 00 00 00 00 00
> >     10: b7 01 00 00 01 00 00 00  r1=1
> >     18: db 10 00 00 00 00 00 00  lock *(u64*)(r0+0)+=r1
> >     20: 79 00 00 00 00 00 00 00  r0=*(u64*)(r0+0)
> >     28: 95 00 00 00 00 00 00 00  exit
> >
> > Internally:
> >
> > - When compiling program A, GCC decides to emit an
> >   `atomic_fetch_addDI' insn, documented as:
> >
> >     'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
> >     'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
> >     'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
> >
> >     These patterns emit code for an atomic operation on memory with
> >     memory model semantics, and return the original value.  Operand 0
> >     is an output operand which contains the value of the memory
> >     location before the operation was performed.  Operand 1 is the
> >     memory on which the atomic operation is performed.  Operand 2 is
> >     the second operand to the binary operator.  Operand 3 is the
> >     memory model to be used by the operation.
> >
> >   The BPF backend defines atomic_fetch_add for DI modes (long) to
> >   expand to this BPF instruction:
> >
> >     %w0 = atomic_fetch_add((<smop> *)%1, %w0)
> >
> > - When compiling program B, GCC decides to emit an `atomic_addDI'
> >   insn, documented as:
> >
> >     'atomic_addMODE', 'atomic_subMODE'
> >     'atomic_orMODE', 'atomic_andMODE'
> >     'atomic_xorMODE', 'atomic_nandMODE'
> >
> >     These patterns emit code for an atomic operation on memory with
> >     memory model semantics.  Operand 0 is the memory on which the
> >     atomic operation is performed.  Operand 1 is the second operand
> >     to the binary operator.  Operand 2 is the memory model to be
> >     used by the operation.
> >
> >   The BPF backend defines atomic_add for DI modes (long) to expand
> >   to this BPF instruction:
> >
> >     lock *(<smop> *)%w0 += %w1
> >
> > This is done for all targets.  In x86-64, for example, case A
> > compiles to:
> >
> >   0000000000000000 <func>:
> >      0: b8 01 00 00 00        mov    $0x1,%eax
> >      5: f0 48 0f c1 05 00 00  lock xadd %rax,0x0(%rip)   # e <func+0xe>
> >      c: 00 00
> >      e: c3                    retq
> >
> > And case B compiles to:
> >
> >   0000000000000000 <func>:
> >      0: f0 48 83 05 00 00 00  lock addq $0x1,0x0(%rip)   # 9 <func+0x9>
> >      7: 00 01
> >      9: 48 8b 05 00 00 00 00  mov    0x0(%rip),%rax      # 10 <func+0x10>
> >     10: c3                    retq
> >
> > Why wouldn't the compiler be allowed to optimize from
> > atomic_fetch_add to atomic_add in this case?
>
> Ok, I see.  The generic compiler optimization is ok.  It is the
> backend that is buggy, because it emits BPF instruction sequences with
> different memory ordering semantics for atomic_OP and atomic_fetch_OP.
>
> The only difference between fetching and non-fetching builtins is that
> in one case the original value is returned, and in the other the new
> value.  Other than that, they should be equivalent.
>
> For ARM64, GCC generates for case A:
>
>   0000000000000000 <func>:
>      0: 90000001  adrp    x1, 0 <func>
>      4: d2800020  mov     x0, #0x1    // #1
>      8: 91000021  add     x1, x1, #0x0
>      c: f8e00020  ldaddal x0, x0, [x1]
>     10: d65f03c0  ret
>
> And this for case B:
>
>   0000000000000000 <func>:
>      0: 90000000  adrp    x0, 0 <func>
>      4: d2800022  mov     x2, #0x1    // #1
>      8: 91000001  add     x1, x0, #0x0
>      c: f8e20021  ldaddal x2, x1, [x1]
>     10: f9400000  ldr     x0, [x0]
>     14: d65f03c0  ret
>
> i.e. GCC emits LDADDAL for both the atomic_add and atomic_fetch_add
> internal insns.  As in x86-64, both sequences have the same memory
> ordering semantics.
>
> Alright, we are changing GCC to always emit the fetch versions of the
> sequences for all the supported atomic operations: add, and, or, xor.
> After the change, the `lock' versions of the instructions will not be
> generated by the compiler at all outside of inline asm.
>
> Will send a heads-up when done.

Thanks for taking care of this!

Peilin Ye