On Mon, 2024-05-06 at 14:46 +0000, Puranjay Mohan wrote:
> Ilya Leoshkevich <iii@xxxxxxxxxxxxx> writes:
> 
> > Puranjay Mohan <puranjay@xxxxxxxxxx> writes:
> > 
> > > > The BPF atomic operations with the BPF_FETCH modifier, along with
> > > > BPF_XCHG and BPF_CMPXCHG, are fully ordered, but the RISC-V JIT
> > > > implements all atomic operations except BPF_CMPXCHG with relaxed
> > > > ordering.
> > > 
> > > I know that the BPF memory model is in the works and we currently
> > > don't have a way to make all the JITs consistent. But as far as
> > > atomic operations are concerned, here are my observations:
> > 
> > [...]
> > 
> > > 4. S390
> > > ----
> > > 
> > > Ilya, can you help with this?
> > > 
> > > I see that the kernel is emitting "bcr 14,0" after "laal|laalg",
> > > but the JIT is not.
> > 
> > Hi,
> > 
> > Here are two relevant paragraphs from the Principles of Operation:
> > 
> > Relation between Operand Accesses
> > =================================
> > As observed by other CPUs and by channel programs, storage-operand
> > fetches associated with one instruction execution appear to precede
> > all storage-operand references for conceptually subsequent
> > instructions. A storage-operand store specified by one instruction
> > appears to precede all storage-operand stores specified by
> > conceptually subsequent instructions, but it does not necessarily
> > precede storage-operand fetches specified by conceptually subsequent
> > instructions. However, a storage-operand store appears to precede a
> > conceptually subsequent storage-operand fetch from the same
> > main-storage location.
> > 
> > In short, all memory accesses are fully ordered except for stores
> > followed by fetches from different addresses.
> 
> Thanks for sharing the details.
> 
> So, this is TSO, like x86.
> 
> > LAALG R1,R3,D2(B2)
> > ==================
> > [...]
> > All accesses to the second-operand location appear to be a
> > block-concurrent interlocked-update reference as observed by other
> > CPUs and the I/O subsystem. A specific-operand-serialization
> > operation is performed.
> > 
> > Specific-operand-serialization is weaker than full serialization,
> > which means that, even though s390x provides very strong ordering
> > guarantees, strictly speaking, as architected, s390x atomics are
> > not fully ordered.
> > 
> > I have a hard time thinking of a situation where a store-fetch
> > reordering for different addresses could matter, but to be on the
> > safe side we should probably just do what the kernel does and add a
> > "bcr 14,0". I will send a patch.
> 
> Thanks,
> 
> IMO, bcr 14,0 would be needed because s390x has both
> 
> int __atomic_add(int i, int *v);
> 
> and
> 
> int __atomic_add_barrier(int i, int *v);
> 
> Both of these do the fetch operation, but the second one adds a
> barrier (bcr 14,0).
> 
> The JIT was using the first one (without the barrier) to implement
> arch_atomic_fetch_add.
> 
> So, assuming that without this barrier store->fetch reordering would
> be allowed, as you said, it would mean that this litmus test would
> fail for the s390 JIT:
> 
> C SB+atomic_fetch
> 
> (*
>  * Result: Never
>  *
>  * This litmus test demonstrates that LKMM expects total ordering from
>  * atomic_*() operations with fetch or return.
>  *)
> 
> {
> 	atomic_t dummy1 = ATOMIC_INIT(1);
> 	atomic_t dummy2 = ATOMIC_INIT(1);
> }
> 
> P0(int *x, int *y, atomic_t *dummy1)
> {
> 	WRITE_ONCE(*x, 1);
> 	rd = atomic_fetch_add(1, dummy1); /* assuming this is
> 					     implemented without barrier */
> 	r0 = READ_ONCE(*y);
> }
> 
> P1(int *x, int *y, atomic_t *dummy2)
> {
> 	WRITE_ONCE(*y, 1);
> 	rd = atomic_fetch_add(1, dummy2); /* assuming this is
> 					     implemented without barrier */
> 	r1 = READ_ONCE(*x);
> }
> 
> exists (0:r0=0 /\ 1:r1=0)
> 
> 
> P.S.
> - It would be nice to have a tool that can convert litmus tests into
>   BPF assembly programs, so that we can test them on hardware after
>   JITing.
> 
> Thanks,
> Puranjay

Thanks for providing an example! So unrelated memory accesses may rely
on atomics acting as barriers.

I will adjust my commit message: right now it claims that we are making
the change "just in case", but apparently, if the hardware chooses to
exercise all the freedoms allowed by the architecture, real problems
can occur.