On Mon, 2024-05-06 at 14:46 +0000, Puranjay Mohan wrote:
> Ilya Leoshkevich <iii@xxxxxxxxxxxxx> writes:
> 
> > Puranjay Mohan <puranjay@xxxxxxxxxx> writes:
> > 
> > > > The BPF atomic operations with the BPF_FETCH modifier, along with
> > > > BPF_XCHG and BPF_CMPXCHG, are fully ordered, but the RISC-V JIT
> > > > implements all atomic operations except BPF_CMPXCHG with relaxed
> > > > ordering.
> > > 
> > > I know that the BPF memory model is in the works and we currently
> > > don't have a way to make all the JITs consistent. But as far as
> > > atomic operations are concerned, here are my observations:
> > 
> > [...]
> > 
> > > 4. S390
> > > ----
> > > 
> > > Ilya, can you help with this?
> > > 
> > > I see that the kernel is emitting "bcr 14,0" after "laal|laalg",
> > > but the JIT is not.
> > 
> > Hi,
> > 
> > Here are two relevant paragraphs from the Principles of Operation:
> > 
> > Relation between Operand Accesses
> > =================================
> > As observed by other CPUs and by channel programs, storage-operand
> > fetches associated with one instruction execution appear to precede
> > all storage-operand references for conceptually subsequent
> > instructions. A storage-operand store specified by one instruction
> > appears to precede all storage-operand stores specified by
> > conceptually subsequent instructions, but it does not necessarily
> > precede storage-operand fetches specified by conceptually subsequent
> > instructions. However, a storage-operand store appears to precede a
> > conceptually subsequent storage-operand fetch from the same
> > main-storage location.
> > 
> > In short, all memory accesses are fully ordered except for stores
> > followed by fetches from different addresses.
> 
> Thanks for sharing the details.
> 
> So, this is TSO, like x86.
> 
> > LAALG R1,R3,D2(B2)
> > ==================
> > [...]
> > All accesses to the second-operand location appear to be a
> > block-concurrent interlocked-update reference as observed by other
> > CPUs and the I/O subsystem. A specific-operand-serialization
> > operation is performed.
> > 
> > Specific-operand-serialization is weaker than full serialization,
> > which means that, even though s390x provides very strong ordering
> > guarantees, strictly speaking, as architected, s390x atomics are
> > not fully ordered.
> > 
> > I have a hard time thinking of a situation where a store-fetch
> > reordering for different addresses could matter, but to be on the
> > safe side we should probably just do what the kernel does and add a
> > "bcr 14,0". I will send a patch.
> 
> Thanks,
> 
> IMO, bcr 14,0 would be needed because s390x has both
> 
> int __atomic_add(int i, int *v);
> 
> and
> 
> int __atomic_add_barrier(int i, int *v);
> 
> Both of these do the fetch operation, but the second one adds a
> barrier (bcr 14,0).
> 
> The JIT was using the first one (without the barrier) to implement
> arch_atomic_fetch_add.
> 
> So, assuming that without this barrier store->fetch reordering would
> be allowed, as you said, it would mean that this litmus test would
> fail for the s390 JIT:
> 
> C SB+atomic_fetch
> 
> (*
>  * Result: Never
>  *
>  * This litmus test demonstrates that LKMM expects total ordering from
>  * atomic_*() operations with fetch or return.
>  *)
> 
> {
> 	atomic_t dummy1 = ATOMIC_INIT(1);
> 	atomic_t dummy2 = ATOMIC_INIT(1);
> }
> 
> P0(int *x, int *y, atomic_t *dummy1)
> {
> 	WRITE_ONCE(*x, 1);
> 	rd = atomic_fetch_add(1, dummy1); /* assuming this is
> 					     implemented without barrier */
> 	r0 = READ_ONCE(*y);
> }
> 
> P1(int *x, int *y, atomic_t *dummy2)
> {
> 	WRITE_ONCE(*y, 1);
> 	rd = atomic_fetch_add(1, dummy2); /* assuming this is
> 					     implemented without barrier */
> 	r1 = READ_ONCE(*x);
> }
> 
> exists (0:r0=0 /\ 1:r1=0)
> 
> 
> P.S.
> - It would be nice to have a tool that can convert litmus tests into
>   BPF assembly programs, so that we can test them on hardware after
>   JITing.
> 
> Thanks,
> Puranjay

Thanks for providing an example! So unrelated memory accesses may rely
on atomics acting as barriers.

I will adjust my commit message: right now it claims that we are making
the change "just in case", but apparently, if the hardware chooses to
exercise all the freedoms allowed by the architecture, real problems
can occur.