Hi Linus, Paul,

Apologies for the slow response here; believe it or not, I was attending
a workshop about memory ordering.

On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > Even alpha is specified to be locally ordered wrt *one* memory
> > location, including for reads (See table 5-1: "Processor issue order",
> > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > new value, a subsequent read is not allowed to see an older one - even
> > without a memory barrier.
> >
> > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > loads/loads. Because maybe alpha for once isn't the weakest possible
> > ordering.
>
> The "bad boy" in this case is Itanium, which can do some VLIW reordering
> of accesses. Or could, I am not sure that newer Itanium hardware
> does this. But this is why Itanium compilers made volatile loads use
> the ld,acq instruction.
>
> Which means that aligned same-sized marked accesses to a single location
> really do execute consistently with some global ordering, even on
> Itanium.

Although this is true, there's a really subtle issue which crops up if
you try to compose this read-after-read ordering with dependencies in the
case where the two reads read the same value (which is encapsulated by
the unusual RSW litmus test that I've tried to convert to C below):


/* Global definitions; assume everything zero-initialised */
struct foo {
	int *x;
};

int x;
struct foo foo;
struct foo *ptr;


/* CPU 0 */
WRITE_ONCE(x, 1);
WRITE_ONCE(foo.x, &x);

/*
 * Release ordering to ensure that somebody following a non-NULL ptr will
 * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
 */
smp_store_release(&ptr, &foo);


/* CPU 1 */
int *xp1, *xp2, val;
struct foo *foop;

/* Load the global pointer and check that it's not NULL. */
foop = READ_ONCE(ptr);
if (!foop)
	return;

/*
 * Load 'foo.x' via the pointer we just loaded. This is ordered after the
 * previous READ_ONCE() because of the address dependency.
 */
xp1 = READ_ONCE(foop->x);

/*
 * Load 'foo.x' directly via the global 'foo'.
 * _This is loading the same address as the previous READ_ONCE() and
 * therefore cannot return a stale (NULL) value!_
 */
xp2 = READ_ONCE(foo.x);

/*
 * Load 'x' via the pointer we just loaded.
 * _We may see zero here!_
 */
val = READ_ONCE(*xp2);


So in this case, the two same-location reads on CPU1 are actually
executed out of order, but you can't tell just by looking at them in
isolation because they returned the same value (i.e. xp1 == xp2 == &x).
However, you *can* tell when you compose them in funny situations such as
above (and I believe that this is demonstrably true on PPC and Arm;
effectively the read-after-read ordering machinery only triggers in
response to incoming snoops).

There's probably a more-compelling variant using an (RCpc) load acquire
instead of the last address dependency on CPU 1 as well.
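Something like the below, although I haven't run it through the memory
model and it's completely untested, so please treat it purely as a
sketch:

/* CPU 1 (RCpc acquire variant; illustrative only) */
int *xp1, *xp2, val;
struct foo *foop;

/* Load the global pointer and check that it's not NULL. */
foop = READ_ONCE(ptr);
if (!foop)
	return;

/* Still ordered after the previous READ_ONCE() by the address dependency. */
xp1 = READ_ONCE(foop->x);

/*
 * Same-location load as before, but now an RCpc acquire, so the final
 * load below gets its ordering from the acquire rather than from an
 * address dependency.
 */
xp2 = smp_load_acquire(&foo.x);

/* The interesting question is whether this can still return zero. */
val = READ_ONCE(*xp2);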
Anyway, I think all I'm trying to say is that I'd tend to shy away from
relying on read-after-read ordering if it forms part of a more involved
ordering relationship.

> > But the patch looks fine, though I agree that the ordering in
> > __wait_on_buffer should probably be moved into
> > wait_on_bit/wait_on_bit_io.
> >
> > And would we perhaps want the bitops to have the different ordering
> > versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> > would seem to be (a) cleaner and (b) possibly generate better code for
> > architectures where that makes a difference?
>
> As always, I defer to the architecture maintainers on this one.

FWIW, that makes sense to me. We already have release/acquire/relaxed
variants of the bitops in the atomic_t API so it seems natural to have a
counterpart in the actual bitops API as well.

Will
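A completely untested sketch of the sort of thing Linus is suggesting,
built on the existing primitives (the names are hypothetical, just lifted
from his mail above):

static inline void set_bit_release(long nr, volatile unsigned long *addr)
{
	/* Full barrier: stronger than strictly needed for release. */
	smp_mb__before_atomic();
	set_bit(nr, addr);
}

static inline bool test_bit_acquire(long nr,
				    const volatile unsigned long *addr)
{
	/* Acquire-load the word containing the bit, then test it. */
	unsigned long val = smp_load_acquire(&addr[BIT_WORD(nr)]);

	return val & BIT_MASK(nr);
}

The release side uses a full barrier for simplicity; a true release could
be weaker on some architectures.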