Re: [PATCH v2] make buffer_locked provide an acquire semantics

Will Deacon <will@xxxxxxxxxx> · Mon, 1 Aug 2022 16:41:09 +0100

Hi Linus, Paul,

Apologies for the slow response here; believe it or not, I was attending
a workshop about memory ordering.

On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > Even alpha is specified to be locally ordered wrt *one* memory
> > location, including for reads (See table 5-1: "Processor issue order",
> > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > new value, a subsequent read is not allowed to see an older one - even
> > without a memory barrier.
> > 
> > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > loads/loads. Because maybe alpha for once isn't the weakest possible
> > ordering.
> 
> The "bad boy" in this case is Itanium, which can do some VLIW reordering
> of accesses.  Or could, I am not sure that newer Itanium hardware
> does this.  But this is why Itanium compilers made volatile loads use
> the ld,acq instruction.
> 
> Which means that aligned same-sized marked accesses to a single location
> really do execute consistently with some global ordering, even on Itanium.

Although this is true, there's a really subtle issue which crops up if you
try to compose this read-after-read ordering with dependencies in the case
where the two reads read the same value (which is encapsulated by the
unusual RSW litmus test that I've tried to convert to C below):

/* Global definitions; assume everything zero-initialised */
struct foo {
	int *x;
};

int x;
struct foo foo;
struct foo *ptr;

/* CPU 0 */
WRITE_ONCE(x, 1);
WRITE_ONCE(foo.x, &x);

/*
 * Release ordering to ensure that somebody following a non-NULL ptr will
 * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
 */
smp_store_release(&ptr, &foo);

/* CPU 1 */
int *xp1, *xp2, val;
struct foo *foop;

/* Load the global pointer and check that it's not NULL. */
foop = READ_ONCE(ptr);
if (!foop)
	return;

/*
 * Load 'foo.x' via the pointer we just loaded. This is ordered after the
 * previous READ_ONCE() because of the address dependency.
 */
xp1 = READ_ONCE(foop->x);

/*
 * Load 'foo.x' directly via the global 'foo'.
 * _This is loading the same address as the previous READ_ONCE() and
 *  therefore cannot return a stale (NULL) value!_
 */
xp2 = READ_ONCE(foo.x);

/*
 * Load 'x' via the pointer we just loaded.
 * _We may see zero here!_
 */
val = READ_ONCE(*xp2);

So in this case, the two same-location reads on CPU1 are actually executed
out of order, but you can't tell just by looking at them in isolation
because they returned the same value (i.e. xp1 == xp2 == &x). However, you
*can* tell when you compose them in funny situations such as above (and I
believe that this is demonstrably true on PPC and Arm; effectively the
read-after-read ordering machinery only triggers in response to incoming
snoops).

There's probably a more-compelling variant using an (RCpc) load acquire
instead of the last address dependency on CPU 1 as well.

Anyway, I think all I'm trying to say is that I'd tend to shy away from
relying on read-after-read ordering if it forms part of a more involved
ordering relationship.

> > But the patch looks fine, though I agree that the ordering in
> > __wait_on_buffer should probably be moved into
> > wait_on_bit/wait_on_bit_io.
> > 
> > And would we perhaps want the bitops to have the different ordering
> > versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> > would seem to be (a) cleaner and (b) possibly generate better code for
> > architectures where that makes a difference?
> 
> As always, I defer to the architecture maintainers on this one.

FWIW, that makes sense to me. We already have release/acquire/releaxed
variants of the bitops in the atomic_t API so it seems natural to have
a counterpart in the actual bitops API as well.

Will