Re: 16-bit store instructions &c?

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Thu, 29 Aug 2024 06:37:50 -0700

On Wed, Aug 28, 2024 at 10:01:06PM +0200, Arnd Bergmann wrote:
> On Wed, Aug 28, 2024, at 14:22, Paul E. McKenney wrote:
> > On Wed, Aug 28, 2024 at 01:48:41PM +0000, Arnd Bergmann wrote:
> >
> >> There is a related problem with ARM RiscPC, which
> >> uses a kernel built with -march=armv3, and that
> >> disallows 16-bit load/store instructions entirely,
> >> similar to how alpha ev5 and earlier lacked both
> >> byte and word access.
> >
> > And one left to go.  Progress, anyway.  ;-)
> 
> What I meant to say about this one is also that we can probably
> ignore it as well, since it's on the way out already, at the latest
> when gcc-9 becomes the minimum compiler, as gcc-8 was the last
> to support -march=armv3. We can also ask Russell if he's ok with
> dropping it earlier, as he is almost certainly the only user.

Even better, thank you!

My plan is to submit a pull request for the remaining three 8-bit
cmpxchg() emulation commits into the upcoming merge window.  In the
meantime, I will create similar patches for 16-bit cmpxchg() and perhaps
also both 8-bit and 16-bit xchg().  I will obviously CC both you and
Russell on the full set.  And if there are hardware-incompatibility
complaints, we can deal with them, whether by dropping the offending
pieces of my patches or by whatever other adjustments make sense.

Does that seem like a reasonable approach, or is there a better way?

> >> Everything else that I see has native load/store
> >> on 16-bit words and either has 16-bit atomics or
> >> can emulate them using the 32-bit ones.
> >> 
> >> However, the one thing that people usually
> >> want 16-bit xchg() for is qspinlock, and that
> >> one not only depends on it being atomic but also
> >> on strict forward-progress guarantees, which
> >> I think the emulated version can't provide
> >> in general.
> >> 
> >> This does not prevent architectures from doing
> >> it anyway.
> >
> > Given that the simpler spinlock does not provide forward-progress
> > guarantees, I don't see any reason that these guarantees cannot be voided
> > for architectures without native 16-bit stores and atomics.
> >
> > After all, even without those guarantees, qspinlock provides very real
> > benefits over simple spinlocks.
> 
> My understanding of this problem is that with a trivial bit spinlock,
> the worst case is that one task never gets the lock while others
> also want it, but a qspinlock based on a flawed xchg() implementation
> may end with none of the CPUs ever getting the lock. It may not
> matter in practice, but it does feel worse.

I could argue that there is no law saying that a flawed atomic operation
cannot cause a trivial bit spinlock to never be actually handed to any
CPU, but point taken.  Given that the emulated xchg() would be implemented
in terms of cmpxchg(), there is clearly less opportunity for the hardware
to "do the right thing" in terms of fairness and starvation.  After all,
the hardware very likely has less visibility into a cmpxchg()-emulated
xchg() operation than into a hardware xchg() instruction.
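
To make the concern concrete, here is a rough userspace sketch of the
idea, not the actual kernel patches.  It uses C11 atomics in place of
the kernel's primitives, and the names emu_cmpxchg16() and emu_xchg16()
as well as the "containing 32-bit word plus shift" calling convention
are made up for illustration.  A real implementation would instead
derive the aligned word and the shift from the 16-bit pointer and the
endianness.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: emulate a 16-bit cmpxchg on one half of a
 * 32-bit word using a 32-bit compare-and-swap.  "shift" selects the
 * half (0 or 16).  Returns the 16-bit value that was observed.
 */
static uint16_t emu_cmpxchg16(_Atomic uint32_t *word, unsigned int shift,
			      uint16_t oldval, uint16_t newval)
{
	uint32_t mask = (uint32_t)0xffff << shift;
	uint32_t cur = atomic_load(word);

	for (;;) {
		uint16_t seen = (uint16_t)((cur & mask) >> shift);
		uint32_t want;

		if (seen != oldval)
			return seen;	/* Comparison failed, no store. */
		want = (cur & ~mask) | ((uint32_t)newval << shift);
		/* On failure, "cur" is refreshed with the current value. */
		if (atomic_compare_exchange_weak(word, &cur, want))
			return oldval;
	}
}

/*
 * Hypothetical helper: emulate a 16-bit xchg with a 32-bit
 * compare-and-swap loop.  The loop retries whenever either half of
 * the word changes underneath it, so it has no bound on retries.
 */
static uint16_t emu_xchg16(_Atomic uint32_t *word, unsigned int shift,
			   uint16_t newval)
{
	uint32_t mask = (uint32_t)0xffff << shift;
	uint32_t cur = atomic_load(word);
	uint32_t want;

	do {
		want = (cur & ~mask) | ((uint32_t)newval << shift);
	} while (!atomic_compare_exchange_weak(word, &cur, want));

	return (uint16_t)((cur & mask) >> shift);
}
```

Note that the retry loop in emu_xchg16() can be forced around again by
stores to the neighboring 16 bits, which is exactly where the weaker
forward-progress behavior would come from.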

Perhaps the best approach is a comment on the xchg() emulation stating
that it might offer weaker forward-progress guarantees.

An alternative approach is to emulate 16-bit cmpxchg(), and defer
emulation of 8-bit and 16-bit xchg().

Thoughts?

							Thanx, Paul