On Tue, Jul 27, 2021 at 09:52:26AM +0800, Wang Rui wrote:
> I think forward progress is guaranteed as long as all operations are
> atomic (ll/sc or amo). If ll/sc runs on a fast CPU there will be random
> delays; is that okay? Otherwise, for such hardware, we couldn't even
> implement a generic spinlock with ll/sc.
>
> And I also think that the hardware supports a normal store for
> unlocking (e.g. arch_spin_unlock).
>
> In qspinlock, when _Q_PENDING_BITS == 1, it works on all hardware,
> because clear_pending()/clear_pending_set_locked() are both atomic
> operations. Isn't that right?
>
> Q: Why does a live lock happen when _Q_PENDING_BITS == 8?
> A: One case I found is:
>
>  * CPU A updates a sub-word of the qspinlock at high frequency with a
>    normal store.
>  * CPU B does xchg_tail() with load + cmpxchg, and the value seen by
>    the plain load is never equal to the value seen by the ll inside
>    cmpxchg.
>
> qspinlock layout:
>  byte 0: locked
>  byte 1: pending
>  byte 2: tail
>
>     CPU A                        CPU B
>  1:                           1: <--------------------+
>     sh  $newval, &locked         lw  $v1, &qspinlock  |
>     add $newval, 1               and $t1, $v1, ~mask  |
>     b   1b                       or  $t1, $t1, newval | (live lock path)
>                                  ll  $v2, &qspinlock  |
>                                  bne $v1, $v2, 1b ----+
>                                  sc  $t1, &qspinlock
>                                  beq $t1, 0, 1b
>
> If xchg_tail is implemented like this, there is at least no live lock
> on Loongson:
>
> xchg_tail:
>
>  1:
>     ll  $v1, &qspinlock
>     and $t1, $v1, ~mask
>     or  $t1, $t1, newval
>     sc  $t1, &qspinlock
>     beq $t1, 0, 1b
>
> For hardware whose ll/sc is based on cache coherency, I think the sc
> can easily succeed: the ll makes the cache line exclusive to CPU B, and
> the store from CPU A has to acquire exclusive ownership again, so the
> sc may complete before that happens.

This! I've been saying this for ages. All those xchg16() implementations
are broken for using cmpxchg() on LL/SC. Not because xchg16() is
fundamentally flawed.

Perhaps we should introduce:

  atomic_nand_or()
  atomic_fetch_nand_or()

and implement short xchg() using those; then we can have the whole mask
setup shared.

It just means you get to implement those primitives for *all* archs :-)

Also, the _Q_PENDING_BITS == 1 case can use that primitive.
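
A hedged sketch (my illustration only, not kernel code) of what a 16-bit
xchg() built on such a fetch-style nand_or primitive could look like.
The name atomic_fetch_nand_or() and its signature are taken from the
suggestion above and are assumptions, not an existing API, and
xchg16_sketch() is a made-up helper. The generic fallback below uses a
CAS loop (exactly the pattern that can live-lock on LL/SC); the point of
making it an arch primitive is that each LL/SC arch would implement it
as a single ll/sc loop (the load is the ll, then merge, then sc), so no
plain load ever races against the reservation:

  #include <stdint.h>
  #include <stdatomic.h>

  /* Hypothetical primitive: old = *p; *p = (old & ~mask) | bits; return old. */
  static inline uint32_t atomic_fetch_nand_or(_Atomic uint32_t *p,
                                              uint32_t mask, uint32_t bits)
  {
          uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

          /* Generic CAS-loop fallback; an LL/SC arch would do one ll/sc loop. */
          while (!atomic_compare_exchange_weak_explicit(p, &old,
                                                        (old & ~mask) | bits,
                                                        memory_order_acq_rel,
                                                        memory_order_relaxed))
                  ;
          return old;
  }

  /* 16-bit exchange built on the 32-bit primitive; this is the shared
   * mask setup that every short xchg()/cmpxchg() user could reuse. */
  static inline uint16_t xchg16_sketch(uint16_t *addr, uint16_t newval)
  {
          uintptr_t a = (uintptr_t)addr;
          _Atomic uint32_t *word = (_Atomic uint32_t *)(a & ~3UL);
          unsigned int shift = (a & 2U) * 8;      /* little-endian layout assumed */
          uint32_t mask = 0xffffU << shift;
          uint32_t old;

          old = atomic_fetch_nand_or(word, mask, (uint32_t)newval << shift);
          return (uint16_t)(old >> shift);
  }

With something like that, xchg_tail() is just such a sub-word exchange
on the tail halfword, and the _Q_PENDING_BITS == 1 paths could
presumably reuse the same primitive with a pending-bit mask instead.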