On Fri, Jun 18, 2021 at 05:38:37PM -0700, Jakub Kicinski wrote:
> On Fri, 18 Jun 2021 17:30:47 -0700 Jakub Kicinski wrote:
> > On Thu, 17 Jun 2021 09:04:14 +0800 Yunsheng Lin wrote:
> > > The spin_trylock() was assumed to contain the implicit
> > > barrier needed to ensure the correct ordering between
> > > STATE_MISSED setting/clearing and STATE_MISSED checking
> > > in commit a90c57f2cedd ("net: sched: fix packet stuck
> > > problem for lockless qdisc").
> > >
> > > But it turns out that spin_trylock() only has load-acquire
> > > semantic, for strongly-ordered system(like x86), the compiler
> > > barrier implicitly contained in spin_trylock() seems enough
> > > to ensure the correct ordering. But for weakly-orderly system
> > > (like arm64), the store-release semantic is needed to ensure
> > > the correct ordering as clear_bit() and test_bit() is store
> > > operation, see queued_spin_lock().
> > >
> > > So add the explicit barrier to ensure the correct ordering
> > > for the above case.
> > >
> > > Fixes: a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc")
> > > Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
> >
> > Acked-by: Jakub Kicinski <kuba@xxxxxxxxxx>
>
> Actually.. do we really need the _before_atomic() barrier?
> I'd think we only need to make sure we re-check the lock
> after we set the bit, ordering of the first check doesn't
> matter.

When debugging pointed to misordering between STATE_MISSED
setting/clearing and STATE_MISSED checking, only _after_atomic()
was added at first, and that did not fix the problem. Only when
both _before_atomic() and _after_atomic() were added did the
misordering disappear.

I suppose _before_atomic() matters because the STATE_MISSED
setting and the lock rechecking are only done when the first
check of STATE_MISSED returns false. _before_atomic() is used to
make sure the first check returns the correct result; if it does
not, we may have a misordering problem too.

      cpu0                        cpu1
                              clear MISSED
                             _after_atomic()
                                dequeue
    enqueue
 first trylock() #false
  MISSED check #*true* ?

As shown above, even though cpu1 has an _after_atomic() between
clearing STATE_MISSED and dequeuing, we might still need a
barrier to prevent cpu0 from speculatively checking MISSED before
cpu1 clears it?

The implicit load-acquire barrier contained in the first
trylock() does not seem to prevent the above case either. There
is also no load-acquire barrier in pfifo_fast_dequeue(), which
possibly makes the above case more likely to happen.
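
To make the placement of the two barriers concrete, here is a rough
user-space sketch using C11 atomics. It is not the patch itself:
struct model_qdisc and the model_*() helpers are hypothetical names,
and mapping smp_mb__before_atomic()/smp_mb__after_atomic() onto
atomic_thread_fence(memory_order_seq_cst) is an assumption made only
for illustration. The trylock is deliberately acquire-only, to mirror
the load-acquire semantic described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the qdisc state discussed above. */
struct model_qdisc {
	atomic_bool locked;	/* models qdisc->seqlock */
	atomic_bool missed;	/* models the STATE_MISSED bit */
};

/* Acquire-only trylock: orders later accesses after the lock, nothing more. */
static bool model_trylock(struct model_qdisc *q)
{
	return !atomic_exchange_explicit(&q->locked, true,
					 memory_order_acquire);
}

static void model_unlock(struct model_qdisc *q)
{
	assert(atomic_load(&q->locked));	/* caller must hold the lock */
	atomic_store_explicit(&q->locked, false, memory_order_release);
}

/* Enqueue side (cpu0 in the diagram). Returns true if the lock was taken. */
static bool model_run_begin(struct model_qdisc *q)
{
	if (model_trylock(q))
		return true;

	/* Assumed analogue of smp_mb__before_atomic(): keep the MISSED
	 * check below from being satisfied before the failed trylock.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* First check: someone else already recorded a missed run. */
	if (atomic_load_explicit(&q->missed, memory_order_relaxed))
		return false;

	atomic_store_explicit(&q->missed, true, memory_order_relaxed);

	/* Assumed analogue of smp_mb__after_atomic(): MISSED must be
	 * visible before the lock is re-checked.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	return model_trylock(q);
}

/* Dequeue side (cpu1 in the diagram): clear MISSED, fence, then dequeue. */
static void model_clear_missed(struct model_qdisc *q)
{
	atomic_store_explicit(&q->missed, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}
```

Run on a single thread this only shows the control flow (take the
lock, or set MISSED and retry, or bail out because MISSED is already
set); it cannot reproduce the arm64 misordering, which needs two CPUs
racing as in the diagram.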