On Wed, Aug 17, 2022 at 01:22:09AM +0900, Hector Martin wrote:
> On 16/08/2022 23.55, Boqun Feng wrote:
> > On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> >> It's worth noting that with the spinlock-based implementation (i.e.
> >> prior to e986a0d6cb36) then we would have the same problem on
> >> architectures that implement spinlocks with acquire/release semantics;
> >> accesses from outside of the critical section can drift in and reorder
> >> with each other there, so the conversion looked legitimate to me in
> >> isolation and I vaguely remember going through callers looking for
> >> potential issues. Alas, I obviously missed this case.
> >>
> >
> > I just want to mention that although spinlock-based atomic bitops
> > don't provide the full barrier in test_and_set_bit(), they don't
> > have the problem spotted by Hector, because test_and_set_bit() and
> > clear_bit() sync with each other via locks:
> >
> >     test_and_set_bit():
> >         lock(..);
> >         old = *p; // mask is already set by other test_and_set_bit()
> >         *p = old | mask;
> >         unlock(..);
> >     clear_bit():
> >         lock(..);
> >         *p &= ~mask;
> >         unlock(..);
> >
> > so "having a full barrier before test_and_set_bit()" may not be the
> > exact thing we need here, as long as a test_and_set_bit() can sync with
> > a clear_bit() unconditionally, then the world is safe. For example, we
> > can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:
> >
> >     test_and_set_bit():
> >         atomic_long_fetch_or_release(..); // pair with clear_bit()
> >                                           // guarantee everything is
> >                                           // observed.
> >     clear_bit():
> >         atomic_long_fetch_andnot_acquire(..);
> >
> > , maybe that's somewhat cheaper than a full barrier implementation.
> >
> > Thoughts? Just to find the exact ordering requirement for bitops.
>
> It's worth pointing out that the workqueue code does *not* pair
> test_and_set_bit() with clear_bit(). It does an atomic_long_set()
> instead (and then there are explicit barriers around it, which are
> expected to pair with the implicit barrier in test_and_set_bit()). If we
> define test_and_set_bit() to only sync with clear_bit() and not
> necessarily be a true barrier, that breaks the usage of the workqueue code.
>

Ah, I missed that, but that means the old spinlock-based atomics are
totally broken unless spinlock means full barriers on these archs.

But still, if we define test_and_set_bit() as a RELEASE atomic instead of
a full barrier + atomic, it should work for workqueue, right? Do we
actually need extra ordering here?

    WRITE_ONCE(*x, 1);      // A
    test_and_set_bit(..);   // a full barrier will order A & B
    WRITE_ONCE(*y, 1);      // B

That's something I want to figure out.

Regards,
Boqun

> - Hector
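
To make the A/B question concrete, here is a rough userspace sketch using C11 atomics. It is not the kernel implementation: every sketch_* name is hypothetical, and the fence-based "full" variant only approximates the kernel's fully ordered RMW semantics. The point is that a RELEASE-only RMW keeps A ahead of the bit operation but by itself says nothing about B, whereas the fully ordered variant orders A before B.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel bitops, for illustration only. */

/* RELEASE-only: earlier accesses cannot move after the RMW's store,
 * but later accesses are not prevented from moving before it. */
static bool sketch_test_and_set_bit_release(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_or_explicit(addr, mask,
                                                 memory_order_release);
    return (old & mask) != 0;
}

/* Approximation of a fully ordered RMW: fence, relaxed RMW, fence. */
static bool sketch_test_and_set_bit_full(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    unsigned long old;

    atomic_thread_fence(memory_order_seq_cst);
    old = atomic_fetch_or_explicit(addr, mask, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return (old & mask) != 0;
}

/* ACQUIRE clear, mirroring the atomic_long_fetch_andnot_acquire() idea. */
static void sketch_clear_bit_acquire(unsigned int nr, atomic_ulong *addr)
{
    atomic_fetch_and_explicit(addr, ~(1UL << nr), memory_order_acquire);
}

/* The pattern from the mail: A, then the bit operation, then B.
 * With the RELEASE variant, A is ordered before the RMW, but B may
 * still become visible before the RMW (and hence before A). */
static bool sketch_pattern_release(atomic_int *x, atomic_int *y,
                                   atomic_ulong *flags)
{
    bool was_set;

    atomic_store_explicit(x, 1, memory_order_relaxed);    /* A */
    was_set = sketch_test_and_set_bit_release(0, flags);
    atomic_store_explicit(y, 1, memory_order_relaxed);    /* B */
    return was_set;
}
```

Single-threaded, both variants behave identically; the difference only shows up in what another CPU is allowed to observe, so a real answer needs a litmus test against the memory model rather than a run of this code.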