On Fri, Jul 06, 2018 at 01:56:14PM +0200, Peter Zijlstra wrote: > > But I couldn't undertand what's wrong without the 1th smp_mb()? > > 1th smp_mb will make all ld/st finish before ldex.w. Is it necessary? > > Yes. > > CPU0 CPU1 > > r1 = READ_ONCE(x); WRITE_ONCE(y, 1); > r2 = xchg(&y, 2); smp_store_release(&x, 1); > > must not allow: r1==1 && r2==0 Also, since you said "SYNC.IS" is a pipeline flush, those instruction-sync primitives normally do not imply a store-buffer flush, does yours? If not it is not a valid smp_mb() implementation. Notably: CPU0 CPU1 WRITE_ONCE(x, 1); WRITE_ONCE(y, 1); smp_mb(); smp_mb(); r0 = READ_ONCE(y); r1 = READ_ONCE(x); must not allow: r0==0 && r1==0 Which would be possible with a regular instruction-sync barrier, but must absolutely not be true with a full memory barrier. (and you can replace the smp_mb(); r = READ_ONCE(); with r = xchg() to again see why you need that first smp_mb()).