On Wed, Mar 31, 2021 at 11:22:35PM +0800, Guo Ren wrote:
> On Mon, Mar 29, 2021 at 8:50 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Mar 29, 2021 at 08:01:41PM +0800, Guo Ren wrote:
> > > u32 a = 0x55aa66bb;
> > > u16 *ptr = &a;
> > >
> > > CPU0                  CPU1
> > > =========             =========
> > > xchg16(ptr, new)      while(1)
> > >                           WRITE_ONCE(*(ptr + 1), x);
> > >
> > > When we use lr.w/sc.w implement xchg16, it'll cause CPU0 deadlock.
> >
> > Then I think your LL/SC is broken.
> No, it's not broken LR.W/SC.W. Quote <8.3 Eventual Success of
> Store-Conditional Instructions>:
>
> "As a consequence of the eventuality guarantee, if some harts in an
> execution environment are executing constrained LR/SC loops, and no
> other harts or devices in the execution environment execute an
> unconditional store or AMO to that reservation set, then at least one
> hart will eventually exit its constrained LR/SC loop. By contrast, if
> other harts or devices continue to write to that reservation set, it
> is not guaranteed that any hart will exit its LR/SC loop."

(there, reflowed it for you)

That just means your arch spec is broken too :-)

> So I think it's a feature of LR/SC. How does the above code (also use
> ll.w/sc.w to implement xchg16) running on arm64?
>
> 1: ldxr
>    eor
>    cbnz ... 2f
>    stxr
>    cbnz ... 1b // I think it would deadlock for arm64.
>
> "LL/SC fwd progress" which you have mentioned could guarantee stxr
> success? How hardware could do that?

I'm not a hardware person; I've never actually built anything larger
than a 4 bit adder with nand gates (IIRC, 25+ years ago). And I'll let
Will answer the ARM64 part.

That said, I think the idea is that if you lock the line (load-locked is
a clue of course) to the core until either: an exception (or anything
else that is guaranteed to fail LL/SC), SC or N instructions, then a
competing LL/SC will stall in the LL while the first core makes
progress.

This same principle is key to hardware progress for cmpxchg/cas loops:
don't instantly yield the exclusive hold on the cacheline, keep it
around for a while.

Out-of-order CPUs can do even better I think, by virtue of them being
able to see tight loops.

Anyway, given you have such a crap architecture (and here I thought
RISC-V was supposed to be a modern design *sigh*), you had better go
look at the sparc64 atomic implementation, which has a software backoff
for failed CAS in order to make fwd progress.
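
A minimal C11 sketch of the software-backoff idea, not the sparc64 code
itself: a 16-bit exchange built on a 32-bit compare-and-swap that spins
for an exponentially growing interval after each failed attempt. The
names xchg16_backoff, backoff_spin and BACKOFF_LIMIT are hypothetical,
and the limit is an untuned placeholder.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical cap on the backoff window; an assumption, not a tuned value. */
#define BACKOFF_LIMIT 1024u

/* Busy-wait for roughly n iterations; the volatile counter keeps the loop. */
static void backoff_spin(unsigned int n)
{
	for (volatile unsigned int i = 0; i < n; i++)
		;
}

/*
 * Exchange one 16-bit half of a 32-bit word using only a 32-bit CAS.
 * half == 0 selects bits 0-15, half == 1 selects bits 16-31.
 * Returns the previous value of the selected half.
 */
static uint16_t xchg16_backoff(_Atomic uint32_t *word, unsigned int half,
			       uint16_t newval)
{
	assert(half < 2);

	unsigned int shift = half * 16;
	uint32_t mask = (uint32_t)0xffff << shift;
	unsigned int delay = 1;
	uint32_t old = atomic_load_explicit(word, memory_order_relaxed);

	for (;;) {
		/* Keep the other half as last observed, replace ours. */
		uint32_t desired = (old & ~mask) | ((uint32_t)newval << shift);

		/* On failure, old is refreshed with the current contents. */
		if (atomic_compare_exchange_weak_explicit(word, &old, desired,
							  memory_order_acq_rel,
							  memory_order_relaxed))
			return (uint16_t)((old & mask) >> shift);

		/* Failed CAS: back off, doubling the wait up to the cap. */
		backoff_spin(delay);
		if (delay < BACKOFF_LIMIT)
			delay *= 2;
	}
}
```

With the example word from the thread, xchg16_backoff(&a, 0, new) on
a = 0x55aa66bb returns 0x66bb and leaves the upper half untouched.
Backoff of this kind reduces contention between competing loops but
does not by itself bound the number of retries.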