On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> The arm64 numbers scaled horribly even before, and that's because
> there is too much ping-pong, and it's probably because there is no
> "stickiness" to the cacheline to the core, and thus adding the extra
> loop can make the ping-pong issue even worse because now there is more
> of it.

Actually, if it's using the ll/sc, then I don't see why arm64 should
even change.

It doesn't really even change the pattern: the initial load of the
value is just replaced with a "ll" that gets a non-zero value, and
then we re-try without even doing the "sc" part.

End result: exact same "load once, then do ll/sc to update". Just
using a slightly different instruction pattern.

But maybe "ll" does something different to the cacheline than a
regular "ld"?

Alternatively, the machine you used is using LSE, and the "swp" thing
has some horrid behavior when it fails.

So I take it back. I'm actually surprised that arm64 performs worse. I
don't think it should. But numbers walk, bullshit talks, and it
clearly does make for worse numbers on arm64.

              Linus
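
For readers following the argument, the two patterns being compared can be sketched roughly as below. This is only an illustration in portable C11 atomics, not the code under discussion: the counter layout (0 means unlocked, a negative value means held exclusively, each reader adds READ_BIAS) and both function names are assumptions made up for the example. The point is that when the guess of 0 is wrong, the first compare-and-swap in the second variant only reads the current value, which on an ll/sc machine would typically be the "ll" without the "sc".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: 0 = unlocked, < 0 = held exclusively,
 * each reader adds READ_BIAS. Not taken from the original mail. */
#define READ_BIAS 1L

/* Pattern 1: plain load first, then compare-and-swap to update. */
static bool trylock_read_load_first(atomic_long *count)
{
    long old = atomic_load_explicit(count, memory_order_relaxed);

    while (old >= 0) {
        if (atomic_compare_exchange_weak_explicit(count, &old,
                                                  old + READ_BIAS,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
        /* On failure 'old' holds the value that was observed; retry. */
    }
    return false;
}

/* Pattern 2: skip the plain load and guess that the lock is free.
 * If the guess is wrong, the failed compare-and-swap hands back the
 * real value, so it plays the role of the initial load. */
static bool trylock_read_guess_zero(atomic_long *count)
{
    long old = 0; /* optimistic guess: unlocked */

    do {
        if (atomic_compare_exchange_weak_explicit(count, &old,
                                                  old + READ_BIAS,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
        /* 'old' now holds the observed value (or is unchanged after a
         * spurious failure); retry unless it shows an exclusive holder. */
    } while (old >= 0);
    return false;
}
```

Both variants touch the counter's cacheline once to learn its value and once more to update it. Whether the hardware treats the exclusive load differently from a regular one is exactly the open question raised above, and this sketch does not answer it.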