On Thu, Feb 14, 2019 at 6:53 AM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> The ARM64 result is what I would have expected given that the change was
> to optimize for the uncontended case. The x86-64 result is kind of an
> anomaly to me, but I haven't bothered to dig into that.

I would say that the ARM result is what I'd expect from something that
scales badly to begin with.

The x86-64 result is the expected one: yes, the cmpxchg is done one
extra time, but it results in fewer cache transitions (the cacheline
never goes into "shared" state), and cache transitions are what
matter.

The cost of re-doing the instruction should be low. The cacheline
ping-pong and the cache coherency messages are what hurt.

So I actually think both are very easily explained.

The x86-64 number improves, because there is less cache coherency
traffic.

The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong, and it's probably because there is no
"stickiness" of the cacheline to the core, and thus adding the extra
loop can make the ping-pong issue even worse because now there is more
of it.

The cachelines not sticking at all to a core probably is good for
fairness issues (in particular, sticking *too* much can cause horrible
issues), but it's absolutely horrible if it means that you lose the
cacheline even before you get to complete the second cmpxchg.

                Linus
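
For reference, the two access patterns being compared can be sketched in
portable C11 atomics. This is only an illustration under assumptions, not
the kernel code under discussion: the function names and the counter
encoding (0 = unlocked, positive = number of readers, negative = held by
a writer) are hypothetical. The first variant loads the counter before
the cmpxchg, so the cacheline may be fetched in "shared" state and then
has to be upgraded. The second variant guesses "unlocked" and goes
straight to the cmpxchg, at the cost of one extra cmpxchg when the guess
is wrong.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding, for illustration only:
 * 0 = unlocked, >0 = reader count, <0 = held by a writer. */
#define UNLOCKED 0L

/* Variant A: read the counter first, then cmpxchg.
 * The plain load may pull the cacheline in as "shared" before the
 * cmpxchg asks for exclusive ownership. */
static bool read_trylock_load_first(atomic_long *count)
{
    long old = atomic_load_explicit(count, memory_order_relaxed);

    while (old >= 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_strong_explicit(count, &old, old + 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
            return true;
    }
    return false;
}

/* Variant B: guess that the lock is unlocked and cmpxchg right away.
 * A wrong guess costs one extra cmpxchg, but the first access to the
 * cacheline is already a write-intent access. */
static bool read_trylock_guess_first(atomic_long *count)
{
    long old = UNLOCKED;

    do {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_strong_explicit(count, &old, old + 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
            return true;
    } while (old >= 0);

    return false;
}
```

Both variants return the same results. They differ only in which memory
accesses they issue, which is where the cache coherency traffic comes
from. A single-threaded run cannot show that effect. Seeing it would need
a contended benchmark on the hardware in question.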