On Mon, Jul 24, 2017 at 10:09:32PM +1000, Michael Ellerman wrote:
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> > anyway, and the fact that your LL/SC is horrendously slow in any case.
>
> Boo :/

:-)

> Just kidding. I suspect you're right that we can probably pack a
> reasonable amount of tests in the body of the LL/SC and not notice.
>
> > Also, I still haven't seen an actual benchmark where our cmpxchg loop
> > actually regresses anything, just a lot of yelling about potential
> > regressions :/
>
> Heh yeah. Though I have looked at the code it generates on PPC and it's
> not sleek, though I guess that's not a benchmark is it :)

Oh for sure, GCC still can't sanely convert a cmpxchg loop (esp. if the
cmpxchg is implemented using asm) into a native LL/SC sequence, so the
generic code will end up looking pretty horrendous. A native
implementation of the same semantics should look loads better.

One thing that might help you is that refcount_dec_and_test() is weaker
than atomic_dec_and_test() wrt ordering, so that might help some
(RELEASE vs fully ordered).
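For illustration only, here is a minimal sketch of the two points above, written with C11 <stdatomic.h> rather than the kernel's own atomic primitives. It shows a generic-style checked increment and decrement built as compare-and-swap loops, with the decrement using RELEASE ordering on success instead of a fully ordered RMW. The names sketch_refcount_t, sketch_refcount_inc_not_zero() and sketch_refcount_dec_and_test(), the UINT_MAX saturation value, and the acquire fence on the zero path (the usual C11 idiom before freeing) are all assumptions of this sketch, not the kernel's refcount_t code.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturation value: once reached, the counter sticks. */
#define SKETCH_REFCOUNT_SATURATED UINT_MAX

/* Hypothetical stand-in for a checked reference counter. */
typedef struct {
	atomic_uint refs;
} sketch_refcount_t;

/*
 * Checked increment as a CAS loop: the tests run outside the CAS, so on
 * an LL/SC machine every retry is a fresh load-reserve/store-conditional.
 * Returns false if the count was 0 (object already dead) or saturated.
 */
static bool sketch_refcount_inc_not_zero(sketch_refcount_t *r)
{
	unsigned int old = atomic_load_explicit(&r->refs, memory_order_relaxed);

	do {
		if (old == 0 || old == SKETCH_REFCOUNT_SATURATED)
			return false;
		/* On failure, 'old' is updated with the current value. */
	} while (!atomic_compare_exchange_weak_explicit(&r->refs, &old, old + 1,
							memory_order_relaxed,
							memory_order_relaxed));
	return true;
}

/*
 * Checked decrement: RELEASE on a successful CAS, not a full barrier.
 * Returns true only when this call dropped the last reference.
 */
static bool sketch_refcount_dec_and_test(sketch_refcount_t *r)
{
	unsigned int old = atomic_load_explicit(&r->refs, memory_order_relaxed);
	unsigned int next;

	do {
		if (old == SKETCH_REFCOUNT_SATURATED)
			return false;	/* saturated: leak rather than free */
		if (old == 0)
			return false;	/* underflow: refuse to wrap */
		next = old - 1;
	} while (!atomic_compare_exchange_weak_explicit(&r->refs, &old, next,
							memory_order_release,
							memory_order_relaxed));

	if (next == 0) {
		/* Assumed C11 idiom: order the caller's free after the drop. */
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	return false;
}
```

Both helpers wrap their checks around a whole CAS, so on PPC each attempt compiles to its own lwarx/stwcx. sequence with the tests outside it. A native implementation could instead place the zero and saturation tests between the load-reserve and the store-conditional, which is the "pack tests in the body of the LL/SC" idea from the thread.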