Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

Rik van Riel <riel@xxxxxxxxxx> · Wed, 13 Feb 2013 17:21:56 -0500

On 02/13/2013 02:36 PM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:

The spinlock backoff code prevents these last cases from
experiencing large performance regressions when the hardware
is upgraded.

I still want *numbers*.

There are real cases where backoff does exactly the reverse, and makes
things much much worse. The tuning of the backoff delays are often
*very* hardware sensitive, and upgrading hardware can turn out to do
exactly what you say - but for the backoff, not the regular spinning
code.

What kind of numbers would you like?

Numbers showing that the common case is not affected by this
code?

Or numbers showing that performance of something is improved
with this code?

Of course, the latter would point out a scalability issue,
that may be fixable by changing the data structures and code
involved :)

And we have hardware that actually autodetects some cacheline bouncing
patterns and may actually do a better job than software. It's *hard*
for software to know whether it's bouncing within the L1 cache between
threads, or across fabric in a large machine.

If we have just a few CPUs bouncing the cache line around, we
have no real performance issues and the loop with cpu_relax
seems to be doing fine.

Real issues arise when we have a larger number of CPUs contending
on the same lock. At that point we know we are no longer bouncing
between sibling cores.

As a car analogy, think of this not as an accelerator, but
as an airbag. Spinlock backoff (or other scalable locking
code) exists to keep things from going horribly wrong when
we hit a scalability wall.

Does that make more sense?

Not without tons of numbers from many different platforms, it doesn't.

That makes sense, especially if you are looking to make sure these
patches do not introduce regressions.

And not without explaining which spinlock it is that is so contended
in the first place.

This patch series is not as much for the spinlocks we know about
(we can probably fix those), but to prevent catastrophic
performance degradation when users run into contention on spinlocks
we do NOT know about.

So I claim:

  - it's *really* hard to trigger in real loads on common hardware.

This is definitely true.  However, with many millions of Linux
users out there, there will always be users who encounter a
performance problem, upgrade their hardware, and then run into
an even worse performance problem because they run into a
scalability issue.

The object of these patches is to go from:

"We doubled the number of CPUs, and now things go even slower.", to

"We doubled the number of CPUs, but things aren't going any faster."

The first is a real disaster for Linux users. Especially when the
workload consists of multiple things, some of which run faster on
the upgraded hardware, and others which run slower. What makes it
worse is that these kinds of issues always seem to happen on the
users' most expensive servers, running their most important
workloads...

The second case is something users can temporarily tolerate, while
we fix the underlying issue.

  - if it does trigger in any half-way reasonably common setup
(hardware/software), we most likely should work really hard at fixing
the underlying problem, not the symptoms.

Agreed.

  - we absolutely should *not* pessimize the common case for this

Agreed. Are there any preferred benchmarks you would absolutely like
to see, to show that these patches do not introduce regressions?

Hurting the 99.99% even a tiny amount should be something we should
never ever do. This is why I think the fast case is so important (and
I had another email about possibly making it acceptable), but I think
the *slow* case should be looked at a lot too. Because "back-off" is
absolutely *not* necessarily hugely better than plain spinning, and it
needs numbers. How many times do you spin before even looking at
back-off? How much do you back off?

Whether or not we go into back-off at all depends on a number
of factors, including how many waiters there are ahead of us
waiting for the same ticket spinlock, and whether we have spun
a lot of time on this lock in the past.

Btw, the "notice busy loops and turn it into mwait" is not some
theoretical magic thing. And it's exactly the kind of thing that
back-off *breaks* by making the loop too complex for hardware to
understand. Even just adding counters with conditionals that are *not*
about just he value loaded from memory suddently means that hardware
has a lot harder time doing things like that.

I don't know if you perhaps had some future plans of looking at using
mwait in the backoff code itself,

I looked at mwait, but the documentation suggests it only ever
works at cache line granularity (or larger). That means every
time another CPU adds itself to the queue for the ticket lock,
or every time the lock holder writes to something else in the
cache line the lock is in, mwait'ing cpus get woken up.

It looks like mwait is designed for the case of a lock being
on its own cache line, with nothing else. However, Linux does
not seem to have many of those locks left, and the contention
issues I see the most seem to involve locks embedded in data
structures, sharing the cache line with data that the lock
holder modifies.

In fact, it looks like going to mwait has the potential to
increase, rather than decrease, memory traffic.

but the patch I did see looked like
it might be absolutely horrible. How long does a "cpu_relax()" wait?
Do you know? How does "cpu_relax()" interface with the rest of the
CPU? Do you know? Because I've heard noises about cpu_relax() actually
affecting the memory pipe behavior of cache accesses of the CPU, and
thus the "cpu_relax()" in a busy loop that does *not* look at the
value (your "backoff delay loop") may actually work very very
differently from the cpu_relax() in the actual "wait for the value to
change" loop.

And how does that account for two different microarchitectures doing
totally different things? Maybe one uarch makes cpu_relax() just shut
down the front-end for a while, while another does something much
fancier and gives hints to in-flight memory accesses etc?

You summed up all the reasons for doing auto-tuning, aggressively
collapsing the back-off if we overslept, and starting out from 1
if we do not find a cached value.

When do you start doing mwait vs just busy-looping with cpu_relax? How
do you tune it to do the right thing for different architectures?

Mwait seems like the wrong thing to do, because the spinlocks where
we tend to see contention, tend to be the ones embedded in data
structures, with other things sharing the same cache line.

So I think this is complex. At many different levels. And it's *all*
about the details. No handwaving about how "back-off is like a air
bag". Because the big picture is entirely and totally irrelevant, when
the details are almost all that actually matter.

The details can vary a lot from CPU to CPU. I would argue that we
need code that does not break down, regardless of those details.

I agree that the code could use more regression testing, and will
get you test results. Please let me know if there are any particular
scenarios you would like to see tested.

That will also let us determine whether or not the call overhead (in
case we are waiting for a contended lock anyway) is an issue that
needs fixing.

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-tip-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html