Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

On Wed, Sep 8, 2021 at 9:11 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
> The non-save variant simply translates into CLI and STI, which seems
> to be very fast.

It will depend on the microarchitecture.

Happily:

> The cost of save+restore when the irqs are already disabled is the same
> (did a quick test).

The really expensive part used to be P4. 'popf' was hundreds of cycles
if any of the non-arithmetic bits changed, iirc.

P4 used to be a big headache just because of things like that -
straightforward code ran very well, but anything a bit more special
took forever because it flushed the pipeline.

So some of our optimizations may be historic because of things like
that. We don't really need to worry about the P4 glass jaws any more,
but it *used* to be much quicker to do 'preempt_disable()' that just
does an add to a memory location than it was to disable interrupts.
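
For what it's worth, here is a minimal sketch of the two styles being
compared, in kernel-style C. The per-CPU counter and function names are
made up for illustration, and the x86 expansions noted in the comments
are rough, from memory, not taken from any particular build:

#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(int, demo_counter);

/*
 * Variant 1: preempt_disable() is roughly just an increment of the
 * per-CPU preempt count; no interrupt state is touched.
 */
static void demo_preempt(void)
{
	preempt_disable();		/* ~ addl $1, %gs:__preempt_count */
	__this_cpu_inc(demo_counter);
	preempt_enable();		/* ~ decl + need_resched check */
}

/*
 * Variant 2: save and disable interrupts around the access, then
 * restore the saved flags.
 */
static void demo_irq(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* ~ pushf; pop; cli */
	__this_cpu_inc(demo_counter);
	local_irq_restore(flags);	/* ~ push; popf */
}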

> Cannot remember who told me, but (apparently) the expensive part is
> reading the CPU FLAGS.

Again, it ends up being very dependent on the uarch.

Reading and writing the flags register is somewhat expensive because
it's not really "one" register in hardware any more (even if that was
obviously the historical implementation).

These days, the arithmetic flags are generally multiple renamed
registers, and then the other flags are a separate system register
(possibly multiple bits spread out).

The cost of doing those flag reads and writes is hard to really
specify, because in an OoO architecture a lot of it ends up being "how
much of that can be done in parallel, and what's the pipeline
serialization cost". Doing a loop with rdtsc is not necessarily AT ALL
indicative of the cost when there is other real code around it.

The cost _could_ be much smaller, if there is little serialization
with the other normal code. Or, it could be much bigger than what an
rdtsc loop shows, because if it's a hard pipeline flush, then a tight
loop with those things won't have any real work to flush, while in
"real code" there may be hundreds of instructions in flight and doing
the flush is very expensive.
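
To make that concrete, a naive "loop with rdtsc" measurement would look
roughly like the sketch below (a hypothetical test module - the module
name and iteration count are made up, and it has to run in the kernel
because cli/popf need kernel privilege). Whatever number it prints
describes only this tight loop, not the serialization cost when the
same sequence sits among hundreds of in-flight instructions in real
code:

#include <linux/module.h>
#include <linux/irqflags.h>
#include <asm/msr.h>		/* rdtsc() */

static int __init flagbench_init(void)
{
	unsigned long flags;
	u64 start, end;
	int i;

	start = rdtsc();
	for (i = 0; i < 1000000; i++) {
		local_irq_save(flags);		/* pushf; pop; cli */
		local_irq_restore(flags);	/* push; popf */
	}
	end = rdtsc();

	pr_info("flagbench: %llu cycles for 1M save/restore pairs\n",
		end - start);
	return 0;
}

static void __exit flagbench_exit(void)
{
}

module_init(flagbench_init);
module_exit(flagbench_exit);
MODULE_LICENSE("GPL");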

The good news is that afaik, all the modern x86 CPU microarchitectures
do reasonably well. And while a "pushf/cli/popf" sequence is probably
more cycles than an add/subtract one in a benchmark, if the preempt
counter is not otherwise needed, and is cold in the cache, then the
pushf/cli/popf may be *much* cheaper than a cache miss.

So the only way to really tell would be to run real benchmarks of real
loads on multiple different microarchitectures.

I'm pretty sure the actual result is: "you can't measure the 10-cycle
difference on any modern core because it can actually go either way".

But "I'm pretty sure" and "reality" are not the same thing.

These days, pipeline flushes and cache misses (and, as a particularly
bad case, cache line ping-pong issues) are almost the only things that
matter.

And the most common reason by far for the pipeline flushes is branch
mispredicts, but see above: the system bits in the flags register
_have_ been a cause of them in the past, so it's not entirely
impossible.

               Linus


