Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

On 08.09.21 19:14, Jesper Dangaard Brouer wrote:
On 08/09/2021 16.59, David Hildenbrand wrote:
On 08.09.21 16:55, David Hildenbrand wrote:
On 08.09.21 15:58, Vlastimil Babka wrote:
On 9/8/21 15:05, Jesper Dangaard Brouer wrote:


On 08/09/2021 04.54, Andrew Morton wrote:
From: Vlastimil Babka <vbabka@xxxxxxx>
Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

Jann Horn reported [1] the following theoretically possible race:

      task A: put_cpu_partial() calls preempt_disable()
      task A: oldpage = this_cpu_read(s->cpu_slab->partial)
      interrupt: kfree() reaches unfreeze_partials() and discards the page
      task B (on another CPU): reallocates page as page cache
      task A: reads page->pages and page->pobjects, which are actually
      halves of the pointer page->lru.prev
      task B (on another CPU): frees page
      interrupt: allocates page as SLUB page and places it on the percpu
      partial list
      task A: this_cpu_cmpxchg() succeeds

      which would cause page->pages and page->pobjects to end up containing
      halves of pointers that would then influence when put_cpu_partial()
      happens and show up in root-only sysfs files. Maybe that's acceptable,
      I don't know. But there should probably at least be a comment for now
      to point out that we're reading union fields of a page that might be
      in a completely different state.

Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
latter disables irqs, otherwise a __slab_free() in an irq handler could
call put_cpu_partial() in the middle of ___slab_alloc() manipulating
->partial and corrupt it.  This becomes an issue on RT after a local_lock
is introduced in a later patch.  The fix means taking the local_lock also
in put_cpu_partial() on RT.

After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect
put_cpu_partial() with disabled irqs (to be converted to
local_lock_irqsave() later) everywhere.  This should be acceptable as it's
not a fast path, and moving the actual partial unfreezing outside of the
irq disabled section makes it short, and with the retry loop gone the code
can be also simplified.  In addition, the race reported by Jann should no
longer be possible.

Based on my microbench[0] measurements, changing preempt_disable to
local_irq_save will cost us 11 extra cycles (TSC). I'm not against the
change; I just want people to keep this in mind.

OK, but this is not a fast path for every allocation/free, so it gets
amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
cmpxchg to be expensive too?

On my E5-1650 v4 @ 3.60GHz:
     - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
     - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns

Notice the non-save/restore variant is superfast:
     - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns

It actually surprises me that it's that cheap; I would have expected
changing the irq state to be the costly part, not the saving/restoring.
Incidentally, would you know the cost of save+restore when irqs are
already disabled, so it's effectively a no-op?

It surprises me as well. That would imply that protecting short RCU
sections using

local_irq_disable
local_irq_enable

instead of via

preempt_disable
preempt_enable

would actually be very beneficial.
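For concreteness, the two patterns being compared look like this as a kernel-side sketch (not runnable user-space code; note this relies on the assumption of a !PREEMPT_RCU configuration, where either pair delimits an RCU read-side section — it is not a general equivalence):

```c
/* Sketch only. Cycle counts are Jesper's E5-1650 v4 numbers above. */

preempt_disable();          /* pair costs ~11 cycles(tsc) */
p = rcu_dereference(gp);
use(p);
preempt_enable();

local_irq_disable();        /* pair costs ~6 cycles(tsc) */
p = rcu_dereference(gp);
use(p);
local_irq_enable();
```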

Please don't draw this as a general conclusion.
As Linus described in detail, the cost of IRQ disable/enable is very
micro-arch specific.

Sure: but especially for modern micro-archs, this might be very relevant.

I actually stumbled over this exact question a month ago; that's why your comment caught my attention. I looked for CLI/STI cycle numbers and didn't really find a trusted source. I only found [1], which made it look like incrementing/decrementing some counter would actually be much faster most of the time.

[1] https://www.agner.org/optimize/instruction_tables.pdf


The preempt_disable/enable pair will likely be more stable/consistent
across micro-archs.
Keep an eye out for kernel config options when judging
preempt_disable/enable performance [1]

[1]
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367



Are the numbers trustworthy? :)


.. and especially did the benchmark consider side effects of
enabling/disabling interrupts (pipeline flushes etc ..)?


Of course not, this is a microbenchmark... they are by definition not
trustworthy :-P

:)

--
Thanks,

David / dhildenb



