On 08/09/2021 15.58, Vlastimil Babka wrote:
On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
On 08/09/2021 04.54, Andrew Morton wrote:
From: Vlastimil Babka <vbabka@xxxxxxx>
Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
Jann Horn reported [1] the following theoretically possible race:
task A: put_cpu_partial() calls preempt_disable()
task A: oldpage = this_cpu_read(s->cpu_slab->partial)
interrupt: kfree() reaches unfreeze_partials() and discards the page
task B (on another CPU): reallocates page as page cache
task A: reads page->pages and page->pobjects, which are actually
halves of the pointer page->lru.prev
task B (on another CPU): frees page
interrupt: allocates page as SLUB page and places it on the percpu partial list
task A: this_cpu_cmpxchg() succeeds
which would cause page->pages and page->pobjects to end up containing
halves of pointers that would then influence when put_cpu_partial()
happens and show up in root-only sysfs files. Maybe that's acceptable,
I don't know. But there should probably at least be a comment for now
to point out that we're reading union fields of a page that might be
in a completely different state.
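For context, the aliasing Jann describes comes from struct page being a union of
per-user layouts; at the time, the SLUB percpu-partial fields overlapped the
page-cache list head roughly as sketched below (a simplified illustration of the
64-bit layout, not the full mm_types.h definition):

/* Simplified illustration, not the actual mm_types.h layout. */
struct page {
	unsigned long flags;
	union {
		struct {			/* page cache / anon pages */
			struct list_head lru;	/* lru.next, lru.prev */
			/* ... */
		};
		struct {			/* SLUB percpu partial list */
			struct page *next;	/* overlaps lru.next */
			int pages;		/* these two ints together    */
			int pobjects;		/* overlap the lru.prev word  */
		};
		/* ... other users of the same words ... */
	};
	/* ... */
};

So when the page has been reused as page cache, reading page->pages and
page->pobjects really reads the two halves of the lru.prev pointer.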
Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
latter disables irqs, otherwise a __slab_free() in an irq handler could
call put_cpu_partial() in the middle of ___slab_alloc() manipulating
->partial and corrupt it. This becomes an issue on RT after a local_lock
is introduced in a later patch. The fix would mean taking the local_lock in
put_cpu_partial() on RT as well.
After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect
put_cpu_partial() with disabled irqs (to be converted to
local_lock_irqsave() later) everywhere. This should be acceptable as it's
not a fast path, and moving the actual partial unfreezing outside of the
irq-disabled section makes it short, and with the retry loop gone the code
can also be simplified. In addition, the race reported by Jann should no
longer be possible.
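For illustration, the reworked flow described above looks roughly like this
(a simplified sketch based on the description, not a verbatim copy of the
mm/slub.c patch; the CONFIG_SLUB_CPU_PARTIAL guard and statistics are omitted
and helper names are approximated):

/* Simplified sketch of the reworked put_cpu_partial(), based on the
 * description above; not a verbatim copy of the patch. */
static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
	struct page *oldpage;
	struct page *page_to_unfreeze = NULL;
	unsigned long flags;
	int pages = 0, pobjects = 0;

	local_irq_save(flags);		/* later converted to local_lock_irqsave() */

	oldpage = this_cpu_read(s->cpu_slab->partial);
	if (oldpage) {
		if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
			/* Partial array is full: detach it now, but do the
			 * actual unfreezing outside the irq-off section. */
			page_to_unfreeze = oldpage;
			oldpage = NULL;
		} else {
			pobjects = oldpage->pobjects;
			pages = oldpage->pages;
		}
	}

	pages++;
	pobjects += page->objects - page->inuse;
	page->pages = pages;
	page->pobjects = pobjects;
	page->next = oldpage;

	/* A plain write replaces the this_cpu_cmpxchg() retry loop: with irqs
	 * disabled, __slab_free() from an interrupt cannot race with us. */
	this_cpu_write(s->cpu_slab->partial, page);

	local_irq_restore(flags);

	if (page_to_unfreeze)
		__unfreeze_partials(s, page_to_unfreeze);	/* now outside irq-off */
}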
Based on my microbench[0] measurements, changing preempt_disable to
local_irq_save will cost us an extra 11 cycles (TSC). I'm not against the
change; I just want people to keep this in mind.
OK, but this is not a fast path for every allocation/free, so it gets
amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
cmpxchg to be expensive too?
Added tests for this:
- this_cpu_cmpxchg cost: 5 cycles(tsc) 1.581 ns
- cmpxchg cost: 18 cycles(tsc) 5.006 ns
On my E5-1650 v4 @ 3.60GHz:
- preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns
- local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
Notice the non-save/restore variant is superfast:
- local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
It actually surprises me that it's that cheap; I would have expected
changing the irq state to be the costly part, not the saving/restoring.
Incidentally, would you know what the cost of save+restore is when
irqs are already disabled, so it's effectively a no-op?
The non-save variant simply translates into the CLI and STI instructions,
which seem to be very fast.
The cost of save+restore when irqs are already disabled is the same
(I did a quick test).
I cannot remember who told me, but (apparently) the expensive part is
reading the CPU FLAGS register.
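That matches what the primitives boil down to on x86-64: local_irq_save() has
to read RFLAGS before disabling interrupts, while local_irq_disable()/enable()
are a bare CLI/STI. A rough sketch (illustrative helper names; the kernel goes
through paravirt-aware wrappers such as native_save_fl()):

/* Illustrative helpers, roughly what the x86-64 primitives expand to. */
static inline unsigned long flags_save_and_cli(void)	/* ~ local_irq_save() */
{
	unsigned long flags;

	/* The RFLAGS read is the expensive part */
	asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
	asm volatile("cli" : : : "memory");
	return flags;
}

static inline void cli_only(void)	/* ~ local_irq_disable(): no RFLAGS read */
{
	asm volatile("cli" : : : "memory");
}

static inline void sti_only(void)	/* ~ local_irq_enable() */
{
	asm volatile("sti" : : : "memory");
}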
I did a quick test with:
/** Loop to measure **/
for (i = 0; i < rec->loops; i++) {
	local_irq_save(flags);
	loops_cnt++;
	barrier();
	//local_irq_restore(flags);
	local_irq_enable();	/* enable unconditionally instead of restoring */
}
Doing a save + enable costs 21 cycles(tsc) 6.015 ns
(the save + restore cost before was 22 cycles).
This confirms that reading the CPU FLAGS register seems to be the expensive part.
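For completeness, a self-contained version of that quick test could look like
the module-init sketch below (hypothetical code, not the actual time_bench[0]
module; it assumes x86, where get_cycles() reads the TSC):

/* Hypothetical, self-contained variant of the loop above; not time_bench. */
#include <linux/module.h>
#include <linux/irqflags.h>
#include <linux/timex.h>	/* get_cycles(), TSC on x86 */

static int __init irqcost_init(void)
{
	unsigned long flags;
	unsigned long i, loops = 10000000;
	volatile unsigned long cnt = 0;
	unsigned long long start, stop;

	start = get_cycles();
	for (i = 0; i < loops; i++) {
		local_irq_save(flags);
		cnt++;
		barrier();
		local_irq_restore(flags); /* swap for local_irq_enable() to test save+enable */
	}
	stop = get_cycles();

	pr_info("local_irq_save+restore: ~%llu cycles per op\n",
		(stop - start) / loops);
	return 0;
}

static void __exit irqcost_exit(void) { }

module_init(irqcost_init);
module_exit(irqcost_exit);
MODULE_LICENSE("GPL");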
--Jesper