Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

Jann Horn <jannh@xxxxxxxxxx> · Wed, 13 Jan 2021 23:37:34 +0100

On Wed, Jan 13, 2021 at 8:14 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> On 1/12/21 12:12 AM, Jann Horn wrote:
> It doesn't help that slabinfo (global or per-memcg) is also
> inaccurate as it cannot count free objects on per-cpu partial slabs and thus
> reports them as active.

Maybe SLUB could be taught to track how many objects are in the percpu
machinery, and then print that number separately so that you can at
least know how much data you're missing without having to collect data
with IPIs...

> > It might be a good idea to figure out whether it is possible to
> > efficiently keep track of a more accurate count of the free objects on
>
> As long as there are some inuse objects, it shouldn't matter much if the slab is
> sitting on per-cpu partial list or per-node list, as it can't be freed anyway.
> It becomes a real problem only after the slab becomes fully free. If we detected
> that in __slab_free() also for already-frozen slabs, we would need to know which
> CPU this slab belongs to (currently that's not tracked afaik),

Yeah, but at least on 64-bit systems we still have 32 completely
unused bits in the counters field that's updated via cmpxchg_double on
struct page. (On 32-bit systems the bitfields are also wider than they
strictly need to be, I think, at least if the system has 4K page
size.) So at least on 64-bit systems, we could squeeze a CPU number in
there, and then you'd know to which CPU the page belonged at the time
the object was freed.
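
To make the bit-squeezing concrete, here is a rough userspace sketch, not
kernel code. It assumes the low 32 bits hold inuse:16, objects:15 and
frozen:1 and that the upper 32 bits are free. The explicit shifts, the
counters_* helper names and the "cpu + 1, 0 means no owner" encoding are
all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout of a 64-bit counters word (hypothetical names). */
#define CNT_INUSE_SHIFT    0   /* 16 bits: objects currently in use */
#define CNT_OBJECTS_SHIFT  16  /* 15 bits: total objects in the slab */
#define CNT_FROZEN_SHIFT   31  /* 1 bit: slab is owned by some CPU */
#define CNT_CPU_SHIFT      32  /* upper 32 bits, assumed to be unused */
#define CNT_NO_CPU         (-1)

/* Pack the fields; the CPU is stored as cpu + 1 so that 0 means "none". */
static uint64_t counters_pack(uint32_t inuse, uint32_t objects,
                              int frozen, int cpu)
{
    assert(inuse <= 0xffff && objects <= 0x7fff);
    assert(cpu >= CNT_NO_CPU && cpu < 0x7fffffff);
    return ((uint64_t)inuse << CNT_INUSE_SHIFT) |
           ((uint64_t)objects << CNT_OBJECTS_SHIFT) |
           ((uint64_t)(frozen != 0) << CNT_FROZEN_SHIFT) |
           ((uint64_t)(uint32_t)(cpu + 1) << CNT_CPU_SHIFT);
}

/* Return the owning CPU, or CNT_NO_CPU if none was recorded. */
static int counters_cpu(uint64_t counters)
{
    return (int)(counters >> CNT_CPU_SHIFT) - 1;
}

static unsigned int counters_inuse(uint64_t counters)
{
    return (unsigned int)(counters & 0xffff);
}

static int counters_frozen(uint64_t counters)
{
    return (int)((counters >> CNT_FROZEN_SHIFT) & 1);
}
```

Since the CPU number lives in the same word as inuse and frozen, whoever
wins the cmpxchg gets a consistent snapshot of all three.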

> and send it an
> IPI to do some light version of unfreeze_partials() that would only remove empty
> slabs. The trick would be not to cause too many IPIs by this, obviously :/

Some brainstorming:

Maybe you could have an atomic counter in kmem_cache_cpu that tracks
the number of empty frozen pages that are associated with a specific
CPU? So the freeing slowpath would do its cmpxchg_double, and if the
new state after a successful cmpxchg_double is
"inuse == 0 && frozen == 1" with a valid CPU number, you afterwards do
"atomic_long_inc(&per_cpu_ptr(cache->cpu_slab, cpu)->empty_partial_pages)".
I think it should be possible to
implement that such that the empty_partial_pages count, while not
immediately completely accurate, would be eventually consistent; and
readers on the CPU owning the kmem_cache_cpu should never see a number
that is too large, only one that is too small.
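
A minimal userspace model of that slowpath, using C11 atomics instead of
cmpxchg_double and kernel percpu data. The struct names, slab_free_one()
and the bit layout (inuse in the low 16 bits, frozen in bit 31, cpu + 1 in
the upper 32 bits) are assumptions for illustration only, not what SLUB
does today.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4                    /* made-up CPU count */
#define INUSE_MASK     UINT64_C(0xffff)     /* assumed: inuse in bits 0..15 */
#define FROZEN_BIT     (UINT64_C(1) << 31)  /* assumed: frozen in bit 31 */
#define CPU_SHIFT      32                   /* assumed: cpu + 1 in bits 32..63 */

struct slab_sketch {
    _Atomic uint64_t counters;
};

struct cpu_slab_sketch {
    atomic_long empty_partial_pages;        /* the proposed new counter */
};

static struct cpu_slab_sketch cpu_slabs[NR_CPUS_SKETCH];

/*
 * Free one object. Returns true if this free made a frozen slab empty and
 * the owning CPU's counter was bumped.
 */
static bool slab_free_one(struct slab_sketch *slab)
{
    uint64_t old = atomic_load(&slab->counters);
    uint64_t next;

    do {
        assert((old & INUSE_MASK) > 0);
        next = old - 1;                     /* inuse sits in the low bits */
    } while (!atomic_compare_exchange_weak(&slab->counters, &old, next));

    uint64_t cpu_plus_one = next >> CPU_SHIFT;

    if ((next & INUSE_MASK) == 0 && (next & FROZEN_BIT) && cpu_plus_one != 0) {
        assert(cpu_plus_one <= NR_CPUS_SKETCH);
        /*
         * The increment happens only after the cmpxchg succeeded, so the
         * counter can lag behind the real number of empty slabs.
         */
        atomic_fetch_add(&cpu_slabs[cpu_plus_one - 1].empty_partial_pages, 1);
        return true;
    }
    return false;
}
```

The model leaves out the other half: the owning CPU would have to
decrement the counter whenever it reuses or unfreezes such a page, and the
"never too large" property depends on how that is ordered.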

You could additionally have a plain percpu counter, not tied to the
kmem_cache, and increment it by 1 << page_order; that would then track
the amount of memory you could reclaim by sending an IPI to a given
CPU core. Comparing that counter against some threshold could then help
decide whether it's worth sending IPIs from SLUB and/or the shrinker?
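
As a sketch of that last idea (again a userspace model, with hypothetical
names and a made-up threshold): one counter per CPU, in units of pages,
plus a check the IPI sender could consult. It uses an atomic because the
CPU doing the free may not be the CPU that owns the slab; whether a real
implementation would need that is an open question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH               4     /* made-up CPU count */
#define RECLAIM_IPI_THRESHOLD_PAGES  64UL  /* made-up tuning value */

/* Pages sitting in empty frozen slabs, per owning CPU, across all caches. */
static atomic_ulong reclaimable_pages[NR_CPUS_SKETCH];

/* Called when a slab of the given order, owned by cpu, becomes empty. */
static void note_empty_slab(int cpu, unsigned int page_order)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    atomic_fetch_add(&reclaimable_pages[cpu], 1UL << page_order);
}

/* Is there enough reclaimable memory on cpu to justify an IPI? */
static bool worth_sending_ipi(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    return atomic_load(&reclaimable_pages[cpu]) >= RECLAIM_IPI_THRESHOLD_PAGES;
}
```

With a threshold of 64 pages, two empty order-5 slabs on one CPU would be
enough to trigger an IPI, while a single one would not.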