On Wed, Jan 13, 2021 at 8:14 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> On 1/12/21 12:12 AM, Jann Horn wrote:
> It doesn't help that slabinfo (global or per-memcg) is also
> inaccurate as it cannot count free objects on per-cpu partial slabs and thus
> reports them as active.

Maybe SLUB could be taught to track how many objects are in the percpu
machinery, and then print that number separately so that you can at
least know how much data you're missing without having to collect data
with IPIs...

> > It might be a good idea to figure out whether it is possible to
> > efficiently keep track of a more accurate count of the free objects on
>
> As long as there are some inuse objects, it shouldn't matter much if the slab is
> sitting on per-cpu partial list or per-node list, as it can't be freed anyway.
> It becomes a real problem only after the slab become fully free. If we detected
> that in __slab_free() also for already-frozen slabs, we would need to know which
> CPU this slab belongs to (currently that's not tracked afaik),

Yeah, but at least on 64-bit systems we still have 32 completely
unused bits in the counter field that's updated via cmpxchg_double on
struct page. (On 32-bit systems the bitfields are also wider than they
strictly need to be, I think, at least if the system has 4K page
size.) So at least on 64-bit systems, we could squeeze a CPU number in
there, and then you'd know to which CPU the page belonged at the time
the object was freed.

> and send it an
> IPI to do some light version of unfreeze_partials() that would only remove empty
> slabs. The trick would be not to cause too many IPI's by this, obviously :/

Some brainstorming: Maybe you could have an atomic counter in
kmem_cache_cpu that tracks the number of empty frozen pages that are
associated with a specific CPU?
So the freeing slowpath would do its cmpxchg_double, and if the new
state after a successful cmpxchg_double is "inuse==0 && frozen == 1"
with a valid CPU number, you afterwards do
"atomic_long_inc(&per_cpu_ptr(cache->cpu_slab, cpu)->empty_partial_pages)".
I think it should be possible to implement that such that the
empty_partial_pages count, while not immediately completely accurate,
would be eventually consistent; and readers on the CPU owning the
kmem_cache_cpu should never see a number that is too large, only one
that is too small.

You could additionally have a plain percpu counter, not tied to the
kmem_cache, and increment it by 1<<page_order - then that would track
the amount of memory you could reclaim by sending an IPI to a given
CPU core. Then that threshold could help decide whether it's worth
sending IPIs from SLUB and/or the shrinker?
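
To make the idea a bit more concrete, here is a rough userspace model
in C11, not kernel code and not a patch. Everything in it is made up
for illustration: the names (model_page, model_cpu_slab,
reclaimable_pages, worth_sending_ipi, CPU_NONE), the choice to put the
CPU number in the upper 32 bits, and the use of a single 64-bit
compare-exchange as a stand-in for the cmpxchg_double on
freelist+counters. The low 32 bits are assumed to follow the existing
inuse:16 / objects:15 / frozen:1 layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_MODEL 4          /* hypothetical, model only */
#define CPU_NONE      UINT32_MAX /* hypothetical "no owning CPU" marker */

/*
 * Model of the counters word: bits 0-15 inuse, bits 16-30 objects,
 * bit 31 frozen, bits 32-63 owning CPU (the assumed new field).
 */
static inline uint64_t pack(uint32_t inuse, uint32_t objects,
                            bool frozen, uint32_t cpu)
{
    return (uint64_t)(inuse & 0xffff) |
           ((uint64_t)(objects & 0x7fff) << 16) |
           ((uint64_t)frozen << 31) |
           ((uint64_t)cpu << 32);
}

static inline uint32_t c_inuse(uint64_t c)  { return c & 0xffff; }
static inline bool     c_frozen(uint64_t c) { return (c >> 31) & 1; }
static inline uint32_t c_cpu(uint64_t c)    { return (uint32_t)(c >> 32); }

struct model_page {
    _Atomic uint64_t counters;
    unsigned int order;          /* page order of the slab */
};

struct model_cpu_slab {
    atomic_long empty_partial_pages; /* per-cache, per-CPU */
};

static struct model_cpu_slab cpu_slab[NR_CPUS_MODEL];
/* Plain per-CPU counter, not tied to any cache, in units of pages. */
static atomic_long reclaimable_pages[NR_CPUS_MODEL];

/* Free one object from the page; the caller guarantees inuse > 0. */
static void model_slab_free(struct model_page *page)
{
    uint64_t old = atomic_load(&page->counters);
    uint64_t new;

    do {
        /* inuse lives in the low bits, so with inuse > 0 this
         * cannot borrow from the neighbouring fields. */
        new = old - 1;
    } while (!atomic_compare_exchange_weak(&page->counters, &old, new));

    /* Only after a successful exchange: did the page just become an
     * empty frozen page with a known owner? */
    if (c_inuse(new) == 0 && c_frozen(new) && c_cpu(new) != CPU_NONE) {
        uint32_t cpu = c_cpu(new);

        atomic_fetch_add(&cpu_slab[cpu].empty_partial_pages, 1);
        atomic_fetch_add(&reclaimable_pages[cpu], 1L << page->order);
    }
}

/* Hypothetical policy hook: is an IPI to this CPU worth it? */
static bool worth_sending_ipi(uint32_t cpu, long threshold_pages)
{
    return atomic_load(&reclaimable_pages[cpu]) >= threshold_pages;
}
```

Because the counters are bumped only after the exchange has succeeded,
a reader can briefly see a value that is too small but, in this model,
never one that is too large. The model deliberately leaves out the
other half (decrementing when the owning CPU reuses or unfreezes the
page), which is where the real ordering questions would be.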