On 1/12/21 12:12 AM, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]
>
> Hi!

Hi, thanks for saving me a lot of typing!

...

> This means that in practice, SLUB actually ends up keeping as many
> **pages** on the percpu partial lists as it intends to keep **free
> objects** there.

Yes, I concluded the same thing.

...

> I suspect that this may have also contributed to the memory wastage
> problem with memory cgroups that was fixed in v5.9
> (https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@xxxxxx/);
> meaning that servers with lots of CPU cores running pre-5.9 kernels
> with memcg and systemd (which tends to stick every service into its
> own memcg) might be even worse off.

Very much yes. Investigating an increase in kmemcg usage of a workload
between an older kernel with SLAB and a 5.3-based kernel with SLUB led us
to find the same issue as you did. It doesn't help that slabinfo (global
or per-memcg) is also inaccurate, as it cannot count free objects on
per-cpu partial slabs and thus reports them as active. I was aware that
some empty slab pages might linger on per-cpu lists, but only seeing how
many were freed after "echo 1 > .../shrink" made me realize the extent of
this.

> It also seems unsurprising to me that flushing ~30 pages out of the
> percpu partial caches at once with IRQs disabled would cause tail
> latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
> commit 345c905d13a4e "slub: Make cpu partial slab support
> configurable").
>
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So in userland processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.

Yeah, we considered wiring the shrinking to memcg OOM, but it's a poor
solution. I'm considering introducing a proper shrinker that would be
registered and work like other shrinkers for reclaimable caches. Then we
would make it memcg-aware in our backport - upstream after v5.9 obviously
doesn't need that.

> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on

As long as there are some in-use objects, it shouldn't matter much whether
the slab is sitting on a per-cpu partial list or the per-node list, as it
can't be freed anyway. It becomes a real problem only once the slab
becomes fully free. If we detected that in __slab_free() also for
already-frozen slabs, we would need to know which CPU the slab belongs to
(currently that's not tracked afaik), and send it an IPI to do some light
version of unfreeze_partials() that would only remove empty slabs. The
trick would be not to cause too many IPIs this way, obviously :/

Actually I'm somewhat wrong above. If a CPU's partial list and the
per-node partial list run out of free objects, it's also wasteful to
allocate new slabs while almost-empty slabs sit on another CPU's percpu
partial list.

> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are

That would probably be the simplest solution. Maybe sufficient upstream,
where the wastage only depends on the number of caches and not memcgs.
For pre-5.9 I also considered limiting the number of pages only for the
per-memcg clones :/ Currently, writing to the /sys/.../<cache>/cpu_partial
file is propagated to all the clones and the root cache.
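Just to illustrate, a very rough sketch (against no particular tree; the
per-cache cpu_partial_pages limit is made up, and the IRQ/preemption
handling and stats are omitted) of put_cpu_partial() basing the flush
decision on an explicit page count instead of the pobjects estimate:

static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
	struct page *oldpage;
	int pages;

	do {
		pages = 0;
		oldpage = this_cpu_read(s->cpu_slab->partial);

		if (oldpage) {
			pages = oldpage->pages;
			/* made-up per-cache limit in pages, not objects */
			if (drain && pages >= s->cpu_partial_pages) {
				/*
				 * The percpu partial list is full: move it
				 * to the node partial list and start a
				 * fresh one.
				 */
				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
				oldpage = NULL;
				pages = 0;
			}
		}

		pages++;
		/* the head page caches the length of the list behind it */
		page->pages = pages;
		page->next = oldpage;

	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
							!= oldpage);
}

Same structure as today, just with the limit applied to something we can
count exactly.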
> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

That would be best done by a shrinker? BTW, SLAB does this by reaping its
per-cpu and shared arrays from timers (which works, but is not ideal).
They also can't grow as large as this.
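Roughly what I have in mind - a sketch only, not tested; it would have to
live in mm/slub.c (or mm/slab_common.c) so it can walk slab_caches under
slab_mutex, and the count_objects() side is the hard part because we have
no cheap accurate counter of objects cached on percpu partial lists
(that's the whole problem), so the counter below is made up:

#include <linux/shrinker.h>

/* purely hypothetical counter of objects cached on percpu partial lists */
static atomic_long_t approx_partial_objects;

static unsigned long slub_partial_count(struct shrinker *shrink,
					struct shrink_control *sc)
{
	return atomic_long_read(&approx_partial_objects);
}

static unsigned long slub_partial_scan(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	struct kmem_cache *s;
	unsigned long freed;

	if (!mutex_trylock(&slab_mutex))
		return SHRINK_STOP;

	list_for_each_entry(s, &slab_caches, list) {
		/*
		 * kmem_cache_shrink() unfreezes the percpu partial lists
		 * and discards slabs that became empty; a real version
		 * would want something lighter than a full shrink.
		 */
		kmem_cache_shrink(s);
	}
	mutex_unlock(&slab_mutex);

	/* pretend we freed what the hypothetical counter reported */
	freed = atomic_long_xchg(&approx_partial_objects, 0);
	return freed ? freed : SHRINK_STOP;
}

static struct shrinker slub_partial_shrinker = {
	.count_objects = slub_partial_count,
	.scan_objects  = slub_partial_scan,
	.seeks         = DEFAULT_SEEKS,
};

/* registered once during slab init: register_shrinker(&slub_partial_shrinker); */

A real version would also respect sc->nr_to_scan instead of walking every
cache on each call.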