On Thu, Jan 21, 2021 at 06:21:54PM +0100, Vlastimil Babka wrote:
> For performance reasons, SLUB doesn't keep all slabs on shared lists and
> doesn't always free slabs immediately after all objects are freed. Namely:
>
> - for each cache and cpu, there might be a "CPU slab" page, partially or fully
>   free
> - with SLUB_CPU_PARTIAL enabled (default y), there might be a number of "percpu
>   partial slabs" for each cache and cpu, also partially or fully free
> - for each cache and numa node, there are slabs on the per-node partial list,
>   up to 10 of which may be empty
>
> As Jann reports [1], the number of percpu partial slabs should be limited by
> the number of free objects (up to 30), but due to imprecise accounting, this
> can deteriorate so that there are up to 30 free slabs. He notes:
>
> > Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> > see something like 1.5MiB of pages with zero inuse objects stuck in
> > percpu lists.
>
> My observations match Jann's, and we've seen e.g. cases with 10 free slabs per
> cpu. We can also confirm Jann's theory that on kernels before the kmemcg
> rewrite (in v5.9), this issue is amplified as there are separate sets of kmem
> caches with cpu caches, per-cpu partial and per-node partial lists for each
> memcg and cache that deals with kmemcg-accounted objects.
>
> The cached free slabs can therefore become a memory waste, making memory
> pressure higher, causing more reclaim of actually used LRU pages, and even
> causing OOM (global, or memcg on older kernels).
>
> SLUB provides __kmem_cache_shrink() that can flush all the abovementioned
> slabs, but it is currently called only in rare situations, or from a sysfs
> handler. The standard way to cooperate with reclaim is to provide a shrinker,
> and so this patch adds such a shrinker to call __kmem_cache_shrink()
> systematically.
>
> The shrinker design is, however, atypical. The usual design assumes that a
> shrinker can easily count how many objects can be reclaimed, and then reclaim
> a given number of objects. For SLUB, determining the number of the various
> cached slabs would be a lot of work, and controlling how many to shrink
> precisely would be impractical. Instead, the shrinker is based on reclaim
> priority, and on lowest priority shrinks a single kmem cache, while on highest
> it shrinks all of them. To do that effectively, there's a new list
> caches_to_shrink, where caches are taken from its head and then moved to the
> tail. The existing slab_caches list is unaffected so that e.g. /proc/slabinfo
> order is not disrupted.
>
> This approach should not cause excessive shrinking and IPI storms:
>
> - If there are multiple reclaimers in parallel, only one can proceed, thanks
>   to mutex_trylock(&slab_mutex). After unlocking, caches that were just shrunk
>   are at the tail of the list.
> - In flush_all(), we actually check whether there's anything to flush by a CPU
>   (has_cpu_slab()) before sending an IPI
> - CPU slab deactivation became more efficient with "mm, slub: splice cpu and
>   page freelists in deactivate_slab()"
>
> The result is that SLUB's per-cpu and per-node caches are trimmed of free
> pages, and partially used pages have a higher chance of being either reused or
> freed. The trimming effort is controlled by reclaim activity and thus memory
> pressure. Before an OOM, a reclaim attempt at highest priority ensures
> shrinking all caches. Also, being a proper slab shrinker, the shrinking is
> now also called as part of the drop_caches sysctl operation.

Hi Vlastimil!

This makes a lot of sense, however it looks a bit like overkill to me (on
5.9+). Isn't limiting the number of pages (instead of the number of objects)
sufficient on 5.9+? (A rough sketch of what I mean is below.) If not, maybe we
can limit the shrinking to the pre-OOM condition? Do we really need to trip it
constantly?

Thanks!
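
To be a bit more concrete about the first question: I was thinking of
something along these lines in put_cpu_partial() (completely untested, just to
illustrate the idea; "cpu_partial_pages" is a made-up name for a per-cache
limit, and the locking/stat bookkeeping of the real function is elided):

static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
	struct page *oldpage;
	int pages;

	do {
		pages = 0;
		oldpage = this_cpu_read(s->cpu_slab->partial);
		if (oldpage) {
			pages = oldpage->pages;
			if (drain && pages >= s->cpu_partial_pages) {
				/* Too many cached pages: move them to the node partial list. */
				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
				oldpage = NULL;
				pages = 0;
			}
		}

		/* Account pages on the list, not (stale) free object counts. */
		page->pages = pages + 1;
		page->next = oldpage;
	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage);
}

With a hard cap on the number of pages, the per-cpu worst case stays bounded
no matter how stale the free object accounting becomes.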