On Wed, Sep 09, 2020 at 09:55:20AM -0700, Roman Gushchin wrote:
> On Wed, Sep 09, 2020 at 10:55:34AM -0400, Johannes Weiner wrote:
> > On Thu, Sep 03, 2020 at 04:00:55PM -0700, Roman Gushchin wrote:
> > > In the memcg case count_shadow_nodes() sums the number of pages in lru
> > > lists and the amount of slab memory (reclaimable and non-reclaimable)
> > > as a baseline for the allowed number of shadow entries.
> > >
> > > It seems to be a good analogy for the !memcg case, where
> > > node_present_pages() is used. However, it's not quite true, as there
> > > are two problems:
> > >
> > > 1) Due to slab reparenting introduced by commit fb2f2b0adb98 ("mm:
> > > memcg/slab: reparent memcg kmem_caches on cgroup removal") local
> > > per-lruvec slab counters might be inaccurate on non-leaf levels.
> > > It's the only place where local slab counters are used.
> >
> > Hm, that sounds like a bug tbh. We're reparenting the kmem caches and
> > the individual objects on the list_lru when a cgroup is removed -
> > shouldn't we also reparent the corresponding memory counters?
>
> It's definitely an option too; the question is whether the added code
> complexity is really worth it. I'd say no if we were talking only about
> slab counters, but when we eventually start reparenting pagecache, we'll
> need to reparent other counters as well, so we'll need it anyway. So,
> ok, let's drop this patch for now.
>
> > > 2) Shadow nodes are themselves backed by slabs. So there is a
> > > circular dependency: the more shadow entries there are, the less
> > > pressure the kernel applies to reclaim them.
> >
> > This effect is negligible in practice.
> >
> > The permitted shadow nodes are a tiny percentage of memory consumed by
> > the cgroup. If shadow nodes make up a significant part of the cgroup's
> > footprint, or are the only thing left, they will be pushed out fast.
> >
> > The formula is max_nodes = total_pages >> 3, and one page can hold 28
> > nodes.
> > So if the cgroup holds nothing but 262,144 pages (1G) of shadow
> > nodes, the shrinker target is 32,768 nodes, which is 32,768 pages
> > (128M) in the worst packing case and 1,170 pages (4M) at best.
> >
> > However, if you don't take slab into account here, it can evict shadow
> > entries with undue aggression when they are needed the most. If, say,
> > the inode or dentry cache explodes temporarily and displaces the page
> > cache, it would be a big problem to drop the cache's non-resident info
> > at the same time! This is when it's at its most important.
>
> Just curious, have you seen this in real life?

I have seen it with anon pages back when we targeted the page cache
instead of the total memory footprint. Especially in the context of psi
it missed thrashing situations, see:

commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928
Author: Johannes Weiner <jweiner@xxxxxx>
Date:   Fri Oct 26 15:05:59 2018 -0700

    mm: workingset: don't drop refault information prematurely

I can't remember if I saw slabs doing the same, but it's equally
plausible.