On Sat, Aug 24, 2024 at 03:10:09PM -0400, Kent Overstreet wrote:
> This adds a new callback method to shrinkers which they can use to
> describe anything relevant to memory reclaim about their internal state,
> for example object dirtyness.

....

> +	if (!mutex_trylock(&shrinker_mutex)) {
> +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> +		return;
> +	}

Please don't use the shrinker_mutex like this. There can be tens of
thousands of entries in the shrinker list (because memcgs) and holding
the shrinker_mutex for long running traversals like this is known to
cause latency problems for memcg reaping. If we are at ENOMEM, the
last thing we want to be doing is preventing memcgs from being reaped.

> +	list_for_each_entry(shrinker, &shrinker_list, list) {
> +		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };

This iteration and counting setup is neither node nor memcg aware. For
node aware shrinkers, this will only count the items freeable on node
0 and ignore all the other memory in the system. For memcg systems, it
will also only scan the root memcg and so miss counting any memory in
memcg owned caches.

IOWs, the shrinker iteration mechanism needs to iterate both by NUMA
node and by memcg. On large machines with multiple nodes and hosting
thousands of memcgs, a total shrinker state iteration has to walk a
-lot- of structures.

An example of this is drop_slab() - called from
/proc/sys/vm/drop_caches. It does this to iterate all the shrinkers
for all the nodes and memcgs in the system:

static unsigned long drop_slab_node(int nid)
{
	unsigned long freed = 0;
	struct mem_cgroup *memcg = NULL;

	memcg = mem_cgroup_iter(NULL, NULL, NULL);
	do {
		freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);

	return freed;
}

void drop_slab(void)
{
	int nid;
	int shift = 0;
	unsigned long freed;

	do {
		freed = 0;
		for_each_online_node(nid) {
			if (fatal_signal_pending(current))
				return;

			freed += drop_slab_node(nid);
		}
	} while ((freed >> shift++) > 1);
}

Hence any iteration for finding the 10 largest shrinkable caches in
the system needs to do something similar. Only, it needs to iterate
memcgs first and then aggregate object counts across all nodes for
shrinkers that are NUMA aware.

Because it needs direct access to the shrinkers, it will need to use
the RCU lock + refcount method of traversal because that's the only
safe way to go from memcg to shrinker instance. IOWs, it needs to
mirror the code in shrink_slab/shrink_slab_memcg to obtain a safe
reference to the relevant shrinker so it can call ->count_objects()
and store a refcounted pointer to the shrinker(s) that will get
printed out after the scan is done. A rough sketch of what I mean is
appended below my sig.

Once the shrinker iteration is sorted out, I'll look further at the
rest of the code in this patch...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
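
A rough, compile-untested sketch of the non-memcg leg of that
traversal. It mirrors the shrink_slab() loop in mm/shrinker.c and
assumes the shrinker_try_get()/shrinker_put() refcount helpers are
available; the memcg leg (an outer mem_cgroup_iter() loop setting
sc.memcg, mirroring shrink_slab_memcg()) and the bookkeeping for
remembering the N largest shrinkers are elided:

/*
 * Untested sketch: count a shrinker's freeable objects across all
 * online nodes. Shrinkers that are not NUMA aware keep a single
 * global count and only need to be asked once, with nid 0.
 */
static unsigned long shrinker_count_nodes(struct shrinker *shrinker)
{
	unsigned long count = 0;
	int nid;

	for_each_online_node(nid) {
		struct shrink_control sc = {
			.gfp_mask	= GFP_KERNEL,
			.nid		= nid,
		};
		unsigned long freeable;

		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
			sc.nid = 0;

		freeable = shrinker->count_objects(shrinker, &sc);
		if (freeable != SHRINK_EMPTY)
			count += freeable;

		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
			break;
	}
	return count;
}

/*
 * Untested sketch: walk shrinker_list under RCU, pinning each
 * shrinker with a refcount before calling into it, exactly as
 * shrink_slab() does. No shrinker_mutex needed. A shrinker that
 * should be printed later can keep the reference it holds here
 * (i.e. skip the shrinker_put()) until the report is emitted.
 * Memcg-aware shrinkers additionally need the mem_cgroup_iter()
 * walk, as per shrink_slab_memcg().
 */
static void shrinkers_report_largest(void)
{
	struct shrinker *shrinker;

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		unsigned long count;

		if (!shrinker_try_get(shrinker))
			continue;
		rcu_read_unlock();

		count = shrinker_count_nodes(shrinker);
		/* ... remember (shrinker, count) if it's a top-N entry ... */

		rcu_read_lock();
		shrinker_put(shrinker);
	}
	rcu_read_unlock();
}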