On Sat, Aug 24, 2024 at 03:10:09PM -0400, Kent Overstreet wrote:
> This adds a new callback method to shrinkers which they can use to
> describe anything relevant to memory reclaim about their internal state,
> for example object dirtyness.

....

> +	if (!mutex_trylock(&shrinker_mutex)) {
> +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> +		return;
> +	}

Please don't use the shrinker_mutex like this. There can be tens of
thousands of entries in the shrinker list (because memcgs) and holding
the shrinker_mutex for long running traversals like this is known to
cause latency problems for memcg reaping. If we are at ENOMEM, the
last thing we want to be doing is preventing memcgs from being reaped.

> +	list_for_each_entry(shrinker, &shrinker_list, list) {
> +		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };

This iteration and counting setup is neither node nor memcg aware. For
node aware shrinkers, this will only count the items freeable on node
0 and ignore all the other memory in the system. For memcg systems, it
will also only scan the root memcg and so miss counting any memory in
memcg owned caches.

IOWs, the shrinker iteration mechanism needs to iterate both by NUMA
node and by memcg. On large machines with multiple nodes and hosting
thousands of memcgs, a total shrinker state iteration has to walk a
-lot- of structures.

An example of this is drop_slab() - called from
/proc/sys/vm/drop_caches. It does this to iterate all the shrinkers
for all the nodes and memcgs in the system:

static unsigned long drop_slab_node(int nid)
{
	unsigned long freed = 0;
	struct mem_cgroup *memcg = NULL;

	memcg = mem_cgroup_iter(NULL, NULL, NULL);
	do {
		freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);

	return freed;
}

void drop_slab(void)
{
	int nid;
	int shift = 0;
	unsigned long freed;

	do {
		freed = 0;
		for_each_online_node(nid) {
			if (fatal_signal_pending(current))
				return;

			freed += drop_slab_node(nid);
		}
	} while ((freed >> shift++) > 1);
}

Hence any iteration for finding the 10 largest shrinkable caches in
the system needs to do something similar. Only, it needs to iterate
memcgs first and then aggregate object counts across all nodes for
shrinkers that are NUMA aware.

Because it needs direct access to the shrinkers, it will need to use
the RCU lock + refcount method of traversal because that's the only
safe way to go from memcg to shrinker instance. IOWs, it needs to
mirror the code in shrink_slab/shrink_slab_memcg to obtain a safe
reference to the relevant shrinker so it can call ->count_objects()
and store a refcounted pointer to the shrinker(s) that will get
printed out after the scan is done. A rough sketch of what I mean is
appended below my sig.

Once the shrinker iteration is sorted out, I'll look further at the
rest of the code in this patch...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
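
A rough, compile-untested sketch of the non-memcg leg of that
traversal. It mirrors the shrink_slab() loop in mm/shrinker.c and
assumes the shrinker_try_get()/shrinker_put() refcount helpers are
available; the memcg leg (an outer mem_cgroup_iter() loop setting
sc.memcg, mirroring shrink_slab_memcg()) and the bookkeeping for
remembering the N largest shrinkers are elided:

/*
 * Untested sketch: count a shrinker's freeable objects across all
 * online nodes. Shrinkers that are not NUMA aware keep a single
 * global count and only need to be asked once, with nid 0.
 */
static unsigned long shrinker_count_nodes(struct shrinker *shrinker)
{
	unsigned long count = 0;
	int nid;

	for_each_online_node(nid) {
		struct shrink_control sc = {
			.gfp_mask	= GFP_KERNEL,
			.nid		= nid,
		};
		unsigned long freeable;

		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
			sc.nid = 0;

		freeable = shrinker->count_objects(shrinker, &sc);
		if (freeable != SHRINK_EMPTY)
			count += freeable;

		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
			break;
	}
	return count;
}

/*
 * Untested sketch: walk shrinker_list under RCU, pinning each
 * shrinker with a refcount before calling into it, exactly as
 * shrink_slab() does. No shrinker_mutex needed. A shrinker that
 * should be printed later can keep the reference it holds here
 * (i.e. skip the shrinker_put()) until the report is emitted.
 * Memcg-aware shrinkers additionally need the mem_cgroup_iter()
 * walk, as per shrink_slab_memcg().
 */
static void shrinkers_report_largest(void)
{
	struct shrinker *shrinker;

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		unsigned long count;

		if (!shrinker_try_get(shrinker))
			continue;
		rcu_read_unlock();

		count = shrinker_count_nodes(shrinker);
		/* ... remember (shrinker, count) if it's a top-N entry ... */

		rcu_read_lock();
		shrinker_put(shrinker);
	}
	rcu_read_unlock();
}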