On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> Dear SHRINKER and MEMCG experts,
>
> When using list_lru_add() and list_lru_del(), it seems to be required
> that you pass the same value of nid and memcg to both calls, since
> list_lru_del() might otherwise try to delete it from the wrong list /
> delete it while holding the wrong spinlock. I'm trying to understand
> the implications of this requirement on the lifetime of the memcg.
>
> Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> to keep the memcg object alive for the duration of list_lru_add().
> That rcu locking is used here seems to imply that without it, the
> memcg could be deallocated during the list_lru_add() call, which is of
> course bad. But rcu is not enough on its own to keep the memcg alive
> all the way until the list_lru_del_obj() call, so how does it ensure
> that the memcg stays valid for that long?

We don't care if the memcg goes away whilst there are objects on the
LRU. memcg destruction will reparent the objects to a different memcg
via memcg_reparent_list_lrus() before the memcg is torn down.

New objects should not be added to the memcg LRUs once the memcg
teardown process starts, so there should never be add vs reparent
races during teardown.

Hence all that list_lru_add_obj() needs to do is ensure that the
locking/lifecycle rules for the memcg object returned by
mem_cgroup_from_slab_obj() are obeyed.
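For illustration, the add/del wrapper pair looks roughly like this in
recent kernels (paraphrased from mm/list_lru.c; exact signatures and
the nid lookup details vary by kernel version, so treat this as a
sketch rather than a verbatim copy):

	bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
	{
		bool ret;
		int nid = page_to_nid(virt_to_page(item));

		if (list_lru_memcg_aware(lru)) {
			/*
			 * RCU only needs to keep the memcg returned by
			 * mem_cgroup_from_slab_obj() alive across this
			 * one call, not until the object is deleted.
			 */
			rcu_read_lock();
			ret = list_lru_add(lru, item, nid,
					   mem_cgroup_from_slab_obj(item));
			rcu_read_unlock();
		} else {
			ret = list_lru_add(lru, item, nid, NULL);
		}
		return ret;
	}

	bool list_lru_del_obj(struct list_lru *lru, struct list_head *item)
	{
		bool ret;
		int nid = page_to_nid(virt_to_page(item));

		if (list_lru_memcg_aware(lru)) {
			/*
			 * The memcg is re-derived from the object here,
			 * not cached from the add side. If the original
			 * memcg has since been torn down, the objcg was
			 * reparented and this returns the parent memcg,
			 * i.e. the list the object was moved to.
			 */
			rcu_read_lock();
			ret = list_lru_del(lru, item, nid,
					   mem_cgroup_from_slab_obj(item));
			rcu_read_unlock();
		} else {
			ret = list_lru_del(lru, item, nid, NULL);
		}
		return ret;
	}

Note the symmetry: both sides derive nid and memcg from the object
itself, so a reparented object is simply deleted from the parent
memcg's list that it now lives on.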
> And if there is a mechanism to keep the memcg alive for the entire
> duration between add and del,

It's enforced by the -complex- state machine used to tear down
control groups.

tl;dr: If the memcg gets torn down, the objects on its LRUs will be
reparented to its parent memcg during the teardown process.

This reparenting happens in the cgroup ->css_offline() method, which
only runs once the kill of the cgroup's percpu reference count is
confirmed to be visible on all CPUs, waited on via:

  kill_css
    percpu_ref_kill_and_confirm(css_killed_ref_fn)
    <wait>
    css_killed_ref_fn
      offline_css
        mem_cgroup_css_offline
          memcg_offline_kmem
          {
		.....
		memcg_reparent_objcgs(memcg, parent);

		/*
		 * After we have finished memcg_reparent_objcgs(), all list_lrus
		 * corresponding to this cgroup are guaranteed to remain empty.
		 * The ordering is imposed by list_lru_node->lock taken by
		 * memcg_reparent_list_lrus().
		 */
		memcg_reparent_list_lrus(memcg, parent)
          }

The cgroup teardown control code then schedules the freeing of the
memcg container via an RCU work callback once the reference count is
globally visible as killed and has gone to zero.

Hence the cgroup infrastructure requires RCU protection for the
duration of unreferenced cgroup object accesses. This allows
subsystems to perform operations on a cgroup object without needing
to hold a cgroup reference for every access. The complex, multi-stage
teardown process allows the cgroup to release the objects it tracks,
avoiding the need for every tracked object to hold a reference count
on the cgroup.

See the comment above css_free_rwork_fn() for more details about the
teardown process:

/*
 * css destruction is four-stage process.
 *
 * 1. Destruction starts.  Killing of the percpu_ref is initiated.
 *    Implemented in kill_css().
 *
 * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
 *    and thus css_tryget_online() is guaranteed to fail, the css can be
 *    offlined by invoking offline_css().  After offlining, the base ref is
 *    put.  Implemented in css_killed_work_fn().
 *
 * 3. When the percpu_ref reaches zero, the only possible remaining
 *    accessors are inside RCU read sections.  css_release() schedules the
 *    RCU callback.
 *
 * 4. After the grace period, the css can be freed.  Implemented in
 *    css_free_rwork_fn().
 *
 * It is actually hairier because both step 2 and 4 require process context
 * and thus involve punting to css->destroy_work adding two additional
 * steps to the already complex sequence.
 */

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx