On Wed, Nov 27, 2024 at 11:05 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> > Dear SHRINKER and MEMCG experts,
> >
> > When using list_lru_add() and list_lru_del(), it seems to be required
> > that you pass the same value of nid and memcg to both calls, since
> > list_lru_del() might otherwise try to delete it from the wrong list /
> > delete it while holding the wrong spinlock. I'm trying to understand
> > the implications of this requirement on the lifetime of the memcg.
> >
> > Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> > to keep the memcg object alive for the duration of list_lru_add().
> > That rcu locking is used here seems to imply that without it, the
> > memcg could be deallocated during the list_lru_add() call, which is of
> > course bad. But rcu is not enough on its own to keep the memcg alive
> > all the way until the list_lru_del_obj() call, so how does it ensure
> > that the memcg stays valid for that long?
>
> We don't care if the memcg goes away whilst there are objects on the
> LRU. memcg destruction will reparent the objects to a different
> memcg via memcg_reparent_list_lrus() before the memcg is torn down.
> New objects should not be added to the memcg LRUs once the memcg
> teardown process starts, so there should never be add vs reparent
> races during teardown.
>
> Hence all the list_lru_add_obj() function needs to do is ensure that
> the locking/lifecycle rules for the memcg object that
> mem_cgroup_from_slab_obj() returns are obeyed.
>
> > And if there is a mechanism
> > to keep the memcg alive for the entire duration between add and del,
>
> It's enforced by the -complex- state machine used to tear down
> control groups.
>
> tl;dr: If the memcg gets torn down, it will reparent the objects on
> the LRU to its parent memcg during the teardown process.
>
> This reparenting happens in the cgroup ->css_offline() method, which
> only happens after the cgroup reference count goes to zero and is
> waited on via:
>
> kill_css
>   percpu_ref_kill_and_confirm(css_killed_ref_fn)
>   <wait>
>   css_killed_ref_fn
>     offline_css
>       mem_cgroup_css_offline
>         memcg_offline_kmem
>         {
>           .....
>           memcg_reparent_objcgs(memcg, parent);
>
>           /*
>            * After we have finished memcg_reparent_objcgs(), all list_lrus
>            * corresponding to this cgroup are guaranteed to remain empty.
>            * The ordering is imposed by list_lru_node->lock taken by
>            * memcg_reparent_list_lrus().
>            */
>           memcg_reparent_list_lrus(memcg, parent)
>         }
>
> The cgroup teardown control code then schedules the freeing of the
> memcg container via an RCU work callback once the reference count is
> globally visible as killed and has gone to zero.
>
> Hence the cgroup infrastructure requires RCU protection for the
> duration of unreferenced cgroup object accesses. This allows
> subsystems to perform operations on the cgroup object without
> needing to hold cgroup references for every access. The complex,
> multi-stage teardown process allows cgroup objects to release the
> objects they track, hence avoiding the need for every object the
> cgroup tracks to hold a reference count on the cgroup.
>
> See the comment above css_free_rwork_fn() for more details about the
> teardown process:
>
> /*
>  * css destruction is four-stage process.
>  *
>  * 1. Destruction starts. Killing of the percpu_ref is initiated.
>  *    Implemented in kill_css().
>  *
>  * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
>  *    and thus css_tryget_online() is guaranteed to fail, the css can be
>  *    offlined by invoking offline_css(). After offlining, the base ref is
>  *    put. Implemented in css_killed_work_fn().
>  *
>  * 3. When the percpu_ref reaches zero, the only possible remaining
>  *    accessors are inside RCU read sections. css_release() schedules the
>  *    RCU callback.
>  *
>  * 4. After the grace period, the css can be freed. Implemented in
>  *    css_free_rwork_fn().
>  *
>  * It is actually hairier because both step 2 and 4 require process context
>  * and thus involve punting to css->destroy_work adding two additional
>  * steps to the already complex sequence.
>  */
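So if I've understood this correctly, an add path only needs the memcg
to stay alive across the add call itself, and something like the
following is enough, with no long-lived memcg reference taken. This is
an untested sketch: "my_lru_add" is a made-up helper name, and I'm
assuming the current list_lru_add(lru, item, nid, memcg) signature and
that "item" points into a slab-allocated object:

#include <linux/list_lru.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

static bool my_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));
	struct mem_cgroup *memcg;
	bool added;

	/*
	 * RCU only needs to cover the memcg lookup and the add itself.
	 * Once the object is on the per-memcg list, memcg teardown
	 * reparents it via memcg_reparent_list_lrus(), so the entry
	 * does not pin the memcg between add and del.
	 */
	rcu_read_lock();
	memcg = mem_cgroup_from_slab_obj(item);
	added = list_lru_add(lru, item, nid, memcg);
	rcu_read_unlock();

	return added;
}

which, as far as I can tell, is essentially what list_lru_add_obj()
already does on behalf of its callers.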
Thanks a lot Dave, this clears it up for me. I sent a patch containing
some additional docs for list_lru:
https://lore.kernel.org/all/20241128-list_lru_memcg_docs-v1-1-7e4568978f4e@xxxxxxxxxx/

Alice