On Wed, Nov 27, 2024 at 11:05 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> > Dear SHRINKER and MEMCG experts,
> >
> > When using list_lru_add() and list_lru_del(), it seems to be required
> > that you pass the same value of nid and memcg to both calls, since
> > list_lru_del() might otherwise try to delete it from the wrong list /
> > delete it while holding the wrong spinlock. I'm trying to understand
> > the implications of this requirement on the lifetime of the memcg.
> >
> > Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> > to keep the memcg object alive for the duration of list_lru_add().
> > That rcu locking is used here seems to imply that without it, the
> > memcg could be deallocated during the list_lru_add() call, which is of
> > course bad. But rcu is not enough on its own to keep the memcg alive
> > all the way until the list_lru_del_obj() call, so how does it ensure
> > that the memcg stays valid for that long?
>
> We don't care if the memcg goes away whilst there are objects on the
> LRU. memcg destruction will reparent the objects to a different
> memcg via memcg_reparent_list_lrus() before the memcg is torn down.
> New objects should not be added to the memcg LRUs once the memcg
> teardown process starts, so there should never be add vs reparent
> races during teardown.
>
> Hence all the list_lru_add_obj() function needs to do is ensure that
> the locking/lifecycle rules for the memcg object that
> mem_cgroup_from_slab_obj() returns are obeyed.
>
> > And if there is a mechanism
> > to keep the memcg alive for the entire duration between add and del,
>
> It's enforced by the -complex- state machine used to tear down
> control groups.
>
> tl;dr: If the memcg gets torn down, it will reparent the objects on
> the LRU to its parent memcg during the teardown process.
>
> This reparenting happens in the cgroup ->css_offline() method, which
> only happens after the cgroup reference count goes to zero and is
> waited on via:
>
> kill_css
>   percpu_ref_kill_and_confirm(css_killed_ref_fn)
>   <wait>
>   css_killed_ref_fn
>     offline_css
>       mem_cgroup_css_offline
>         memcg_offline_kmem
>         {
>           .....
>           memcg_reparent_objcgs(memcg, parent);
>
>           /*
>            * After we have finished memcg_reparent_objcgs(), all list_lrus
>            * corresponding to this cgroup are guaranteed to remain empty.
>            * The ordering is imposed by list_lru_node->lock taken by
>            * memcg_reparent_list_lrus().
>            */
>           memcg_reparent_list_lrus(memcg, parent)
>         }
>
> The cgroup teardown control code then schedules the freeing of the
> memcg container via an RCU work callback once the reference count is
> globally visible as killed and has gone to zero.
>
> Hence the cgroup infrastructure requires RCU protection for the
> duration of unreferenced cgroup object accesses. This allows
> subsystems to perform operations on the cgroup object without
> needing to hold cgroup references for every access. The complex,
> multi-stage teardown process allows cgroup objects to release the
> objects they track, hence avoiding the need for every object the
> cgroup tracks to hold a reference count on the cgroup.
>
> See the comment above css_free_rwork_fn() for more details about the
> teardown process:
>
> /*
>  * css destruction is four-stage process.
>  *
>  * 1. Destruction starts. Killing of the percpu_ref is initiated.
>  *    Implemented in kill_css().
>  *
>  * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
>  *    and thus css_tryget_online() is guaranteed to fail, the css can be
>  *    offlined by invoking offline_css(). After offlining, the base ref is
>  *    put. Implemented in css_killed_work_fn().
>  *
>  * 3. When the percpu_ref reaches zero, the only possible remaining
>  *    accessors are inside RCU read sections. css_release() schedules the
>  *    RCU callback.
>  *
>  * 4. After the grace period, the css can be freed. Implemented in
>  *    css_free_rwork_fn().
>  *
>  * It is actually hairier because both step 2 and 4 require process context
>  * and thus involve punting to css->destroy_work adding two additional
>  * steps to the already complex sequence.
>  */
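So if I've understood this correctly, an add path only needs the memcg
to stay alive across the add call itself, and something like the
following is enough, with no long-lived memcg reference taken. This is
an untested sketch: "my_lru_add" is a made-up helper name, and I'm
assuming the current list_lru_add(lru, item, nid, memcg) signature and
that "item" points into a slab-allocated object:

#include <linux/list_lru.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

static bool my_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));
	struct mem_cgroup *memcg;
	bool added;

	/*
	 * RCU only needs to cover the memcg lookup and the add itself.
	 * Once the object is on the per-memcg list, memcg teardown
	 * reparents it via memcg_reparent_list_lrus(), so the entry
	 * does not pin the memcg between add and del.
	 */
	rcu_read_lock();
	memcg = mem_cgroup_from_slab_obj(item);
	added = list_lru_add(lru, item, nid, memcg);
	rcu_read_unlock();

	return added;
}

which, as far as I can tell, is essentially what list_lru_add_obj()
already does on behalf of its callers.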
Thanks a lot Dave, this clears it up for me. I sent a patch containing
some additional docs for list_lru:
https://lore.kernel.org/all/20241128-list_lru_memcg_docs-v1-1-7e4568978f4e@xxxxxxxxxx/

Alice