On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> Dear SHRINKER and MEMCG experts,
>
> When using list_lru_add() and list_lru_del(), it seems to be required
> that you pass the same value of nid and memcg to both calls, since
> list_lru_del() might otherwise try to delete it from the wrong list /
> delete it while holding the wrong spinlock. I'm trying to understand
> the implications of this requirement on the lifetime of the memcg.
>
> Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> to keep the memcg object alive for the duration of list_lru_add().
> That rcu locking is used here seems to imply that without it, the
> memcg could be deallocated during the list_lru_add() call, which is of
> course bad. But rcu is not enough on its own to keep the memcg alive
> all the way until the list_lru_del_obj() call, so how does it ensure
> that the memcg stays valid for that long?

We don't care if the memcg goes away whilst there are objects on the
LRU. memcg destruction will reparent the objects to a different memcg
via memcg_reparent_list_lrus() before the memcg is torn down.

New objects should not be added to the memcg LRUs once the memcg
teardown process starts, so there should never be add vs reparent
races during teardown.

Hence all that list_lru_add_obj() needs to do is ensure that the
locking/lifecycle rules for the memcg object returned by
mem_cgroup_from_slab_obj() are obeyed.
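For illustration, the add/del wrapper pair looks roughly like this in
recent kernels (paraphrased from mm/list_lru.c; exact signatures and
the nid lookup details vary by kernel version, so treat this as a
sketch rather than a verbatim copy):

	bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
	{
		bool ret;
		int nid = page_to_nid(virt_to_page(item));

		if (list_lru_memcg_aware(lru)) {
			/*
			 * RCU only needs to keep the memcg returned by
			 * mem_cgroup_from_slab_obj() alive across this
			 * one call, not until the object is deleted.
			 */
			rcu_read_lock();
			ret = list_lru_add(lru, item, nid,
					   mem_cgroup_from_slab_obj(item));
			rcu_read_unlock();
		} else {
			ret = list_lru_add(lru, item, nid, NULL);
		}
		return ret;
	}

	bool list_lru_del_obj(struct list_lru *lru, struct list_head *item)
	{
		bool ret;
		int nid = page_to_nid(virt_to_page(item));

		if (list_lru_memcg_aware(lru)) {
			/*
			 * The memcg is re-derived from the object here,
			 * not cached from the add side. If the original
			 * memcg has since been torn down, the objcg was
			 * reparented and this returns the parent memcg,
			 * i.e. the list the object was moved to.
			 */
			rcu_read_lock();
			ret = list_lru_del(lru, item, nid,
					   mem_cgroup_from_slab_obj(item));
			rcu_read_unlock();
		} else {
			ret = list_lru_del(lru, item, nid, NULL);
		}
		return ret;
	}

Note the symmetry: both sides derive nid and memcg from the object
itself, so a reparented object is simply deleted from the parent
memcg's list that it now lives on.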
> And if there is a mechanism to keep the memcg alive for the entire
> duration between add and del,

It's enforced by the -complex- state machine used to tear down
control groups.

tl;dr: If the memcg gets torn down, the objects on its LRUs will be
reparented to its parent memcg during the teardown process.

This reparenting happens in the cgroup ->css_offline() method, which
only runs once the kill of the cgroup's percpu reference count is
confirmed to be visible on all CPUs, waited on via:

  kill_css
    percpu_ref_kill_and_confirm(css_killed_ref_fn)
    <wait>
    css_killed_ref_fn
      offline_css
        mem_cgroup_css_offline
          memcg_offline_kmem
          {
		.....
		memcg_reparent_objcgs(memcg, parent);

		/*
		 * After we have finished memcg_reparent_objcgs(), all list_lrus
		 * corresponding to this cgroup are guaranteed to remain empty.
		 * The ordering is imposed by list_lru_node->lock taken by
		 * memcg_reparent_list_lrus().
		 */
		memcg_reparent_list_lrus(memcg, parent)
          }

The cgroup teardown control code then schedules the freeing of the
memcg container via an RCU work callback once the reference count is
globally visible as killed and has gone to zero.

Hence the cgroup infrastructure requires RCU protection for the
duration of unreferenced cgroup object accesses. This allows
subsystems to perform operations on a cgroup object without needing
to hold a cgroup reference for every access. The complex, multi-stage
teardown process allows the cgroup to release the objects it tracks,
avoiding the need for every tracked object to hold a reference count
on the cgroup.

See the comment above css_free_rwork_fn() for more details about the
teardown process:

/*
 * css destruction is four-stage process.
 *
 * 1. Destruction starts.  Killing of the percpu_ref is initiated.
 *    Implemented in kill_css().
 *
 * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
 *    and thus css_tryget_online() is guaranteed to fail, the css can be
 *    offlined by invoking offline_css().  After offlining, the base ref is
 *    put.  Implemented in css_killed_work_fn().
 *
 * 3. When the percpu_ref reaches zero, the only possible remaining
 *    accessors are inside RCU read sections.  css_release() schedules the
 *    RCU callback.
 *
 * 4. After the grace period, the css can be freed.  Implemented in
 *    css_free_rwork_fn().
 *
 * It is actually hairier because both step 2 and 4 require process context
 * and thus involve punting to css->destroy_work adding two additional
 * steps to the already complex sequence.
 */

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx