On Fri 02-08-24 16:58:22, Shakeel Butt wrote: > The commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure > after many small jobs") decoupled the memcg IDs from the CSS ID space to > fix the cgroup creation failures. It introduced IDR to maintain the > memcg ID space. The IDR depends on external synchronization mechanisms > for modifications. For the mem_cgroup_idr, the idr_alloc() and > idr_replace() happen within css callback and thus are protected through > cgroup_mutex from concurrent modifications. However idr_remove() for > mem_cgroup_idr was not protected against concurrency and can be run > concurrently for different memcgs when they hit their refcnt to zero. > Fix that. > > We have been seeing list_lru based kernel crashes at a low frequency in > our fleet for a long time. These crashes were in different part of > list_lru code including list_lru_add(), list_lru_del() and reparenting > code. Upon further inspection, it looked like for a given object (dentry > and inode), the super_block's list_lru didn't have list_lru_one for the > memcg of that object. The initial suspicions were either the object is > not allocated through kmem_cache_alloc_lru() or somehow > memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but > returned success. No evidence were found for these cases. > > Looking more deeper, we started seeing situations where valid memcg's id > is not present in mem_cgroup_idr and in some cases multiple valid memcgs > have same id and mem_cgroup_idr is pointing to one of them. So, the most > reasonable explanation is that these situations can happen due to race > between multiple idr_remove() calls or race between > idr_alloc()/idr_replace() and idr_remove(). These races are causing > multiple memcgs to acquire the same ID and then offlining of one of them > would cleanup list_lrus on the system for all of them. Later access from > other memcgs to the list_lru cause crashes due to missing list_lru_one. > > Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") > Signed-off-by: Shakeel Butt <shakeel.butt@xxxxxxxxx> Sorry for being late here. Thanks for catching this up. Acked-by: Michal Hocko <mhocko@xxxxxxxx> Cc: stable is definitely due here as those races are really nasty to debug. Thanks! > --- > mm/memcontrol.c | 22 ++++++++++++++++++++-- > 1 file changed, 20 insertions(+), 2 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b889a7fbf382..8971d3473a7b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3364,11 +3364,28 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) > > #define MEM_CGROUP_ID_MAX ((1UL << MEM_CGROUP_ID_SHIFT) - 1) > static DEFINE_IDR(mem_cgroup_idr); > +static DEFINE_SPINLOCK(memcg_idr_lock); > + > +static int mem_cgroup_alloc_id(void) > +{ > + int ret; > + > + idr_preload(GFP_KERNEL); > + spin_lock(&memcg_idr_lock); > + ret = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX + 1, > + GFP_NOWAIT); > + spin_unlock(&memcg_idr_lock); > + idr_preload_end(); > + return ret; > +} > > static void mem_cgroup_id_remove(struct mem_cgroup *memcg) > { > if (memcg->id.id > 0) { > + spin_lock(&memcg_idr_lock); > idr_remove(&mem_cgroup_idr, memcg->id.id); > + spin_unlock(&memcg_idr_lock); > + > memcg->id.id = 0; > } > } > @@ -3502,8 +3519,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) > if (!memcg) > return ERR_PTR(error); > > - memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, > - 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL); > + memcg->id.id = mem_cgroup_alloc_id(); > if (memcg->id.id < 0) { > error = memcg->id.id; > goto fail; > @@ -3648,7 +3664,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) > * publish it here at the end of onlining. This matches the > * regular ID destruction during offlining. > */ > + spin_lock(&memcg_idr_lock); > idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); > + spin_unlock(&memcg_idr_lock); > > return 0; > offline_kmem: > -- > 2.43.5 -- Michal Hocko SUSE Labs