On 12/6/21 04:22, Michal Hocko wrote: > On Sun 05-12-21 22:33:38, Nico Pache wrote: >> We have run into a panic caused by a shrinker allocation being attempted >> on an offlined node. >> >> Our crash analysis has determined that the issue originates from trying >> to allocate pages on an offlined node in expand_one_shrinker_info. This >> function makes the incorrect assumption that we can allocate on any node. >> To correct this we make sure we only itterate over online nodes. >> >> This assumption can lead to an incorrect address being assigned to ac->zonelist >> in the following callchain: >> __alloc_pages >> -> prepare_alloc_pages >> -> node_zonelist >> >> static inline struct zonelist *node_zonelist(int nid, gfp_t flags) >> { >> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); >> } >> if the node is not online the return of node_zonelist will evaluate to a >> invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0) >> >> This address is then dereferenced further down the callchain in: >> prepare_alloc_pages >> -> first_zones_zonelist >> -> next_zones_zonelist >> -> zonelist_zone_idx >> >> static inline int zonelist_zone_idx(struct zoneref *zoneref) >> { >> return zoneref->zone_idx; >> } >> >> Leading the system to panic. > > Thanks for the analysis! Please also add an oops report so that this is > easier to search for. It would be also interesting to see specifics > about the issue. Why was the specific node !online in the first place? > What architecture was this on? Here is the Oops report. I will also add it to my commit message on the second posting! This was x86 btw, but it has also been hit on PPC64. [ 362.179917] RIP: 0010:prepare_alloc_pages.constprop.0+0xc6/0x150 [ 362.186639] Code: 03 80 c9 80 83 e2 03 83 fa 01 0f 44 c1 41 89 04 24 c1 eb 0c 48 8b 55 08 83 e3 01 8b 75 1c 48 8b 7d 00 88 5d 20 48 85 d2 75 6b <3b> 77 08 72 66 48 89 7d 10 b8 01 00 00 00 48 83 c4 08 5b 5d 41 5c [ 362.207604] RSP: 0018:ffffb4ba31427bc0 EFLAGS: 00010246 [ 362.213443] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000081 [ 362.221412] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080 [ 362.229380] RBP: ffffb4ba31427bf8 R08: 0000000000000000 R09: ffffb4ba31427bf4 [ 362.237347] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb4ba31427bf0 [ 362.245316] R13: 0000000000000002 R14: ffff9f2fb3788000 R15: 0000000000000078 [ 362.253285] FS: 00007fbc57bfd740(0000) GS:ffff9f4c7d780000(0000) knlGS:0000000000000000 [ 362.262322] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 362.268739] CR2: 0000000000002088 CR3: 000002004cb58002 CR4: 00000000007706e0 [ 362.276707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 362.284675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 362.292644] PKRU: 55555554 [ 362.295669] Call Trace: [ 362.298402] __alloc_pages+0x9d/0x210 [ 362.302501] kmalloc_large_node+0x40/0x90 [ 362.306988] __kmalloc_node+0x3ac/0x480 [ 362.311279] kvmalloc_node+0x46/0x80 [ 362.315276] expand_one_shrinker_info+0x84/0x190 [ 362.320437] prealloc_shrinker+0x166/0x1c0 [ 362.325015] alloc_super+0x2ba/0x330 [ 362.329011] ? __fput_sync+0x30/0x30 [ 362.333003] ? set_anon_super+0x40/0x40 [ 362.337288] sget_fc+0x6c/0x2f0 [ 362.340798] ? mqueue_create+0x20/0x20 [ 362.344992] get_tree_keyed+0x2f/0xc0 [ 362.349086] vfs_get_tree+0x25/0xb0 [ 362.352982] fc_mount+0xe/0x30 [ 362.356397] mq_init_ns+0x105/0x1a0 [ 362.360291] copy_ipcs+0x129/0x220 [ 362.364093] create_new_namespaces+0xa1/0x2e0 [ 362.368960] unshare_nsproxy_namespaces+0x55/0xa0 [ 362.374216] ksys_unshare+0x198/0x380 [ 362.378310] __x64_sys_unshare+0xe/0x20 [ 362.382595] do_syscall_64+0x3b/0x90 [ 362.386597] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 362.392247] RIP: 0033:0x7fbc57d14d7b [ 362.396242] Code: 73 01 c3 48 8b 0d ad 70 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 70 0e 00 f7 d8 64 89 01 48 [ 362.417204] RSP: 002b:00007fbc4dc73f88 EFLAGS: 00000206 ORIG_RAX: 0000000000000110 [ 362.425664] RAX: ffffffffffffffda RBX: 0000560144b09578 RCX: 00007fbc57d14d7b [ 362.433634] RDX: 0000000000000010 RSI: 00007fbc4dc73f90 RDI: 0000000008000000 [ 362.441602] RBP: 0000560144b095a0 R08: 0000000000000000 R09: 0000000000000000 [ 362.449573] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fbc4dc73f90 [ 362.457542] R13: 000000000000006a R14: 0000560144e7e5a0 R15: 00007fff5dec8e10 > >> We also correct this behavior in alloc_shrinker_info, free_shrinker_info, >> and reparent_shrinker_deferred. >> >> Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code") >> Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg") > > Normally I would split the fix as it is fixing two issues one introduced > in 4.19 the other in 5.13. These are both commits that introduced the function, one introduces it while the other moves it to a separate file. But as Yang Shi pointed out the better commit to blame is 86daf94efb11 ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node") which actually made the change that would have caused the allocator to go for an !online node. > >> Signed-off-by: Nico Pache <npache@xxxxxxxxxx> >> --- >> mm/vmscan.c | 8 ++++---- >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index fb9584641ac7..731564b61e3f 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, >> int nid; >> int size = map_size + defer_size; >> >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> pn = memcg->nodeinfo[nid]; >> old = shrinker_info_protected(memcg, nid); >> /* Not yet online memcg */ >> @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg) >> struct shrinker_info *info; >> int nid; >> >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> pn = memcg->nodeinfo[nid]; >> info = rcu_dereference_protected(pn->shrinker_info, true); >> kvfree(info); >> @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg) >> map_size = shrinker_map_size(shrinker_nr_max); >> defer_size = shrinker_defer_size(shrinker_nr_max); >> size = map_size + defer_size; >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid); >> if (!info) { >> free_shrinker_info(memcg); >> @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg) >> >> /* Prevent from concurrent shrinker_info expand */ >> down_read(&shrinker_rwsem); >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> child_info = shrinker_info_protected(memcg, nid); >> parent_info = shrinker_info_protected(parent, nid); >> for (i = 0; i < shrinker_nr_max; i++) { >> -- >> 2.33.1 > > This doesn't seen complete. Slab shrinkers are used in the reclaim > context. Previously offline nodes could be onlined later and this would > lead to NULL ptr because there is no hook to allocate new shrinker > infos. This would be also really impractical because this would have to > update all existing memcgs... > > To be completely honest I am not really sure this is a practical problem > because some architectures allocate (aka make online) all possible nodes > reported by the platform. There are major inconsistencies there. Maybe > that should be unified, so that problems like this one do not really > have to add a complexity to the code. Im currently working a solution that will use register_one_node to perform the memcg node allocation for all memcgs. I will post that once I've verified all potential corner cases. Cheers, -- Nico >