On Tue, 7 Dec 2021 17:40:13 -0500 Nico Pache <npache@xxxxxxxxxx> wrote: > We have run into a panic caused by a shrinker allocation being attempted > on an offlined node. > > Our crash analysis has determined that the issue originates from trying > to allocate pages on an offlined node in expand_one_shrinker_info. This > function makes the incorrect assumption that we can allocate on any node. > To correct this we make sure the node is online before tempting an > allocation. If it is not online choose the closest node. This isn't fully accurate, is it? We could allocate on a node which is presently offline but which was previously onlined, by testing NODE_DATA(nid). It isn't entirely clear to me from the v1 discussion why this approach isn't being taken? AFAICT the proposed patch is *already* taking this approach, by having no protection against a concurrent or subsequent node offlining? > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, > int size = map_size + defer_size; > > for_each_node(nid) { > + int tmp = nid; Not `tmp', please. Better to use an identifier which explains the variable's use. target_nid? And a newline after defining locals, please. > pn = memcg->nodeinfo[nid]; > old = shrinker_info_protected(memcg, nid); > /* Not yet online memcg */ > if (!old) > return 0; > > - new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid); > + if(!node_online(nid)) s/if(/if (/ > + tmp = numa_mem_id(); > + new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp); > if (!new) > return -ENOMEM; > And a code comment fully explaining what's going on here?