On 02.11.21 14:25, Michal Hocko wrote:
> On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
>>>> Yes, but a zonelist cannot be correct for an offline node, where we might
>>>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
>>>> we allocate the pgdat and set the node online (->hotadd_new_pgdat()),
>>>> the zone lists have to be correct. And I can spot a
>>>> build_all_zonelists() in hotadd_new_pgdat().
>>>
>>> Yes, that is what I had in mind. We are talking about two things here.
>>> Memoryless nodes and offline nodes. The latter sounds like a bug to me.
>>
>> Agreed. memoryless nodes should just have proper zonelists -- which
>> seems to be the case.
>>
>>>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
>>>> (VM_BUG()) in case we're providing an offline node with eventually
>>>> no/stale pgdat as preferred nid.
>>>
>>> Historically, those allocation interfaces were not trying to be robust
>>> against wrong inputs because that adds cpu cycles for everybody for
>>> "what if buggy" code. This has worked (surprisingly) well. Memoryless
>>> nodes have brought in some confusion but this is still something that we
>>> can address on a higher level. Nobody gives arbitrary nodes as an input.
>>> cpu_to_node might be tricky because it can point to a memoryless node
>>> which along with __GFP_THISNODE is very likely not something anybody
>>> wants. Hence cpu_to_mem should be used for allocations. I hate we have
>>> two very similar APIs...
>>
>> To be precise, I'm wondering if we should do:
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 55b2ec1f965a..8c49b88336ee 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -565,7 +565,7 @@ static inline struct page *
>>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>>  {
>>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>> +	VM_WARN_ON(!node_online(nid));
>>
>>  	return __alloc_pages(gfp_mask, order, nid, NULL);
>>  }
>>
>> (Or maybe VM_BUG_ON)
>>
>> Because it cannot possibly work and we'll dereference NULL later.
>
> VM_BUG_ON would be silent for most configurations and crash would happen
> even without it so I am not sure about the additional value. VM_WARN_ON
> doesn't really add much on top - except it would crash in some
> configurations. If we really care to catch this case then we would have
> to do a reasonable fallback with a printk note and a dump_stack.

As I learned, VM_BUG_ON and friends are enabled on e.g. Fedora kernels,
which can catch quite some issues early, before they end up in
enterprise distro kernels. I think it has value.

> Something like
> 	if (unlikely(!node_online(nid))) {
> 		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n", nid);
> 		dump_stack();
> 		nid = numa_mem_id();
> 	}
>
> But again this is adding quite some cycles to a hotpath of the page
> allocator. Is this worth it?

I don't think a fallback makes sense.

>
>>> But something seems wrong in this case. cpu_to_node shouldn't return
>>> offline nodes. That is just a land mine. It is not clear to me how the
>>> cpu has been brought up so that the numa node allocation was left
>>> behind. As pointed out in the other email, add_cpu resp. cpu_up is not it.
>>> Is it possible that the cpu bring up was only half way?
>>
>> I tried to follow the code (what sets a CPU present, what sets a CPU
>> online, when do we update cpu_to_node() mapping) and IMHO it's all a
>> big mess.
>> Maybe it's clearer to people familiar with that code, but CPU
>> hotplug in general seems to be a confusing piece of (arch-specific) code.
>
> Yes there are different arch specific parts that make this quite hard to
> follow.
>
> I think we want to learn how exactly Alexey brought that cpu up. Because
> his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with cpu hotplug maintainers and see what
> we can do.

Yes.

>
> Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?

You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?
cpu_to_mem() corresponds to cpu_to_node() except on ia64+ppc IIUC, so it
won't help for this very report (rough sketch at the end of this mail).

> One last thing, there were some mentions of __GFP_THISNODE but I fail to
> see the connection with the pcp allocator...

Me too. If pcpu were using __GFP_THISNODE, we'd be hitting the
VM_WARN_ON but still crash.

-- 
Thanks,

David / dhildenb
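
For completeness, the cpu_to_mem() fallback I mean looks roughly like
this -- paraphrasing include/linux/topology.h from memory, so the exact
guards and helpers may differ slightly:

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
/* Arch keeps track of the nearest node with memory for each CPU. */
static inline int cpu_to_mem(int cpu)
{
	return per_cpu(_numa_mem_, cpu);
}
#else
/* Everyone else: cpu_to_mem() is simply cpu_to_node(). */
static inline int cpu_to_mem(int cpu)
{
	return cpu_to_node(cpu);
}
#endif

IOW, unless the arch selects CONFIG_HAVE_MEMORYLESS_NODES (ia64 and
powerpc IIRC), both helpers return the same nid, and an offline nid from
a stale cpu_to_node() mapping stays just as offline.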