>> Yes, but a zonelist cannot be correct for an offline node, where we might
>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon
>> as we allocate the pgdat and set the node online (->hotadd_new_pgdat()),
>> the zone lists have to be correct. And I can spot a build_all_zonelists()
>> in hotadd_new_pgdat().
>
> Yes, that is what I had in mind. We are talking about two things here.
> Memoryless nodes and offline nodes. The latter sounds like a bug to me.

Agreed. Memoryless nodes should just have proper zonelists -- which seems
to be the case.

>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out
>> directly (VM_BUG()) in case we're providing an offline node with
>> eventually no/stale pgdat as preferred nid.
>
> Historically, those allocation interfaces were not trying to be robust
> against wrong inputs because that adds cpu cycles for everybody for
> "what if buggy" code. This has worked (surprisingly) well. Memoryless
> nodes have brought in some confusion but this is still something that we
> can address on a higher level. Nobody gives arbitrary nodes as an input.
> cpu_to_node might be tricky because it can point to a memoryless node
> which along with __GFP_THISNODE is very likely not something anybody
> wants. Hence cpu_to_mem should be used for allocations. I hate that we
> have two very similar APIs ...

To be precise, I'm wondering if we should do:

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..8c49b88336ee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -565,7 +565,7 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+	VM_WARN_ON(!node_online(nid));
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }

(Or maybe VM_BUG_ON.) Because it cannot possibly work and we'll
dereference NULL later.

>
> But something seems wrong in this case.
> cpu_to_node shouldn't return
> offline nodes. That is just a land mine. It is not clear to me how the
> cpu has been brought up so that the numa node allocation was left
> behind. As pointed out in the other email, add_cpu resp. cpu_up is not
> it. Is it possible that the cpu bring up was only half way?

I tried to follow the code (what sets a CPU present, what sets a CPU
online, when do we update the cpu_to_node() mapping) and IMHO it's all a
big mess. Maybe it's clearer to people familiar with that code, but CPU
hotplug in general seems to be a confusing piece of (arch-specific) code.

Also, I have no clue whether the cpu_to_node() mapping gets invalidated
after unplugging that CPU, or whether the mapping simply stays around for
all eternity ...

-- 
Thanks,

David / dhildenb