On Mon 06-12-21 14:47:13, David Hildenbrand wrote: > On 06.12.21 14:06, Michal Hocko wrote: > > On Mon 06-12-21 13:43:27, David Hildenbrand wrote: > > [...] > >>> Now practically speaking !node_online should not apear node_online (note > >>> I am attentionally avoiding to say offline and online as that has a > >>> completely different semantic) shouldn't really happen for some > >>> architectures. x86 should allocate pgdat for each possible node. I do > >>> not know what was the architecture in this case but we already have > >>> another report for x86 that remains unexplained. > >> > >> So we'd allocate the pgdat although all we want is just a zonelist. The > >> obvious alternative is to implement the fallback where reasonable -- for > >> example, in the page allocator. It knows the fallback order: > >> build_zonelists(). That's pretty much all we need the preferred_nid for. > >> > >> So just making prepare_alloc_pages()/node_zonelist() deal with a missing > >> pgdat could make sense as well. Something like: > >> > >> > >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h > >> index b976c4177299..2d2649e78766 100644 > >> --- a/include/linux/gfp.h > >> +++ b/include/linux/gfp.h > >> @@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags) > >> * > >> * For the case of non-NUMA systems the NODE_DATA() gets optimized to > >> * &contig_page_data at compile-time. > >> + * > >> + * If the node does not have a pgdat yet, returns the zonelist of the > >> + * first online node. > >> */ > >> static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > >> { > >> + if (unlikely(!NODE_DATA(nid))) > >> + nid = first_online_node; > >> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > >> } > > > > This is certainly possible. But it a) adds a branch to the hotpath and > > b) it doesn't solve any other potential dereference of garbage node. > > I don't think a) is a problem but it's easy to measure. Agreed to b), > however, the page allocator has been the most prominent source of error > reports for this. > > > > >> But of course, there might be value in a proper node-aware fallback list > >> as we have in build_zonelists() -- but it remains questionable if the > >> difference for these corner cases would be relevant in practice. > > > > Only the platform knows the proper node topology and that includes > > memory less nodes. So they should be setting up a node properly and we > > shouldn't be dealing with this at the allocator nor any other code. > > I *think* there are cases where the topology of a new node is only know > once it actually gets used. For example, I remember talking to CXL and > there are ideas to have a pool of possible nodes, which can get used > dynamically for CXL memory. Of course, some kind of reconfiguration > could be imaginable. I am not really familiar with CXL but if a dynamic reconfiguration is needed then this would only require to update the zonelist which should be a trivial thing to do. The code using the node doesn't really have to be aware of that as the pgdat will stay unchanged. > >> Further, if we could have thousands of nodes, we'd have to update each > >> and every one when building zone lists ... > > > > Why would that be a practical problem? > > We'll need at least > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c5952749ad40..e5d958abc7cc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6382,7 +6382,7 @@ static void __build_all_zonelists(void *data) > if (self && !node_online(self->node_id)) { > build_zonelists(self); > } else { > - for_each_online_node(nid) { > + for_each_node(nid) { > pg_data_t *pgdat = NODE_DATA(nid); > > build_zonelists(pgdat); If we do have preallocated pgdat for all possible nodes that for_each_online_node doesn't really need to exist. > But there might be more missing. Onlining a new zone will get more > expensive in setups with a lot of possible nodes (x86-64 shouldn't > really be an issue in that regard). Honestly, I am not really concerned by platforms with too many nodes without any memory. If they want to shoot their feet then that's their choice. We can optimize for those if they ever prove to be standar. > If we want stable backports, we'll want something simple upfront. For stable backports I would be fine by doing your NODE_DATA check in the allocator. In upstream I think we should be aiming for a more robust solution that is also easier to maintain further down the line. Even if that is an investment at this momemnt because the initialization code is a mess. -- Michal Hocko SUSE Labs