On 07.03.22 16:07, Oscar Salvador wrote:
> All possible nodes are now pre-allocated at boot time by free_area_init()->
> free_area_init_node(), and those which are to be hot-plugged are initialized
> later on by hotadd_init_pgdat()->free_area_init_core_hotplug() when they
> become online.
>
> free_area_init_core_hotplug() calls pgdat_init_internals() and
> zone_init_internals() to initialize some internal data structures
> and zeroes a few pgdat fields.
>
> But we do already call pgdat_init_internals() and zone_init_internals()
> for all possible nodes back in free_area_init_core(), and pgdat fields
> are already zeroed because the pre-allocation memsets with 0s the
> structure, meaning we do not need to repeat the process when
> the node becomes online.
>
> So initialize it only once when booting, and make sure to reset
> the fields we care about to 0 when the node goes empty.
> The only thing we need to check for is to allocate per_cpu_nodestats
> struct the very first time this node goes online.
>
> node_reset_state() is the function in charge of resetting pgdat's fields,
> and it is called when offline_pages() detects that the node becomes empty
> worth of memory.
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> ---
>  include/linux/memory_hotplug.h |  2 +-
>  mm/memory_hotplug.c            | 58 +++++++++++++++++++++-------------
>  mm/page_alloc.c                | 49 +++++-----------------------
>  3 files changed, 45 insertions(+), 64 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 76bf2de86def..fcf4c9a023cc 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -319,7 +319,7 @@ extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
> +extern bool pgdat_has_boot_nodestats(pg_data_t *pgdat);
>  extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>  extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>  extern int add_memory_resource(int nid, struct resource *resource,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index ddc62f8b591f..07cece9e22e4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1164,18 +1164,18 @@ static void reset_node_present_pages(pg_data_t *pgdat)
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
>  static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  {
> -	struct pglist_data *pgdat;
> +	struct pglist_data *pgdat = NODE_DATA(nid);
>
>  	/*
> -	 * NODE_DATA is preallocated (free_area_init) but its internal
> -	 * state is not allocated completely. Add missing pieces.
> -	 * Completely offline nodes stay around and they just need
> -	 * reintialization.
> +	 * NODE_DATA is preallocated (free_area_init), the only thing missing
> +	 * is to allocate its per_cpu_nodestats struct and to build node's
> +	 * zonelists. The allocation of per_cpu_nodestats only needs to be done
> +	 * the very first time this node is brought up, as we reset its state
> +	 * when all node's memory goes offline.
>  	 */
> -	pgdat = NODE_DATA(nid);
> -
> -	/* init node's zones as empty zones, we don't have any present pages.*/
> -	free_area_init_core_hotplug(pgdat);
> +	if (pgdat_has_boot_nodestats(pgdat))
> +		pgdat->per_cpu_nodestats = alloc_percpu_gfp(struct per_cpu_nodestat,
> +							    __GFP_ZERO);
>
>  	/*
>  	 * The node we allocated has no zone fallback lists. For avoiding
> @@ -1183,15 +1183,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  	 */
>  	build_all_zonelists(pgdat);
>
> -	/*
> -	 * When memory is hot-added, all the memory is in offline state. So
> -	 * clear all zones' present_pages because they will be updated in
> -	 * online_pages() and offline_pages().
> -	 * TODO: should be in free_area_init_core_hotplug?
> -	 */
> -	reset_node_managed_pages(pgdat);
> -	reset_node_present_pages(pgdat);
> -
>  	return pgdat;
>  }
>
> @@ -1799,6 +1790,30 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>  	node_clear_state(node, N_MEMORY);
>  }
>
> +static void node_reset_state(int node)
> +{
> +	pg_data_t *pgdat = NODE_DATA(node);
> +	int cpu;
> +
> +	kswapd_stop(node);
> +	kcompactd_stop(node);
> +
> +	reset_node_managed_pages(pgdat);
> +	reset_node_present_pages(pgdat);
> +
> +	pgdat->nr_zones = 0;
> +	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_highest_zoneidx = 0;
> +	pgdat->node_start_pfn = 0;

I'm confused why we have to mess with

* present pages
* managed pages
* node_start_pfn

here at all.

1) If any present pages were left, calling node_reset_state() would be
   a BUG.
2) If any managed pages were left, calling node_reset_state() would be
   a BUG.
3) node_start_pfn will be properly updated by
   remove_pfn_range_from_zone()->update_pgdat_span().

To be clear, I *think* touching node_start_pfn is very wrong. What if
the node still has ZONE_DEVICE memory? ZONE_DEVICE pages don't count
towards present pages, only towards spanned pages, and we'd be messing
with the start of the spanned range.

remove_pfn_range_from_zone()->update_pgdat_span() should be the only
place that modifies the spanned range when offlining.

-- 
Thanks,

David / dhildenb
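
To illustrate the ZONE_DEVICE concern, here is a toy userspace model, not
kernel code. All toy_* types and helpers are made up for this sketch; the
only thing borrowed from the kernel is the idea that the node span is
recomputed from the zones that still span something, roughly in the spirit
of update_pgdat_span(). The second zone stands in for ZONE_DEVICE: it spans
pages but has no present pages.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_ZONES 2	/* index 0: regular memory, index 1: device memory */

/* Hypothetical, heavily simplified stand-ins for struct zone / pg_data_t. */
struct toy_zone {
	unsigned long start_pfn;
	unsigned long spanned_pages;
	unsigned long present_pages;
};

struct toy_node {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
	struct toy_zone zones[TOY_NR_ZONES];
};

static unsigned long toy_node_present_pages(const struct toy_node *node)
{
	unsigned long sum = 0;

	for (int i = 0; i < TOY_NR_ZONES; i++)
		sum += node->zones[i].present_pages;
	return sum;
}

/* Recompute the node span from all zones that still span pages. */
static void toy_update_node_span(struct toy_node *node)
{
	unsigned long start = 0, end = 0;
	bool found = false;

	for (int i = 0; i < TOY_NR_ZONES; i++) {
		const struct toy_zone *z = &node->zones[i];
		unsigned long zend = z->start_pfn + z->spanned_pages;

		if (!z->spanned_pages)
			continue;
		if (!found || z->start_pfn < start)
			start = z->start_pfn;
		if (!found || zend > end)
			end = zend;
		found = true;
	}
	node->node_start_pfn = start;
	node->node_spanned_pages = end - start;
}

/* True if every zone that spans pages lies inside the node span. */
static bool toy_span_covers_zones(const struct toy_node *node)
{
	unsigned long nend = node->node_start_pfn + node->node_spanned_pages;

	for (int i = 0; i < TOY_NR_ZONES; i++) {
		const struct toy_zone *z = &node->zones[i];

		if (!z->spanned_pages)
			continue;
		if (z->start_pfn < node->node_start_pfn ||
		    z->start_pfn + z->spanned_pages > nend)
			return false;
	}
	return true;
}

/*
 * Offline all regular memory, then optionally zero node_start_pfn the way
 * the quoted node_reset_state() does. Returns whether the node span still
 * covers the remaining device zone.
 */
static bool toy_demo(bool reset_start_pfn)
{
	struct toy_node node = {
		.zones = {
			{ .start_pfn = 0x1000, .spanned_pages = 0x1000,
			  .present_pages = 0x1000 },
			{ .start_pfn = 0x8000, .spanned_pages = 0x1000,
			  .present_pages = 0 },
		},
	};

	toy_update_node_span(&node);

	/* Regular memory goes away; the span is shrunk from the zones. */
	node.zones[0].spanned_pages = 0;
	node.zones[0].present_pages = 0;
	toy_update_node_span(&node);

	/* The node now looks "empty": leftover present pages would be a bug. */
	assert(toy_node_present_pages(&node) == 0);

	if (reset_start_pfn)
		node.node_start_pfn = 0;

	return toy_span_covers_zones(&node);
}
```

In this model, toy_demo(false) leaves the span covering the device zone,
while toy_demo(true) does not: the node has no present pages, yet it still
spans the device range, and zeroing the start moves the span away from it.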