On 07.04.22 13:40, Michal Hocko wrote: > On Thu 07-04-22 13:17:19, Juergen Gross wrote: >> On 07.04.22 13:07, Michal Hocko wrote: >>> On Thu 07-04-22 12:45:41, Juergen Gross wrote: >>>> On 07.04.22 12:34, Michal Hocko wrote: >>>>> Ccing Mel >>>>> >>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote: >>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist >>>>>> initialization") only zones with free memory are included in a built >>>>>> zonelist. This is problematic when e.g. all memory of a zone has been >>>>>> ballooned out. >>>>> >>>>> What is the actual problem there? >>>> >>>> When running as Xen guest new hotplugged memory will not be onlined >>>> automatically, but only on special request. This is done in order to >>>> support adding e.g. the possibility to use another GB of memory, while >>>> adding only a part of that memory initially. >>>> >>>> In case adding that memory is populating a new zone, the page allocator >>>> won't be able to use this memory when it is onlined, as the zone wasn't >>>> added to the zonelist, due to managed_zone() returning 0. >>> >>> How is that memory onlined? Because "regular" onlining (online_pages()) >>> does rebuild zonelists if their zone hasn't been populated before. >> >> The Xen balloon driver has an own callback for onlining pages. The pages >> are just added to the ballooned-out page list without handing them to the >> allocator. This is done only when the guest is ballooned up. > > OK, I see. Let me just rephrase to see whether we are on the same page. > Xen is overriding the online_page_callback to xen_online_page which > doesn't free pages to the page allocator which means that a zone might > remain unpopulated after onlining. This means that the default zone > lists rebuild is not done and later on when those pages are finally > released to the allocator there is no build_all_zonelists happening so > those freed pages are not really visible to the allocator via zonelists > fallback allocation. > > Now to your patch. I suspect this is not sufficient for the full hotplug > situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat > will call build_all_zonelists but the zone is not populated yet at that > moment unless I am missing something. We do rely on online_pages to > rebuild once pages are onlined - which usually means they are freed to > the page allocator. > > The zonelists building is kinda messy TBH. I have to say that I am not > really clear on Mel's 6aa303defb74 ("mm, vmscan: only allocate and > reclaim from zones with pages managed by the buddy allocator") because > as you have said unpoppulated zone is not (or shouldn't be) really all > that different from a depleted zone. > > I think a better and more complete fix would be the following. In other > words the zonelists will be built for all present zones. Not sure > whether that is going to break 6aa303defb74 though. > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 2a9627dc784c..880c455e2557 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1062,7 +1062,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, > struct zone *zone, struct memory_group *group) > { > unsigned long flags; > - int need_zonelists_rebuild = 0; > const int nid = zone_to_nid(zone); > int ret; > struct memory_notify arg; > @@ -1106,17 +1105,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, > * This means the page allocator ignores this zone. > * So, zonelist must be updated after online. > */ > - if (!populated_zone(zone)) { > - need_zonelists_rebuild = 1; > + if (!populated_zone(zone)) > setup_zone_pageset(zone); > - } > > online_pages_range(pfn, nr_pages); > adjust_present_page_count(pfn_to_page(pfn), group, nr_pages); > > node_states_set_node(nid, &arg); > - if (need_zonelists_rebuild) > - build_all_zonelists(NULL); > > /* Basic onlining is complete, allow allocation of onlined pages. */ > undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE); > @@ -1985,10 +1980,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, > /* reinitialise watermarks and update pcp limits */ > init_per_zone_wmark_min(); > > - if (!populated_zone(zone)) { > + if (!populated_zone(zone)) > zone_pcp_reset(zone); > - build_all_zonelists(NULL); > - } > > node_states_clear_node(node, &arg); > if (arg.status_change_nid >= 0) { > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3589febc6d31..130a2feceddc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs) > do { > zone_type--; > zone = pgdat->node_zones + zone_type; > - if (managed_zone(zone)) { > - zoneref_set_zone(zone, &zonerefs[nr_zones++]); > - check_highest_zone(zone_type); > - } > + zoneref_set_zone(zone, &zonerefs[nr_zones++]); > + check_highest_zone(zone_type); > } while (zone_type); > > return nr_zones; I don't think having !populated zones in the zonelist is a particularly good idea. Populated vs !populated changes only during page onlininge/offlining. If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ... I'd vote for going with the simple fix first, which should be good enough AFAIKT. -- Thanks, David / dhildenb