On Thu 07-04-22 13:17:19, Juergen Gross wrote:
> On 07.04.22 13:07, Michal Hocko wrote:
> > On Thu 07-04-22 12:45:41, Juergen Gross wrote:
> > > On 07.04.22 12:34, Michal Hocko wrote:
> > > > Ccing Mel
> > > > 
> > > > On Thu 07-04-22 11:32:21, Juergen Gross wrote:
> > > > > Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> > > > > initialization") only zones with free memory are included in a built
> > > > > zonelist. This is problematic when e.g. all memory of a zone has been
> > > > > ballooned out.
> > > > 
> > > > What is the actual problem there?
> > > 
> > > When running as a Xen guest, newly hotplugged memory will not be onlined
> > > automatically, but only on special request. This is done in order to
> > > support e.g. adding the possibility to use another GB of memory, while
> > > adding only a part of that memory initially.
> > > 
> > > In case adding that memory populates a new zone, the page allocator
> > > won't be able to use this memory when it is onlined, as the zone wasn't
> > > added to the zonelist, due to managed_zone() returning 0.
> > 
> > How is that memory onlined? Because "regular" onlining (online_pages())
> > does rebuild zonelists if their zone hasn't been populated before.
> 
> The Xen balloon driver has its own callback for onlining pages. The pages
> are just added to the ballooned-out page list without handing them to the
> allocator. This is done only when the guest is ballooned up.

OK, I see. Let me just rephrase to see whether we are on the same page.
Xen overrides online_page_callback with xen_online_page, which doesn't
free pages to the page allocator, so a zone might remain unpopulated
after onlining. This means the default zonelist rebuild is not done, and
when those pages are finally released to the allocator later on, no
build_all_zonelists happens either, so the freed pages are not really
visible to the allocator via zonelist fallback allocation.

Now to your patch.
I suspect this is not sufficient for the full hotplug situation. Consider
a new NUMA node being hotadded. hotadd_new_pgdat will call
build_all_zonelists, but the zone is not populated yet at that moment,
unless I am missing something. We do rely on online_pages to rebuild once
pages are onlined - which usually means they are freed to the page
allocator.

The zonelists building is kinda messy TBH. I have to say that I am not
really clear on Mel's 6aa303defb74 ("mm, vmscan: only allocate and
reclaim from zones with pages managed by the buddy allocator") because,
as you have said, an unpopulated zone is not (or shouldn't be) really all
that different from a depleted zone.

I think a better and more complete fix would be the following. In other
words, the zonelists will be built for all present zones. Not sure
whether that is going to break 6aa303defb74 though.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..880c455e2557 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1062,7 +1062,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 			struct zone *zone, struct memory_group *group)
 {
 	unsigned long flags;
-	int need_zonelists_rebuild = 0;
 	const int nid = zone_to_nid(zone);
 	int ret;
 	struct memory_notify arg;
@@ -1106,17 +1105,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * This means the page allocator ignores this zone.
 	 * So, zonelist must be updated after online.
 	 */
-	if (!populated_zone(zone)) {
-		need_zonelists_rebuild = 1;
+	if (!populated_zone(zone))
 		setup_zone_pageset(zone);
-	}
 
 	online_pages_range(pfn, nr_pages);
 	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
 
 	node_states_set_node(nid, &arg);
-	if (need_zonelists_rebuild)
-		build_all_zonelists(NULL);
 
 	/*
 	 * Basic onlining is complete, allow allocation of onlined pages.
	 */
	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
@@ -1985,10 +1980,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	/* reinitialise watermarks and update pcp limits */
 	init_per_zone_wmark_min();
 
-	if (!populated_zone(zone)) {
+	if (!populated_zone(zone))
 		zone_pcp_reset(zone);
-		build_all_zonelists(NULL);
-	}
 
 	node_states_clear_node(node, &arg);
 	if (arg.status_change_nid >= 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..130a2feceddc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (managed_zone(zone)) {
-			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
-			check_highest_zone(zone_type);
-		}
+		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
+		check_highest_zone(zone_type);
 	} while (zone_type);
 
 	return nr_zones;
-- 
Michal Hocko
SUSE Labs