On 23.08.22 17:38, Michal Hocko wrote:
> On Tue 23-08-22 16:14:15, Mel Gorman wrote:
>> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
>>>> I think I agree that 6aa303defb74 is most likely not the origin of this.
>>>> It could only have been the origin in weird corner cases where we
>>>> actually succeed offlining one memory block (adjust present+managed) and
>>>> end up with managed=0 and present!=0 -- which barely happens in
>>>> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
>>>> that adjusts managed pages dynamically and might provoke such a
>>>> situation on ZONE_MOVABLE)
>>>
>>> OK, thanks for the correction David. Then I would agree that Fixes tag
>>> could be more confusing than helpful and your above summary would be a
>>> great part of the changelog.
>>>
>>
>> Given that 6aa303defb74 still makes it potentially worse, it's as good a
>> Fixes-by as any given that anything prior to that commit would need careful
>> examination. The race changes shape going further back in time until memory
>> hot-remove was initially added and if someone needs to go that far back,
>> they'll also need to check if the ZLC needs special treatment.
>>
>> Provisional patch and changelog is below. I'd still like to get a Tested-by
>> from Patrick to confirm it still fixes the problem before posting formally.
>>
>> --8<--
>> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>>
>> Patrick Daly reported the following problem;
>>
>> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
>> [0] - ZONE_MOVABLE
>> [1] - ZONE_NORMAL
>> [2] - NULL
>>
>> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
>> offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
>> memory_offline operation removes the last page from ZONE_MOVABLE,
>> build_all_zonelists() & build_zonerefs_node() will update
>> node_zonelists as shown below. Only populated zones are added.
>>
>> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
>> [0] - ZONE_NORMAL
>> [1] - NULL
>> [2] - NULL
>>
>> The race is simple -- page allocation could be in progress when a memory
>> hot-remove operation triggers a zonelist rebuild that removes zones.
>> The allocation request will still have a valid ac->preferred_zoneref that
>> is now pointing to NULL and triggers an OOM kill.
>>
>> This problem probably always existed but may be slighly easier to trigger

s/slighly/slightly/

>> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
>> with pages managed by the buddy allocator") which distinguishes between
>> zones that are completely unpopulated versus zones that have valid pages
>> but they are all reserved. Memory hotplug had multiple stages with

Not necessarily reserved, simply not managed by the buddy (e.g., early
allocations, memory ballooning / virtio-mem).

>> timing considerations around managed/present page updates, the zonelist
>> rebuild and the zone span updates. As David Hildenbrand puts it
>>
>> 	memory offlining adjusts managed+present pages of the zone
>> 	essentially in one go. If after the adjustments, the zone is no
>> 	longer populated (present==0), we rebuild the zone lists.
>>
>> 	Once that's done, we try shrinking the zone (start+spanned
>> 	pages) -- which results in zone_start_pfn == 0 if there are no
>> 	more pages. That happens *after* rebuilding the zonelists via
>> 	remove_pfn_range_from_zone().
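The hazard is easy to model outside the kernel: a zonelist is a
NULL-terminated array of zonerefs and ac->preferred_zoneref is a cached
pointer into that array. A minimal userspace sketch -- the struct names
only mirror the kernel's, this is a simplified illustration, not kernel
code:

#include <stdio.h>

struct zone { const char *name; };
struct zoneref { struct zone *zone; };

/* Simplified stand-in for the fallback zonelist walk. */
static struct zone *first_usable_zone(struct zoneref *preferred)
{
	/* the walk stops at the first zoneref with zone == NULL */
	for (struct zoneref *z = preferred; z->zone; z++)
		return z->zone;		/* eligibility checks elided */
	return NULL;			/* "no zones left" -> spurious OOM */
}

int main(void)
{
	struct zone movable = { "Movable" }, normal = { "Normal" };
	struct zoneref zonelist[3] = { { &movable }, { &normal }, { NULL } };
	struct zoneref *preferred = &zonelist[1];	/* cached: ZONE_NORMAL */

	/* concurrent offline: the rebuild repacks the array in place,
	 * so the cached pointer now references the NULL terminator */
	zonelist[0].zone = &normal;
	zonelist[1].zone = NULL;

	struct zone *z = first_usable_zone(preferred);
	printf("usable zone: %s\n", z ? z->name : "none (spurious OOM)");
	return 0;
}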
>>
>> The only requirement to fix the race is that a page allocation request
>> identifies when a zonelist rebuild has happened since the allocation
>> request started and no page has yet been allocated. Use a seqlock_t to track
>> zonelist updates with a lockless read-side of the zonelist and protecting
>> the rebuild and update of the counter with a spinlock.
>>
>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
>> Reported-by: Patrick Daly <quic_pdaly@xxxxxxxxxxx>
>> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> ---
>>  mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
>>  1 file changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e5486d47406e..216e21048ddf 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>>  #endif
>>
>> +/*
>> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
>> + * have been rebuilt so allocation retries. Reader side does not lock and
>> + * retries the allocation if zonelist changes. Writer side is protected by the
>> + * embedded spin_lock.
>> + */
>> +DEFINE_SEQLOCK(zonelist_update_seq);
>> +
>> +static unsigned int zonelist_iter_begin(void)
>> +{
>
> You likely want something like
> 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> 		return read_seqbegin(&zonelist_update_seq);
> 	return 0;
>
>> +	return read_seqbegin(&zonelist_update_seq);
>> +}
>> +
>> +static unsigned int check_retry_zonelist(unsigned int seq)
>> +{
>
> 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> 		return read_seqretry(&zonelist_update_seq, seq);
> 	return seq;
>
>> +	return read_seqretry(&zonelist_update_seq, seq);
>> +}
>> +
>
> to avoid overhead on systems without HOTREMOVE configured.
>
> Other than that LGTM.
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>

Makes sense to me, although I wonder how much it will matter in practice.

Reviewed-by: David Hildenbrand <david@xxxxxxxxxx>

-- 
Thanks,

David / dhildenb
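For reference, the quoted diff elides the hunks that actually use the two
new helpers. Below is a sketch of the intended wiring, reconstructed from
the changelog description -- the function shapes are assumptions for
illustration, not the literal patch:

/*
 * Writer side (sketch): serialize the zonelist rebuild with the seqlock
 * so that concurrent readers can detect a rebuild happened under them.
 */
static void __build_all_zonelists(void *data)
{
	write_seqlock(&zonelist_update_seq);
	/* ... rebuild node_zonelists for the affected nodes ... */
	write_sequnlock(&zonelist_update_seq);
}

/*
 * Reader side (sketch): sample the counter before the retry loop; if no
 * page was allocated and the counter moved, restart the slow path with
 * a re-evaluated ac->preferred_zoneref instead of declaring OOM.
 */
static struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
					   struct alloc_context *ac)
{
	struct page *page = NULL;
	unsigned int zonelist_iter_cookie;

restart:
	zonelist_iter_cookie = zonelist_iter_begin();
	/* ... existing reclaim/compaction retry loop ... */

	if (!page && check_retry_zonelist(zonelist_iter_cookie))
		goto restart;

	return page;
}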