On Tue 23-08-22 16:14:15, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > > It could only have been the origin in weird corner cases where we
> > > actually succeed offlining one memory block (adjust present+managed) and
> > > end up with managed=0 and present!=0 -- which barely happens in
> > > practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> > > that adjusts managed pages dynamically and might provoke such a
> > > situation on ZONE_MOVABLE)
> >
> > OK, thanks for the correction David. Then I would agree that Fixes tag
> > could be more confusing than helpful and your above summary would be a
> > great part of the changelog.
>
> Given that 6aa303defb74 still makes it potentially worse, it's as good a
> Fixes target as any given that anything prior to that commit would need
> careful examination. The race changes shape going further back in time
> until memory hot-remove was initially added and if someone needs to go
> that far back, they'll also need to check if the ZLC needs special
> treatment.
>
> Provisional patch and changelog is below. I'd still like to get a Tested-by
> from Patrick to confirm it still fixes the problem before posting formally.
>
> --8<--
> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>
> Patrick Daly reported the following problem:
>
> 	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> 	[0] - ZONE_MOVABLE
> 	[1] - ZONE_NORMAL
> 	[2] - NULL
>
> 	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
> 	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
> 	memory_offline operation removes the last page from ZONE_MOVABLE,
> 	build_all_zonelists() & build_zonerefs_node() will update
> 	node_zonelists as shown below. Only populated zones are added.
>
> 	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> 	[0] - ZONE_NORMAL
> 	[1] - NULL
> 	[2] - NULL
>
> The race is simple -- page allocation could be in progress when a memory
> hot-remove operation triggers a zonelist rebuild that removes zones. The
> allocation request will still have a valid ac->preferred_zoneref that is
> now pointing to NULL and triggers an OOM kill.
>
> This problem probably always existed but may be slightly easier to trigger
> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator") which distinguishes between
> zones that are completely unpopulated versus zones that have valid pages
> but they are all reserved. Memory hotplug had multiple stages with timing
> considerations around managed/present page updates, the zonelist rebuild
> and the zone span updates. As David Hildenbrand puts it
>
> 	memory offlining adjusts managed+present pages of the zone
> 	essentially in one go. If after the adjustments, the zone is no
> 	longer populated (present==0), we rebuild the zone lists.
>
> 	Once that's done, we try shrinking the zone (start+spanned
> 	pages) -- which results in zone_start_pfn == 0 if there are no
> 	more pages. That happens *after* rebuilding the zonelists via
> 	remove_pfn_range_from_zone().
>
> The only requirement to fix the race is that a page allocation request
> identifies when a zonelist rebuild has happened since the allocation
> request started and no page has yet been allocated.
> Use a seqlock_t to track zonelist updates with a lockless read-side of the
> zonelist and protecting the rebuild and update of the counter with a
> spinlock.
>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
> Reported-by: Patrick Daly <quic_pdaly@xxxxxxxxxxx>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> ---
>  mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
>  1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..216e21048ddf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>  #endif
>
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries. Reader side does not lock and
> + * retries the allocation if zonelist changes. Writer side is protected by the
> + * embedded spin_lock.
> + */
> +DEFINE_SEQLOCK(zonelist_update_seq);
> +
> +static unsigned int zonelist_iter_begin(void)
> +{

You likely want something like

	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
		return read_seqbegin(&zonelist_update_seq);
	return 0;

> +	return read_seqbegin(&zonelist_update_seq);
> +}
> +
> +static unsigned int check_retry_zonelist(unsigned int seq)
> +{

	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
		return read_seqretry(&zonelist_update_seq, seq);
	return seq;

> +	return read_seqretry(&zonelist_update_seq, seq);
> +}
> +

to avoid overhead on systems without HOTREMOVE configured.
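Spelled out, the two read-side helpers would then look something like the
sketch below. This is only an illustration (untested) that reuses the
zonelist_update_seq seqlock defined in the hunk above; with
CONFIG_MEMORY_HOTREMOVE=n the IS_ENABLED() check is a compile-time constant,
so the seqlock reads should be optimised away and the allocation slow path
pays nothing on such configurations:

	/*
	 * Sketch only: same shape as the existing cpuset_mems_cookie
	 * handling. The seqlock read side is compiled out when memory
	 * hot-remove is not supported, so zonelist_iter_begin() and
	 * check_retry_zonelist() become no-ops there.
	 */
	static unsigned int zonelist_iter_begin(void)
	{
		if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
			return read_seqbegin(&zonelist_update_seq);

		return 0;
	}

	static unsigned int check_retry_zonelist(unsigned int seq)
	{
		if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
			return read_seqretry(&zonelist_update_seq, seq);

		return seq;
	}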
Other than that LGTM.

Acked-by: Michal Hocko <mhocko@xxxxxxxx>

Thanks!

>  /* Perform direct synchronous page reclaim */
>  static unsigned long
>  __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int compaction_retries;
>  	int no_progress_loops;
>  	unsigned int cpuset_mems_cookie;
> +	unsigned int zonelist_iter_cookie;
>  	int reserve_flags;
>
>  	/*
> @@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		gfp_mask &= ~__GFP_ATOMIC;
>
> -retry_cpuset:
> +restart:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
>  	compact_priority = DEF_COMPACT_PRIORITY;
>  	cpuset_mems_cookie = read_mems_allowed_begin();
> +	zonelist_iter_cookie = zonelist_iter_begin();
>
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
> @@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto retry;
>
>
> -	/* Deal with possible cpuset update races before we start OOM killing */
> -	if (check_retry_cpuset(cpuset_mems_cookie, ac))
> -		goto retry_cpuset;
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * an unnecessary OOM kill.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
>
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> -	static DEFINE_SPINLOCK(lock);
>
> -	spin_lock(&lock);
> +	write_seqlock(&zonelist_update_seq);
>
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>
> -	spin_unlock(&lock);
> +	write_sequnlock(&zonelist_update_seq);
>  }
>
>  static noinline void __init
-- 
Michal Hocko
SUSE Labs