On Tue 23-08-22 15:50:23, David Hildenbrand wrote:
> On 23.08.22 15:25, Michal Hocko wrote:
> > On Tue 23-08-22 13:58:50, Mel Gorman wrote:
> >> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> >>> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> >>>> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> >>>>>> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> >>>>>>  #endif
> >>>>>>  	}
> >>>>>>  
> >>>>>> -	spin_unlock(&lock);
> >>>>>> +	write_sequnlock(&zonelist_update_seq);
> >>>>>>  }
> >>>>>>  
> >>>>>>  static noinline void __init
> >>>>>>
> >>>>>
> >>>>> LGTM. The "retry_cpuset" label might deserve a better name now.
> >>>>>
> >>>>
> >>>> Good point ... "restart"?
> >>>>
> >>>>> Would
> >>>>>
> >>>>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> >>>>> with pages managed by the buddy allocator")
> >>>>>
> >>>>> be correct?
> >>>>>
> >>>>
> >>>> Not specifically, because the bug is due to a zone being completely removed
> >>>> resulting in a rebuild. This race probably existed ever since memory
> >>>> hotremove could theoretically remove a complete zone. A Cc: Stable would
> >>>> be appropriate as it'll apply with fuzz back to at least 5.4.210, but beyond
> >>>> that, it should be driven by a specific bug report showing that hot-remove
> >>>> of a full zone was possible and triggered the race.
> >>>
> >>> I do not think so. 6aa303defb74 has changed the zonelist building and
> >>> changed the check from pfn range (populated) to managed (with memory).
> >>
> >> I'm not 100% convinced. The present_pages should have been the spanned range
> >> minus any holes that exist in the zone. If the zone is completely removed,
> >> the span should be zero, meaning present and managed are both zero. No?
> >
> > IIRC, and David will correct me if I am mixing this up, the difference
> > is that zonelists are rebuilt during memory offlining and that is when
> > managed pages are removed from the allocator. Zone itself still has that
> > physical range populated and so this patch would have made a difference.
> 
> To recap, memory offlining adjusts managed+present pages of the zone
> essentially in one go. If after the adjustments the zone is no longer
> populated (present==0), we rebuild the zone lists.
> 
> Once that's done, we try shrinking the zone (start+spanned pages) --
> which results in zone_start_pfn == 0 if there are no more pages. That
> happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
> 
> Note that populated_zone() checks for present_pages. The actual zone
> span (e.g., spanned_pages) is a different story and not of interest when
> building zones or wanting to allocate memory.
> 
> > Now, you are right that this is likely possible even without that commit,
> > but it is highly unlikely because physical hotremove is a very rare
> > operation and the race window would be so small that it would likely be
> > unfeasible to hit.
> 
> I think I agree that 6aa303defb74 is most likely not the origin of this.
> It could only have been the origin in weird corner cases where we
> actually succeed offlining one memory block (adjust present+managed) and
> end up with managed=0 and present!=0 -- which barely happens in
> practice, especially for ZONE_MOVABLE. (Yeah, there is memory ballooning
> that adjusts managed pages dynamically and might provoke such a
> situation on ZONE_MOVABLE.)

OK, thanks for the correction David. Then I would agree that a Fixes tag
could be more confusing than helpful, and your above summary would be a
great part of the changelog.

Thanks!
-- 
Michal Hocko
SUSE Labs