On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> > On 9/20/23 03:38, Zi Yan wrote:
> > > On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> > >
> > >> On 09/19/23 16:57, Zi Yan wrote:
> > >>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> > >>>
> > >>>> --- a/mm/page_alloc.c
> > >>>> +++ b/mm/page_alloc.c
> > >>>> @@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> > >>>>  	end = pageblock_end_pfn(pfn) - 1;
> > >>>>
> > >>>>  	/* Do not cross zone boundaries */
> > >>>> +#if 0
> > >>>>  	if (!zone_spans_pfn(zone, start))
> > >>>>  		start = zone->zone_start_pfn;
> > >>>> +#else
> > >>>> +	if (!zone_spans_pfn(zone, start))
> > >>>> +		start = pfn;
> > >>>> +#endif
> > >>>>  	if (!zone_spans_pfn(zone, end))
> > >>>>  		return false;
> > >>>>
> > >>>> I can still trigger warnings.
> > >>>
> > >>> OK. One thing to note is that the page type in the warning changed from
> > >>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> > >>>
> > >>
> > >> Just to be really clear,
> > >> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> > >> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> > >>   path WITHOUT your change.
> > >>
> > >> I am guessing the difference here has more to do with the allocation path?
> > >>
> > >> I went back and reran focusing on the specific migrate type.
> > >> Without your patch, and coming from the alloc_contig_range call path,
> > >> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> > >> With your patch I got one 'page type is 0, passed migratetype is 1'
> > >> warning and one 'page type is 1, passed migratetype is 0' warning.
> > >>
> > >> I could be wrong, but I do not think your patch changes things.
> > >
> > > Got it. Thanks for the clarification.
> > >>
> > >>>>
> > >>>> One idea about recreating the issue is that it may have to do with size
> > >>>> of my VM (16G) and the requested allocation sizes 4G. However, I tried
> > >>>> to really stress the allocations by increasing the number of hugetlb
> > >>>> pages requested and that did not help. I also noticed that I only seem
> > >>>> to get two warnings and then they stop, even if I continue to run the
> > >>>> script.
> > >>>>
> > >>>> Zi asked about my config, so it is attached.
> > >>>
> > >>> With your config, I still have no luck reproducing the issue. I will keep
> > >>> trying. Thanks.
> > >>>
> > >>
> > >> Perhaps try running both scripts in parallel?
> > >
> > > Yes. It seems to do the trick.
> > >
> > >> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> > >
> > > I am able to reproduce it with the script below:
> > >
> > > while true; do
> > >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> > >  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> > >  wait
> > >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > >  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > > done
> > >
> > > I will look into the issue.
> >
> > Nice!
>
> I managed to reproduce it ONCE, triggering it not even a second after
> starting the script. But I can't seem to do it twice, even after
> several reboots and letting it run for minutes.

I managed to reproduce it reliably by cutting both nr_hugepages
parameters in half. The warning that triggers for me is always the
MIGRATE_ISOLATE one.
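For readers decoding those 'page type is X, passed migratetype is Y'
messages above: the check compares the pageblock's current migratetype
against the type of the freelist the caller is operating on. A
hypothetical paraphrase of its shape (check_freelist_mt is a made-up
name for illustration, not the literal code from the series):

	/*
	 * Hypothetical paraphrase of the freelist type check discussed
	 * above, NOT the actual kernel code: when a page is moved on or
	 * off a free list, verify that its pageblock's migratetype
	 * still matches the list the caller thinks it is on.
	 */
	static inline void check_freelist_mt(struct page *page, int passed_mt)
	{
		int page_mt = get_pageblock_migratetype(page);

		WARN_ONCE(page_mt != passed_mt,
			  "page type is %d, passed migratetype is %d\n",
			  page_mt, passed_mt);
	}

So a 'page type is 5' hit (MIGRATE_ISOLATE in this config) means the
pageblock had already been marked isolated by the time the freelist
operation inspected it.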
With some printk-tracing, the scenario seems to be this:

#0                                     #1
start_isolate_page_range()
  isolate_single_pageblock()
    set_migratetype_isolate(tail)
      lock zone->lock
      move_freepages_block(tail) // nop
      set_pageblock_migratetype(tail)
      unlock zone->lock
                                       del_page_from_freelist(head)
                                       expand(head, head_mt)
                                         WARN(head_mt != tail_mt)
    start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
    for (pfn = start_pfn, pfn < end_pfn)
      if (PageBuddy())
        split_free_page(head)

IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
lock. The move_freepages_block() does nothing because the PageBuddy()
is set on the pageblock to the left. Once we drop the lock, the buddy
gets allocated and the expand() puts things on the wrong list. The
splitting code that handles MAX_ORDER blocks runs *after* the tail
type is set and the lock has been dropped, so it's too late.

I think this would work fine if we always set MIGRATE_ISOLATE in a
linear fashion, with start and end aligned to MAX_ORDER. Then we also
wouldn't have to split things. There are two reasons this doesn't
happen today:

1. The isolation range is rounded to pageblocks, not MAX_ORDER. In
   this test case they always seem aligned, but that isn't guaranteed.
   (A small worked example of the two rounding policies follows the
   patch below.)

2. More importantly, start_isolate_page_range() explicitly breaks
   ordering by doing the last block in the range before the center.
   It's that last block that triggers the race with
   __rmqueue_smallest -> expand() for me.

With the below patch I can no longer reproduce the issue:

---

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b5c7a9d21257..b7c8730bf0e2 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -538,8 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long pfn;
 	struct page *page;
 	/* isolation is done at page block granularity */
-	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-	unsigned long isolate_end = pageblock_align(end_pfn);
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
 	int ret;
 	bool skip_isolation = false;

@@ -549,17 +549,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	if (ret)
 		return ret;

-	if (isolate_start == isolate_end - pageblock_nr_pages)
-		skip_isolation = true;
-
-	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
-				       skip_isolation, migratetype);
-	if (ret) {
-		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
-		return ret;
-	}
-
 	/* skip isolated pageblocks at the beginning and end */
 	for (pfn = isolate_start + pageblock_nr_pages;
 	     pfn < isolate_end - pageblock_nr_pages;
@@ -568,12 +557,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 						    start_pfn, end_pfn)) {
 			undo_isolate_page_range(isolate_start, pfn, migratetype);
-			unset_migratetype_isolate(
-				pfn_to_page(isolate_end - pageblock_nr_pages),
-				migratetype);
 			return -EBUSY;
 		}
 	}
+
+	if (isolate_start == isolate_end - pageblock_nr_pages)
+		skip_isolation = true;
+
+	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
+				       skip_isolation, migratetype);
+	if (ret) {
+		undo_isolate_page_range(isolate_start, pfn, migratetype);
+		return ret;
+	}
+
 	return 0;
 }

@@ -591,8 +589,8 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
-	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-	unsigned long isolate_end = pageblock_align(end_pfn);
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);

 	for (pfn = isolate_start;
 	     pfn < isolate_end;
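To make reason 1 above concrete, here is a quick userspace sketch of
the two rounding policies. The geometry is an assumption for
illustration (4K pages, 2M pageblocks, 4M MAX_ORDER blocks, i.e.
pageblock_nr_pages = 512 and MAX_ORDER_NR_PAGES = 1024), and the
macros are power-of-two simplifications of the kernel's
ALIGN()/ALIGN_DOWN():

	#include <stdio.h>

	#define PAGEBLOCK_NR_PAGES	512UL	/* assumed: 2M pageblocks */
	#define MAX_ORDER_NR_PAGES	1024UL	/* assumed: 4M max buddies */

	/* power-of-two versions of the kernel's alignment macros */
	#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
	#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))

	int main(void)
	{
		/* pageblock-aligned, but NOT MAX_ORDER-aligned */
		unsigned long start_pfn = 512, end_pfn = 1536;

		/* current rounding: isolate_start can land on the tail
		 * pageblock of a free MAX_ORDER buddy whose head (and
		 * PageBuddy flag) sits in the block to the left */
		printf("pageblock rounding: [%lu, %lu)\n",
		       ALIGN_DOWN(start_pfn, PAGEBLOCK_NR_PAGES),
		       ALIGN(end_pfn, PAGEBLOCK_NR_PAGES));	/* [512, 1536) */

		/* patched rounding: no buddy can straddle the boundary */
		printf("MAX_ORDER rounding: [%lu, %lu)\n",
		       ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES),
		       ALIGN(end_pfn, MAX_ORDER_NR_PAGES));	/* [0, 2048) */
		return 0;
	}

With pageblock rounding, [512, 1024) can be marked MIGRATE_ISOLATE
while a free order-10 buddy at pfn 0 still covers it, which is exactly
the state the trace above shows.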