Re: page type is 3, passed migratetype is 1 (nr=512)

Johannes Weiner <hannes@xxxxxxxxxxx> · Wed, 29 May 2024 21:04:19 -0400

On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
> Thanks for the config, I was able to reproduce it with.
> 
> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> > Ah, but there DOES seem to be an issue with how we reserve
> > highatomics: reserving and unreserving happens one pageblock at a
> > time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> > request, reserve_highatomic_block() will only convert the first
> > order-9 block in it; the tail will remain the original type, which
> > will produce a buddy of mixed type blocks upon freeing.
> > 
> > This doesn't fully explain the warning here. We'd expect to see it the
> > other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> > remainder that is actually 1 (MOVABLE). But the pageblock-based
> > reservations look fishy. I'll cook up a patch to make this
> > range-based. It might just fix it in a way I'm not seeing just yet.
> 
> tl;dr: With some debugging printks, I was able to see the
> issue. Should be a straight-forward fix.
> 
> No order-10 allocations are necessary. Instead, smaller allocations
> grab blocks for the highatomic pool. Unreserving is lazy, so as those
> allocations get freed, they have a chance to merge together. Two
> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
> unreserve, we now have an order-10 highatomic buddy, but only clear
> the type on the first order-9 pageblock. A subsequent alloc + expand
> will warn about this type inconsistency.
> 
> [  411.188518] UNRESERVE: pfn=26000 order=10
> [  411.188739] ------------[ cut here ]------------
> [  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
> [  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
> 
> I have a draft patch to make the highatomic reservations update all
> blocks inside the range, not just the first one. I'll send it out as
> soon as I have tested it properly.

The below fixes the issue for me and survives a few hours of
xfstest/generic/176 for me.

Christoph, can you please confirm it addresses what you saw?

Vlastimil, Zi, Mel, can you please take a look?

Thanks!

---

>From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Wed, 29 May 2024 18:18:12 -0400
Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies

Christoph reports a page allocator splat triggered by xfstests:

generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[] XFS (nvme0n1): Ending clean mount
[] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Ending clean mount
[] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[] XFS (nvme1n1): Ending clean mount
[] ------------[ cut here ]------------
[] page type is 3, passed migratetype is 1 (nr=512)
[] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[] RIP: 0010:expand+0x1c5/0x1f0
[] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[] PKRU: 55555554
[] Call Trace:
[]  <TASK>
[]  ? __warn+0x7b/0x120
[]  ? expand+0x1c5/0x1f0
[]  ? report_bug+0x191/0x1c0
[]  ? handle_bug+0x3c/0x80
[]  ? exc_invalid_op+0x17/0x70
[]  ? asm_exc_invalid_op+0x1a/0x20
[]  ? expand+0x1c5/0x1f0
[]  ? expand+0x1c5/0x1f0
[]  __rmqueue_pcplist+0x3a9/0x730
[]  get_page_from_freelist+0x7a0/0xf00
[]  __alloc_pages_noprof+0x153/0x2e0
[]  __folio_alloc_noprof+0x10/0xa0
[]  __filemap_get_folio+0x16b/0x370
[]  iomap_write_begin+0x496/0x680

While trying to service a movable allocation (page type 1), the page
allocator runs into a two-pageblock buddy on the movable freelist
whose second block is typed as highatomic (page type 3).

This inconsistency is caused by the highatomic reservation system
operating on single pageblocks, while MAX_ORDER can be bigger than
that - in this configuration, pageblock_order is 9 while
MAX_PAGE_ORDER is 10. The test case is observed to make several
adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
marks the surrounding block as highatomic. Upon freeing, the blocks
merge into an order-10 buddy. When the highatomic pool is drained
later on, this order-10 buddy gets moved back to the movable list, but
only the first pageblock is marked movable again. A subsequent
expand() of this buddy warns about the tail being of a different type.

A similar warning could trigger on an order-10 request that only marks
the first pageblock as highatomic and leaves the second one movable.

This is a long-standing bug that's surfaced by the recent block type
warnings added to the allocator. The consequences seem mostly benign,
it just results in odd behavior: the highatomic tail blocks are not
properly drained, instead they end up on the movable list first, then
go back to the highatomic list after an alloc-free cycle.

To fix this, make the highatomic reservation code aware that
allocations/buddies can be larger than a pageblock.

While it's an old quirk, the recently added type consistency warnings
seem to be the most prominent consequence of it. Set the Fixes: tag
accordingly to highlight this backporting dependency.

Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
Reported-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..754ca5821266 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 }
 
 /*
- * Reserve a pageblock for exclusive use of high-order atomic allocations if
- * there are no empty page blocks that contain a page with a suitable order
+ * Reserve the pageblock(s) surrounding an allocation request for
+ * exclusive use of high-order atomic allocations if there are no
+ * empty page blocks that contain a page with a suitable order
  */
-static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
+static void reserve_highatomic_pageblock(struct page *page, int order,
+					 struct zone *zone)
 {
 	int mt;
 	unsigned long max_managed, flags;
@@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
 	/* Only reserve normal pageblocks (i.e., they can merge with others) */
-	if (migratetype_is_mergeable(mt))
-		if (move_freepages_block(zone, page, mt,
-					 MIGRATE_HIGHATOMIC) != -1)
-			zone->nr_reserved_highatomic += pageblock_nr_pages;
+	if (!migratetype_is_mergeable(mt))
+		goto out_unlock;
+
+	if (order < pageblock_order) {
+		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+			goto out_unlock;
+		zone->nr_reserved_highatomic += pageblock_nr_pages;
+	} else {
+		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
+		zone->nr_reserved_highatomic += 1 << order;
+	}
 
 out_unlock:
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
  *
- * If @force is true, try to unreserve a pageblock even though highatomic
+ * If @force is true, try to unreserve pageblocks even though highatomic
  * pageblock is exhausted.
  */
 static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
@@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * adjust the count once.
 			 */
 			if (is_migrate_highatomic(mt)) {
+				unsigned long size;
 				/*
 				 * It should never happen but changes to
 				 * locking could inadvertently allow a per-cpu
@@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 				 * while unreserving so be safe and watch for
 				 * underflows.
 				 */
-				zone->nr_reserved_highatomic -= min(
-						pageblock_nr_pages,
-						zone->nr_reserved_highatomic);
+				size = max(pageblock_nr_pages, 1UL << order);
+				size = min(size, zone->nr_reserved_highatomic);
+				zone->nr_reserved_highatomic -= size;
 			}
 
 			/*
@@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * of pageblocks that cannot be completely freed
 			 * may increase.
 			 */
-			ret = move_freepages_block(zone, page, mt,
-						   ac->migratetype);
+			if (order < pageblock_order)
+				ret = move_freepages_block(zone, page, mt,
+							   ac->migratetype);
+			else {
+				move_to_free_list(page, zone, order, mt,
+						  ac->migratetype);
+				change_pageblock_range(page, order,
+						       ac->migratetype);
+				ret = 1;
+			}
 			/*
-			 * Reserving this block already succeeded, so this should
-			 * not fail on zone boundaries.
+			 * Reserving the block(s) already succeeded,
+			 * so this should not fail on zone boundaries.
 			 */
 			WARN_ON_ONCE(ret == -1);
 			if (ret > 0) {
-- 
2.45.1