Re: page type is 3, passed migratetype is 1 (nr=512)

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 30 May 2024 07:42:03 -0400

On Wed, May 29, 2024 at 09:04:25PM -0400, Johannes Weiner wrote:
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies

Argh, I dropped the reserve_highatomic_pageblock() caller update when
removing the printks right before sending out. My apologies. Here is
the fixed version:

---

From 6aa9498ee0d7161b0605251116d16b18cd448552 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Wed, 29 May 2024 18:18:12 -0400
Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies

Christoph reports a page allocator splat triggered by xfstests:

generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[] XFS (nvme0n1): Ending clean mount
[] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Ending clean mount
[] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[] XFS (nvme1n1): Ending clean mount
[] ------------[ cut here ]------------
[] page type is 3, passed migratetype is 1 (nr=512)
[] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[] RIP: 0010:expand+0x1c5/0x1f0
[] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[] PKRU: 55555554
[] Call Trace:
[]  <TASK>
[]  ? __warn+0x7b/0x120
[]  ? expand+0x1c5/0x1f0
[]  ? report_bug+0x191/0x1c0
[]  ? handle_bug+0x3c/0x80
[]  ? exc_invalid_op+0x17/0x70
[]  ? asm_exc_invalid_op+0x1a/0x20
[]  ? expand+0x1c5/0x1f0
[]  ? expand+0x1c5/0x1f0
[]  __rmqueue_pcplist+0x3a9/0x730
[]  get_page_from_freelist+0x7a0/0xf00
[]  __alloc_pages_noprof+0x153/0x2e0
[]  __folio_alloc_noprof+0x10/0xa0
[]  __filemap_get_folio+0x16b/0x370
[]  iomap_write_begin+0x496/0x680

While trying to service a movable allocation (page type 1), the page
allocator runs into a two-pageblock buddy on the movable freelist
whose second block is typed as highatomic (page type 3).

This inconsistency is caused by the highatomic reservation system
operating on single pageblocks, while MAX_PAGE_ORDER can be bigger
than pageblock_order. In this configuration, pageblock_order is 9
while MAX_PAGE_ORDER is 10. The test case is observed to make several
adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
marks the surrounding block as highatomic. Upon freeing, the blocks
merge into an order-10 buddy. When the highatomic pool is drained
later on, this order-10 buddy gets moved back to the movable list, but
only the first pageblock is marked movable again. A subsequent
expand() of this buddy warns about the tail being of a different type.
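
To illustrate the mismatch, here is a minimal userspace sketch, not
kernel code. It assumes pageblock_order 9 and MAX_PAGE_ORDER 10 as in
this report, and takes the type values 1 and 3 from the warning above.
The block_type[] array and the helpers blocks_in_buddy() and
retype_first_block_only() are hypothetical stand-ins for the pageblock
bitmap and the old drain behavior:

```c
#include <assert.h>

#define PAGEBLOCK_ORDER 9	/* as in the reported configuration */
#define MAX_PAGE_ORDER  10	/* as in the reported configuration */

enum { MT_MOVABLE = 1, MT_HIGHATOMIC = 3 };	/* values from the warning */

/* Hypothetical stand-in for the per-pageblock migratetype bitmap.
 * Both blocks start out reserved as highatomic. */
static int block_type[2] = { MT_HIGHATOMIC, MT_HIGHATOMIC };

/* Number of pageblocks covered by a buddy of the given order. */
static unsigned long blocks_in_buddy(unsigned int order)
{
	assert(order <= MAX_PAGE_ORDER);
	if (order <= PAGEBLOCK_ORDER)
		return 1;
	return 1UL << (order - PAGEBLOCK_ORDER);
}

/* Models the old drain path: only the first pageblock is retyped,
 * no matter how many pageblocks the buddy spans. */
static void retype_first_block_only(int mt)
{
	block_type[0] = mt;
}
```

In this model an order-10 buddy covers blocks_in_buddy(10) == 2
pageblocks. After retype_first_block_only(MT_MOVABLE), block_type[0]
is movable while block_type[1] is still highatomic, which is the state
the expand() warning complains about.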

This is a long-standing bug that's surfaced by the recent block type
warnings added to the allocator. The consequences seem mostly benign
and only result in odd behavior: the highatomic tail blocks are not
properly drained, instead they end up on the movable list first, then
go back to the highatomic list after an alloc-free cycle.

To fix this, make the highatomic reservation code aware that
allocations/buddies can be larger than a pageblock.
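
As a rough model of the intended behavior, the following userspace
sketch retypes every pageblock covered by a buddy and accounts for the
full size of the reservation. It is not the kernel implementation:
struct model_zone, retype_range(), reserved_size(), model_reserve() and
model_unreserve() are hypothetical names, and the constants assume the
configuration from this report:

```c
#include <assert.h>

#define PAGEBLOCK_ORDER    9
#define MAX_PAGE_ORDER     10
#define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)
#define MAX_BLOCKS         (1UL << (MAX_PAGE_ORDER - PAGEBLOCK_ORDER))

enum { MT_MOVABLE = 1, MT_HIGHATOMIC = 3 };

/* Hypothetical, heavily reduced stand-in for struct zone. */
struct model_zone {
	int block_type[MAX_BLOCKS];	/* migratetype per pageblock */
	unsigned long nr_reserved;	/* reserved highatomic pages */
};

/* Retype every pageblock covered by a buddy of the given order. */
static void retype_range(struct model_zone *z, unsigned int order, int mt)
{
	unsigned long i, n = 1;

	assert(order <= MAX_PAGE_ORDER);
	if (order > PAGEBLOCK_ORDER)
		n = 1UL << (order - PAGEBLOCK_ORDER);
	for (i = 0; i < n; i++)
		z->block_type[i] = mt;
}

/* Pages covered by a reservation: never less than one pageblock. */
static unsigned long reserved_size(unsigned int order)
{
	unsigned long pages = 1UL << order;

	return pages > PAGEBLOCK_NR_PAGES ? pages : PAGEBLOCK_NR_PAGES;
}

static void model_reserve(struct model_zone *z, unsigned int order)
{
	retype_range(z, order, MT_HIGHATOMIC);
	z->nr_reserved += reserved_size(order);
}

static void model_unreserve(struct model_zone *z, unsigned int order)
{
	unsigned long size = reserved_size(order);

	if (size > z->nr_reserved)	/* clamp to avoid underflow */
		size = z->nr_reserved;
	z->nr_reserved -= size;
	retype_range(z, order, MT_MOVABLE);
}
```

With these helpers, reserving and then unreserving an order-10 buddy
leaves both pageblocks movable and the counter back at zero, while an
order-3 request still reserves exactly one pageblock.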

While it's an old quirk, the recently added type consistency warnings
seem to be the most prominent consequence of it. Set the Fixes: tag
accordingly to highlight this backporting dependency.

Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
Reported-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..222299b5c0e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 }
 
 /*
- * Reserve a pageblock for exclusive use of high-order atomic allocations if
- * there are no empty page blocks that contain a page with a suitable order
+ * Reserve the pageblock(s) surrounding an allocation request for
+ * exclusive use of high-order atomic allocations if there are no
+ * empty page blocks that contain a page with a suitable order
  */
-static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
+static void reserve_highatomic_pageblock(struct page *page, int order,
+					 struct zone *zone)
 {
 	int mt;
 	unsigned long max_managed, flags;
@@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
 	/* Only reserve normal pageblocks (i.e., they can merge with others) */
-	if (migratetype_is_mergeable(mt))
-		if (move_freepages_block(zone, page, mt,
-					 MIGRATE_HIGHATOMIC) != -1)
-			zone->nr_reserved_highatomic += pageblock_nr_pages;
+	if (!migratetype_is_mergeable(mt))
+		goto out_unlock;
+
+	if (order < pageblock_order) {
+		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+			goto out_unlock;
+		zone->nr_reserved_highatomic += pageblock_nr_pages;
+	} else {
+		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
+		zone->nr_reserved_highatomic += 1 << order;
+	}
 
 out_unlock:
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
  *
- * If @force is true, try to unreserve a pageblock even though highatomic
+ * If @force is true, try to unreserve pageblocks even though highatomic
  * pageblock is exhausted.
  */
 static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
@@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * adjust the count once.
 			 */
 			if (is_migrate_highatomic(mt)) {
+				unsigned long size;
 				/*
 				 * It should never happen but changes to
 				 * locking could inadvertently allow a per-cpu
@@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 				 * while unreserving so be safe and watch for
 				 * underflows.
 				 */
-				zone->nr_reserved_highatomic -= min(
-						pageblock_nr_pages,
-						zone->nr_reserved_highatomic);
+				size = max(pageblock_nr_pages, 1UL << order);
+				size = min(size, zone->nr_reserved_highatomic);
+				zone->nr_reserved_highatomic -= size;
 			}
 
 			/*
@@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * of pageblocks that cannot be completely freed
 			 * may increase.
 			 */
-			ret = move_freepages_block(zone, page, mt,
-						   ac->migratetype);
+			if (order < pageblock_order)
+				ret = move_freepages_block(zone, page, mt,
+							   ac->migratetype);
+			else {
+				move_to_free_list(page, zone, order, mt,
+						  ac->migratetype);
+				change_pageblock_range(page, order,
+						       ac->migratetype);
+				ret = 1;
+			}
 			/*
-			 * Reserving this block already succeeded, so this should
-			 * not fail on zone boundaries.
+			 * Reserving the block(s) already succeeded,
+			 * so this should not fail on zone boundaries.
 			 */
 			WARN_ON_ONCE(ret == -1);
 			if (ret > 0) {
@@ -3406,7 +3424,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			 * if the pageblock should be reserved for the future
 			 */
 			if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
-				reserve_highatomic_pageblock(page, zone);
+				reserve_highatomic_pageblock(page, order, zone);
 
 			return page;
 		} else {
-- 
2.45.1