On 29 May 2024, at 21:04, Johannes Weiner wrote:

> On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
>> Thanks for the config, I was able to reproduce the issue with it.
>>
>> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
>>> Ah, but there DOES seem to be an issue with how we reserve
>>> highatomics: reserving and unreserving happens one pageblock at a
>>> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
>>> request, reserve_highatomic_pageblock() will only convert the first
>>> order-9 block in it; the tail will remain the original type, which
>>> will produce a buddy of mixed-type blocks upon freeing.
>>>
>>> This doesn't fully explain the warning here. We'd expect to see it
>>> the other way round - passing an assumed type of 3 (HIGHATOMIC) for
>>> the remainder that is actually 1 (MOVABLE). But the pageblock-based
>>> reservations look fishy. I'll cook up a patch to make this
>>> range-based. It might just fix it in a way I'm not seeing just yet.
>>
>> tl;dr: With some debugging printks, I was able to see the issue.
>> Should be a straightforward fix.
>>
>> No order-10 allocations are necessary. Instead, smaller allocations
>> grab blocks for the highatomic pool. Unreserving is lazy, so as those
>> allocations get freed, they have a chance to merge together. Two
>> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order).
>> On unreserve, we now have an order-10 highatomic buddy, but only
>> clear the type on the first order-9 pageblock. A subsequent
>> alloc + expand will warn about this type inconsistency.
>>
>> [  411.188518] UNRESERVE: pfn=26000 order=10
>> [  411.188739] ------------[ cut here ]------------
>> [  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
>> [  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
>>
>> I have a draft patch to make the highatomic reservations update all
>> blocks inside the range, not just the first one. I'll send it out as
>> soon as I have tested it properly.
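
The arithmetic in those printks checks out, by the way. Taking the pfns
as hex (the 0x200 offset between them is exactly 512 pages, i.e. one
pageblock), the order-10 buddy spans two pageblocks and the warning
fires on the second one. Spelled out as a standalone illustration, not
part of the patch, using the values from the trace above:

	unsigned long buddy_pfn = 0x26000;	/* UNRESERVE: pfn=26000 order=10 */
	unsigned int pb_order = 9;		/* pageblock_order */
	unsigned long tail_pfn;

	/* An order-10 buddy covers 1 << 10 == 1024 pages == two pageblocks. */
	tail_pfn = buddy_pfn + (1UL << pb_order);

	/*
	 * tail_pfn == 0x26200 and a pageblock is 512 pages, matching the
	 * "26200: page type is 3, passed migratetype is 1 (nr=512)"
	 * warning: only the head pageblock was retyped on unreserve.
	 */
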
> The below fixes the issue for me and survives a few hours of
> xfstests generic/176.
>
> Christoph, can you please confirm it addresses what you saw?
>
> Vlastimil, Zi, Mel, can you please take a look?
>
> Thanks!
>
> ---
>
> From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Christoph reports a page allocator splat triggered by xfstests:
>
> generic/176 214s ...
> [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [] XFS (nvme0n1): Ending clean mount
> [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Ending clean mount
> [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [] XFS (nvme1n1): Ending clean mount
> [] ------------[ cut here ]------------
> [] page type is 3, passed migratetype is 1 (nr=512)
> [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [] RIP: 0010:expand+0x1c5/0x1f0
> [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [] PKRU: 55555554
> [] Call Trace:
> []  <TASK>
> []  ? __warn+0x7b/0x120
> []  ? expand+0x1c5/0x1f0
> []  ? report_bug+0x191/0x1c0
> []  ? handle_bug+0x3c/0x80
> []  ? exc_invalid_op+0x17/0x70
> []  ? asm_exc_invalid_op+0x1a/0x20
> []  ? expand+0x1c5/0x1f0
> []  ? expand+0x1c5/0x1f0
> []  __rmqueue_pcplist+0x3a9/0x730
> []  get_page_from_freelist+0x7a0/0xf00
> []  __alloc_pages_noprof+0x153/0x2e0
> []  __folio_alloc_noprof+0x10/0xa0
> []  __filemap_get_folio+0x16b/0x370
> []  iomap_write_begin+0x496/0x680
>
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
>
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list,
> but only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different
> type.
>
> A similar warning could trigger on an order-10 request that only
> marks the first pageblock as highatomic and leaves the second one
> movable.
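
For readers following along, the underlying constraint is that the
block type lives in a per-pageblock bitmap, so a single
set_pageblock_migratetype() call only retags pageblock_nr_pages worth
of pages. Retyping a buddy larger than a pageblock therefore has to
walk every block it spans, which is what the change_pageblock_range()
helper used by the fix below does; roughly this (a from-memory sketch
of the existing helper in mm/page_alloc.c, not new code):

	static void change_pageblock_range(struct page *pageblock_page,
					   int start_order, int migratetype)
	{
		int nr_pageblocks = 1 << (start_order - pageblock_order);

		/* Retag each pageblock covered by the order-N range. */
		while (nr_pageblocks--) {
			set_pageblock_migratetype(pageblock_page, migratetype);
			pageblock_page += pageblock_nr_pages;
		}
	}
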
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign;
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained; instead they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
>
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
>
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
>
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5675ca..754ca5821266 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>  }
>  
>  /*
> - * Reserve a pageblock for exclusive use of high-order atomic allocations if
> - * there are no empty page blocks that contain a page with a suitable order
> + * Reserve the pageblock(s) surrounding an allocation request for
> + * exclusive use of high-order atomic allocations if there are no
> + * empty page blocks that contain a page with a suitable order
>   */
> -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> +static void reserve_highatomic_pageblock(struct page *page, int order,
> +					 struct zone *zone)
>  {
>  	int mt;
>  	unsigned long max_managed, flags;
> @@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
>  	/* Yoink! */
>  	mt = get_pageblock_migratetype(page);
>  	/* Only reserve normal pageblocks (i.e., they can merge with others) */
> -	if (migratetype_is_mergeable(mt))
> -		if (move_freepages_block(zone, page, mt,
> -					 MIGRATE_HIGHATOMIC) != -1)
> -			zone->nr_reserved_highatomic += pageblock_nr_pages;
> +	if (!migratetype_is_mergeable(mt))
> +		goto out_unlock;
> +
> +	if (order < pageblock_order) {
> +		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
> +			goto out_unlock;
> +		zone->nr_reserved_highatomic += pageblock_nr_pages;
> +	} else {
> +		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);

Don't you also need to move (page, order) to the MIGRATE_HIGHATOMIC
free list here?
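
That is, if the (page, order) buddy were still sitting on the mt free
list at this point, I would expect the reserve side to mirror what the
unreserve side below does; a sketch only, since whether the buddy is
actually free here is exactly the question:

	/* Assumes (page, order) is still on the mt free list. */
	move_to_free_list(page, zone, order, mt, MIGRATE_HIGHATOMIC);

Without that, the pageblocks get tagged highatomic while the buddy
itself stays on its old free list and never shows up under
free_area[order].free_list[MIGRATE_HIGHATOMIC].
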
> +		zone->nr_reserved_highatomic += 1 << order;
> +	}

You could do

	zone->nr_reserved_highatomic += max_t(unsigned long,
					      pageblock_nr_pages,
					      1 << order);

instead, but I have no strong opinion on it.

>  
>  out_unlock:
>  	spin_unlock_irqrestore(&zone->lock, flags);
> @@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
>   * intense memory pressure but failed atomic allocations should be easier
>   * to recover from than an OOM.
>   *
> - * If @force is true, try to unreserve a pageblock even though highatomic
> + * If @force is true, try to unreserve pageblocks even though highatomic
>   * pageblock is exhausted.
>   */
>  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> @@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  			 * adjust the count once.
>  			 */
>  			if (is_migrate_highatomic(mt)) {
> +				unsigned long size;
>  				/*
>  				 * It should never happen but changes to
>  				 * locking could inadvertently allow a per-cpu
>  				 * drain to add pages to MIGRATE_HIGHATOMIC
>  				 * while unreserving so be safe and watch for
>  				 * underflows.
>  				 */
> -				zone->nr_reserved_highatomic -= min(
> -						pageblock_nr_pages,
> -						zone->nr_reserved_highatomic);
> +				size = max(pageblock_nr_pages, 1UL << order);
> +				size = min(size, zone->nr_reserved_highatomic);
> +				zone->nr_reserved_highatomic -= size;
>  			}
>  
>  			/*
> @@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  			 * of pageblocks that cannot be completely freed
>  			 * may increase.
>  			 */
> -			ret = move_freepages_block(zone, page, mt,
> -						   ac->migratetype);
> +			if (order < pageblock_order)
> +				ret = move_freepages_block(zone, page, mt,
> +							   ac->migratetype);
> +			else {
> +				move_to_free_list(page, zone, order, mt,
> +						  ac->migratetype);
> +				change_pageblock_range(page, order,
> +						       ac->migratetype);
> +				ret = 1;
> +			}
>  			/*
> -			 * Reserving this block already succeeded, so this should
> -			 * not fail on zone boundaries.
> +			 * Reserving the block(s) already succeeded,
> +			 * so this should not fail on zone boundaries.
>  			 */
>  			WARN_ON_ONCE(ret == -1);
>  			if (ret > 0) {
> -- 
> 2.45.1

This part looks good to me.
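
One more observation that may matter for backports: with this change
the bookkeeping stays symmetric for multi-block buddies. With the
reported config (pageblock_order == 9, MAX_PAGE_ORDER == 10),
reserving an order-10 buddy adds 1 << 10 == 1024 pages to
nr_reserved_highatomic, and unreserving it subtracts
min(max(512, 1024), nr_reserved_highatomic) == 1024, so the counter
cannot drift as merged buddies cycle in and out of the pool.
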
--
Best Regards,
Yan, Zi