Re: page type is 3, passed migratetype is 1 (nr=512)

On Wed, May 29, 2024 at 09:51:24PM -0400, Zi Yan wrote:
> On 29 May 2024, at 21:04, Johannes Weiner wrote:
> 
> > On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
> >> Thanks for the config, I was able to reproduce the issue with it.
> >>
> >> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> >>> Ah, but there DOES seem to be an issue with how we reserve
> >>> highatomics: reserving and unreserving happens one pageblock at a
> >>> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> >>> request, reserve_highatomic_pageblock() will only convert the first
> >>> order-9 block in it; the tail will remain the original type, which
> >>> will produce a buddy of mixed type blocks upon freeing.
> >>>
> >>> This doesn't fully explain the warning here. We'd expect to see it the
> >>> other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> >>> remainder that is actually 1 (MOVABLE). But the pageblock-based
> >>> reservations look fishy. I'll cook up a patch to make this
> >>> range-based. It might just fix it in a way I'm not seeing just yet.
> >>
> >> tl;dr: With some debugging printks, I was able to see the
> >> issue. Should be a straightforward fix.
> >>
> >> No order-10 allocations are necessary. Instead, smaller allocations
> >> grab blocks for the highatomic pool. Unreserving is lazy, so as those
> >> allocations get freed, they have a chance to merge together. Two
> >> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
> >> unreserve, we now have an order-10 highatomic buddy, but only clear
> >> the type on the first order-9 pageblock. A subsequent alloc + expand
> >> will warn about this type inconsistency.
> >>
> >> [  411.188518] UNRESERVE: pfn=26000 order=10
> >> [  411.188739] ------------[ cut here ]------------
> >> [  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
> >> [  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
> >>
> >> I have a draft patch to make the highatomic reservations update all
> >> blocks inside the range, not just the first one. I'll send it out as
> >> soon as I have tested it properly.
> >
> > The below fixes the issue and survives a few hours of
> > xfstests generic/176 for me.
> >
> > Christoph, can you please confirm it addresses what you saw?
> >
> > Vlastimil, Zi, Mel, can you please take a look?
> >
> > Thanks!
> >
> > ---
> >
> > From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Date: Wed, 29 May 2024 18:18:12 -0400
> > Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
> >
> > Christoph reports a page allocator splat triggered by xfstests:
> >
> > generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> > [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> > [] XFS (nvme0n1): Ending clean mount
> > [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> > [] XFS (nvme1n1): Ending clean mount
> > [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> > [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> > [] XFS (nvme1n1): Ending clean mount
> > [] ------------[ cut here ]------------
> > [] page type is 3, passed migratetype is 1 (nr=512)
> > [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> > [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> > [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> > [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > [] RIP: 0010:expand+0x1c5/0x1f0
> > [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> > [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> > [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> > [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> > [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> > [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> > [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> > [] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> > [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> > [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> > [] PKRU: 55555554
> > [] Call Trace:
> > []  <TASK>
> > []  ? __warn+0x7b/0x120
> > []  ? expand+0x1c5/0x1f0
> > []  ? report_bug+0x191/0x1c0
> > []  ? handle_bug+0x3c/0x80
> > []  ? exc_invalid_op+0x17/0x70
> > []  ? asm_exc_invalid_op+0x1a/0x20
> > []  ? expand+0x1c5/0x1f0
> > []  ? expand+0x1c5/0x1f0
> > []  __rmqueue_pcplist+0x3a9/0x730
> > []  get_page_from_freelist+0x7a0/0xf00
> > []  __alloc_pages_noprof+0x153/0x2e0
> > []  __folio_alloc_noprof+0x10/0xa0
> > []  __filemap_get_folio+0x16b/0x370
> > []  iomap_write_begin+0x496/0x680
> >
> > While trying to service a movable allocation (page type 1), the page
> > allocator runs into a two-pageblock buddy on the movable freelist
> > whose second block is typed as highatomic (page type 3).
> >
> > This inconsistency is caused by the highatomic reservation system
> > operating on single pageblocks, while MAX_ORDER can be bigger than
> > that - in this configuration, pageblock_order is 9 while
> > MAX_PAGE_ORDER is 10. The test case is observed to make several
> > adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> > marks the surrounding block as highatomic. Upon freeing, the blocks
> > merge into an order-10 buddy. When the highatomic pool is drained
> > later on, this order-10 buddy gets moved back to the movable list, but
> > only the first pageblock is marked movable again. A subsequent
> > expand() of this buddy warns about the tail being of a different type.
> >
> > A similar warning could trigger on an order-10 request that only marks
> > the first pageblock as highatomic and leaves the second one movable.
> >
> > This is a long-standing bug that was surfaced by the recent block
> > type warnings added to the allocator. The consequences seem mostly
> > benign; it just results in odd behavior: the highatomic tail blocks
> > are not properly drained. Instead, they end up on the movable list
> > first, then go back to the highatomic list after an alloc-free cycle.
> >
> > To fix this, make the highatomic reservation code aware that
> > allocations/buddies can be larger than a pageblock.
> >
> > While it's an old quirk, the recently added type consistency warnings
> > seem to be the most prominent consequence of it. Set the Fixes: tag
> > accordingly to highlight this backporting dependency.
> >
> > Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> > Reported-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > ---
> >  mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 33 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e22ce5675ca..754ca5821266 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> >  }
> >
> >  /*
> > - * Reserve a pageblock for exclusive use of high-order atomic allocations if
> > - * there are no empty page blocks that contain a page with a suitable order
> > + * Reserve the pageblock(s) surrounding an allocation request for
> > + * exclusive use of high-order atomic allocations if there are no
> > + * empty page blocks that contain a page with a suitable order
> >   */
> > -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> > +static void reserve_highatomic_pageblock(struct page *page, int order,
> > +					 struct zone *zone)
> >  {
> >  	int mt;
> >  	unsigned long max_managed, flags;
> > @@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> >  	/* Yoink! */
> >  	mt = get_pageblock_migratetype(page);
> >  	/* Only reserve normal pageblocks (i.e., they can merge with others) */
> > -	if (migratetype_is_mergeable(mt))
> > -		if (move_freepages_block(zone, page, mt,
> > -					 MIGRATE_HIGHATOMIC) != -1)
> > -			zone->nr_reserved_highatomic += pageblock_nr_pages;
> > +	if (!migratetype_is_mergeable(mt))
> > +		goto out_unlock;
> > +
> > +	if (order < pageblock_order) {
> > +		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
> > +			goto out_unlock;
> > +		zone->nr_reserved_highatomic += pageblock_nr_pages;
> > +	} else {
> > +		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
> 
> Don't you also need to move (page, order) to MIGRATE_HIGHATOMIC free list here?

Reserving happens after rmqueue(), so the page is already removed from
the freelist. If the page is smaller than a pageblock, there might be
other buddies in the block that need to be moved, but if the page is
one or more whole pageblocks, there are no other buddies in the range.
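
For reference, the ordering at the call site looks roughly like this
(simplified from get_page_from_freelist(); details elided):

	page = rmqueue(preferred_zone, zone, order, gfp_mask,
		       alloc_flags, ac->migratetype);
	if (page) {
		prep_new_page(page, order, gfp_mask, alloc_flags);
		/*
		 * @page is already off the freelist here. Reserving
		 * only needs to retype the surrounding pageblock(s)
		 * and, for sub-pageblock requests, move the buddies
		 * remaining in the block.
		 */
		if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
			reserve_highatomic_pageblock(page, order, zone);
	}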

> > +		zone->nr_reserved_highatomic += 1 << order;
> > +	}
> 
> You could do:
> 
> zone->nr_reserved_highatomic += max_t(unsigned long, pageblock_nr_pages, 1 << order);
> 
> instead. But I have no strong opinion on it.

Hm, I played around with it, but since we have the if-else already, I
figure it's a bit more readable to split the updates than to have a
more complicated max() conditional before/after.

I would prefer it in the unreserve_highatomic_pageblock() path as
well, but it still has that "migratetype change is racy" check, which
makes that impractical. That check seems no longer needed, though, now
that we take zone->lock for every type test/set, so I was planning to
send a follow-up to remove that.
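
With the clamp gone, the accounting there could then mirror the
reserve side. A hypothetical sketch on top of this patch (assuming
type changes are indeed fully serialized by zone->lock by then):

	if (is_migrate_highatomic(mt)) {
		/*
		 * Per-cpu drains can no longer retype the block
		 * under us, so the reservation covers the full
		 * buddy and no underflow watch is needed.
		 */
		zone->nr_reserved_highatomic -=
				max(pageblock_nr_pages, 1UL << order);
	}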

> >  out_unlock:
> >  	spin_unlock_irqrestore(&zone->lock, flags);
> > @@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> >   * intense memory pressure but failed atomic allocations should be easier
> >   * to recover from than an OOM.
> >   *
> > - * If @force is true, try to unreserve a pageblock even though highatomic
> > + * If @force is true, try to unreserve pageblocks even though highatomic
> >   * pageblock is exhausted.
> >   */
> >  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> > @@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> >  			 * adjust the count once.
> >  			 */
> >  			if (is_migrate_highatomic(mt)) {
> > +				unsigned long size;
> >  				/*
> >  				 * It should never happen but changes to
> >  				 * locking could inadvertently allow a per-cpu
> > @@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> >  				 * while unreserving so be safe and watch for
> >  				 * underflows.
> >  				 */
> > -				zone->nr_reserved_highatomic -= min(
> > -						pageblock_nr_pages,
> > -						zone->nr_reserved_highatomic);
> > +				size = max(pageblock_nr_pages, 1UL << order);
> > +				size = min(size, zone->nr_reserved_highatomic);
> > +				zone->nr_reserved_highatomic -= size;
> >  			}
> >
> >  			/*
> > @@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> >  			 * of pageblocks that cannot be completely freed
> >  			 * may increase.
> >  			 */
> > -			ret = move_freepages_block(zone, page, mt,
> > -						   ac->migratetype);
> > +			if (order < pageblock_order)
> > +				ret = move_freepages_block(zone, page, mt,
> > +							   ac->migratetype);
> > +			else {
> > +				move_to_free_list(page, zone, order, mt,
> > +						  ac->migratetype);
> > +				change_pageblock_range(page, order,
> > +						       ac->migratetype);
> > +				ret = 1;
> > +			}
> >  			/*
> > -			 * Reserving this block already succeeded, so this should
> > -			 * not fail on zone boundaries.
> > +			 * Reserving the block(s) already succeeded,
> > +			 * so this should not fail on zone boundaries.
> >  			 */
> >  			WARN_ON_ONCE(ret == -1);
> >  			if (ret > 0) {
> > -- 
> > 2.45.1
> 
> This part looks good to me.

Thanks for taking a look!



