On 29 May 2024, at 21:04, Johannes Weiner wrote:

> On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
>> Thanks for the config, I was able to reproduce the issue with it.
>>
>> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
>>> Ah, but there DOES seem to be an issue with how we reserve
>>> highatomics: reserving and unreserving happens one pageblock at a
>>> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
>>> request, reserve_highatomic_pageblock() will only convert the first
>>> order-9 block in it; the tail will remain the original type, which
>>> will produce a buddy of mixed-type blocks upon freeing.
>>>
>>> This doesn't fully explain the warning here. We'd expect to see it
>>> the other way round - passing an assumed type of 3 (HIGHATOMIC) for
>>> the remainder that is actually 1 (MOVABLE). But the pageblock-based
>>> reservations look fishy. I'll cook up a patch to make this
>>> range-based. It might just fix it in a way I'm not seeing just yet.
>>
>> tl;dr: With some debugging printks, I was able to see the issue.
>> Should be a straightforward fix.
>>
>> No order-10 allocations are necessary. Instead, smaller allocations
>> grab blocks for the highatomic pool. Unreserving is lazy, so as those
>> allocations get freed, they have a chance to merge together. Two
>> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order).
>> On unreserve, we now have an order-10 highatomic buddy, but only
>> clear the type on the first order-9 pageblock. A subsequent
>> alloc + expand will warn about this type inconsistency.
>>
>> [  411.188518] UNRESERVE: pfn=26000 order=10
>> [  411.188739] ------------[ cut here ]------------
>> [  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
>> [  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
>>
>> I have a draft patch to make the highatomic reservations update all
>> blocks inside the range, not just the first one. I'll send it out as
>> soon as I have tested it properly.
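
The arithmetic in those printks checks out, by the way. Taking the pfns
as hex (the 0x200 offset between them is exactly 512 pages, i.e. one
pageblock), the order-10 buddy spans two pageblocks and the warning
fires on the second one. Spelled out as a standalone illustration, not
part of the patch, using the values from the trace above:

	unsigned long buddy_pfn = 0x26000;	/* UNRESERVE: pfn=26000 order=10 */
	unsigned int pb_order = 9;		/* pageblock_order */
	unsigned long tail_pfn;

	/* An order-10 buddy covers 1 << 10 == 1024 pages == two pageblocks. */
	tail_pfn = buddy_pfn + (1UL << pb_order);

	/*
	 * tail_pfn == 0x26200 and a pageblock is 512 pages, matching the
	 * "26200: page type is 3, passed migratetype is 1 (nr=512)"
	 * warning: only the head pageblock was retyped on unreserve.
	 */
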
> The below fixes the issue for me and survives a few hours of
> xfstests generic/176.
>
> Christoph, can you please confirm it addresses what you saw?
>
> Vlastimil, Zi, Mel, can you please take a look?
>
> Thanks!
>
> ---
>
> From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Christoph reports a page allocator splat triggered by xfstests:
>
> generic/176 214s ...
> [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [] XFS (nvme0n1): Ending clean mount
> [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Ending clean mount
> [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [] XFS (nvme1n1): Ending clean mount
> [] ------------[ cut here ]------------
> [] page type is 3, passed migratetype is 1 (nr=512)
> [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [] RIP: 0010:expand+0x1c5/0x1f0
> [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [] PKRU: 55555554
> [] Call Trace:
> []  <TASK>
> []  ? __warn+0x7b/0x120
> []  ? expand+0x1c5/0x1f0
> []  ? report_bug+0x191/0x1c0
> []  ? handle_bug+0x3c/0x80
> []  ? exc_invalid_op+0x17/0x70
> []  ? asm_exc_invalid_op+0x1a/0x20
> []  ? expand+0x1c5/0x1f0
> []  ? expand+0x1c5/0x1f0
> []  __rmqueue_pcplist+0x3a9/0x730
> []  get_page_from_freelist+0x7a0/0xf00
> []  __alloc_pages_noprof+0x153/0x2e0
> []  __folio_alloc_noprof+0x10/0xa0
> []  __filemap_get_folio+0x16b/0x370
> []  iomap_write_begin+0x496/0x680
>
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
>
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list,
> but only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different
> type.
>
> A similar warning could trigger on an order-10 request that only
> marks the first pageblock as highatomic and leaves the second one
> movable.
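
For readers following along, the underlying constraint is that the
block type lives in a per-pageblock bitmap, so a single
set_pageblock_migratetype() call only retags pageblock_nr_pages worth
of pages. Retyping a buddy larger than a pageblock therefore has to
walk every block it spans, which is what the change_pageblock_range()
helper used by the fix below does; roughly this (a from-memory sketch
of the existing helper in mm/page_alloc.c, not new code):

	static void change_pageblock_range(struct page *pageblock_page,
					   int start_order, int migratetype)
	{
		int nr_pageblocks = 1 << (start_order - pageblock_order);

		/* Retag each pageblock covered by the order-N range. */
		while (nr_pageblocks--) {
			set_pageblock_migratetype(pageblock_page, migratetype);
			pageblock_page += pageblock_nr_pages;
		}
	}
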
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign;
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained; instead they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
>
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
>
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
>
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5675ca..754ca5821266 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>  }
>  
>  /*
> - * Reserve a pageblock for exclusive use of high-order atomic allocations if
> - * there are no empty page blocks that contain a page with a suitable order
> + * Reserve the pageblock(s) surrounding an allocation request for
> + * exclusive use of high-order atomic allocations if there are no
> + * empty page blocks that contain a page with a suitable order
>   */
> -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> +static void reserve_highatomic_pageblock(struct page *page, int order,
> +					 struct zone *zone)
>  {
>  	int mt;
>  	unsigned long max_managed, flags;
> @@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
>  	/* Yoink! */
>  	mt = get_pageblock_migratetype(page);
>  	/* Only reserve normal pageblocks (i.e., they can merge with others) */
> -	if (migratetype_is_mergeable(mt))
> -		if (move_freepages_block(zone, page, mt,
> -					 MIGRATE_HIGHATOMIC) != -1)
> -			zone->nr_reserved_highatomic += pageblock_nr_pages;
> +	if (!migratetype_is_mergeable(mt))
> +		goto out_unlock;
> +
> +	if (order < pageblock_order) {
> +		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
> +			goto out_unlock;
> +		zone->nr_reserved_highatomic += pageblock_nr_pages;
> +	} else {
> +		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);

Don't you also need to move (page, order) to the MIGRATE_HIGHATOMIC
free list here?
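
That is, if the (page, order) buddy were still sitting on the mt free
list at this point, I would expect the reserve side to mirror what the
unreserve side below does; a sketch only, since whether the buddy is
actually free here is exactly the question:

	/* Assumes (page, order) is still on the mt free list. */
	move_to_free_list(page, zone, order, mt, MIGRATE_HIGHATOMIC);

Without that, the pageblocks get tagged highatomic while the buddy
itself stays on its old free list and never shows up under
free_area[order].free_list[MIGRATE_HIGHATOMIC].
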
> +		zone->nr_reserved_highatomic += 1 << order;
> +	}

You could do

	zone->nr_reserved_highatomic += max_t(unsigned long,
					      pageblock_nr_pages,
					      1 << order);

instead, but I have no strong opinion on it.

>  
>  out_unlock:
>  	spin_unlock_irqrestore(&zone->lock, flags);
> @@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
>   * intense memory pressure but failed atomic allocations should be easier
>   * to recover from than an OOM.
>   *
> - * If @force is true, try to unreserve a pageblock even though highatomic
> + * If @force is true, try to unreserve pageblocks even though highatomic
>   * pageblock is exhausted.
>   */
>  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> @@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  			 * adjust the count once.
>  			 */
>  			if (is_migrate_highatomic(mt)) {
> +				unsigned long size;
>  				/*
>  				 * It should never happen but changes to
>  				 * locking could inadvertently allow a per-cpu
>  				 * drain to add pages to MIGRATE_HIGHATOMIC
>  				 * while unreserving so be safe and watch for
>  				 * underflows.
>  				 */
> -				zone->nr_reserved_highatomic -= min(
> -						pageblock_nr_pages,
> -						zone->nr_reserved_highatomic);
> +				size = max(pageblock_nr_pages, 1UL << order);
> +				size = min(size, zone->nr_reserved_highatomic);
> +				zone->nr_reserved_highatomic -= size;
>  			}
>  
>  			/*
> @@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  			 * of pageblocks that cannot be completely freed
>  			 * may increase.
>  			 */
> -			ret = move_freepages_block(zone, page, mt,
> -						   ac->migratetype);
> +			if (order < pageblock_order)
> +				ret = move_freepages_block(zone, page, mt,
> +							   ac->migratetype);
> +			else {
> +				move_to_free_list(page, zone, order, mt,
> +						  ac->migratetype);
> +				change_pageblock_range(page, order,
> +						       ac->migratetype);
> +				ret = 1;
> +			}
>  			/*
> -			 * Reserving this block already succeeded, so this should
> -			 * not fail on zone boundaries.
> +			 * Reserving the block(s) already succeeded,
> +			 * so this should not fail on zone boundaries.
>  			 */
>  			WARN_ON_ONCE(ret == -1);
>  			if (ret > 0) {
> -- 
> 2.45.1

This part looks good to me.
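
One more observation that may matter for backports: with this change
the bookkeeping stays symmetric for multi-block buddies. With the
reported config (pageblock_order == 9, MAX_PAGE_ORDER == 10),
reserving an order-10 buddy adds 1 << 10 == 1024 pages to
nr_reserved_highatomic, and unreserving it subtracts
min(max(512, 1024), nr_reserved_highatomic) == 1024, so the counter
cannot drift as merged buddies cycle in and out of the pool.
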
--
Best Regards,
Yan, Zi