Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP

David Hildenbrand <david@xxxxxxxxxx> · Wed, 20 Nov 2024 17:13:19 +0100

On 20.11.24 16:55, Zi Yan wrote:
On 19 Nov 2024, at 11:52, David Hildenbrand wrote:

On 19.11.24 17:49, David Hildenbrand wrote:
On 19.11.24 17:41, Zi Yan wrote:
On 19 Nov 2024, at 11:31, David Hildenbrand wrote:

On 19.11.24 17:12, Zi Yan wrote:
On 19 Nov 2024, at 10:29, David Hildenbrand wrote:

+/* Split a multi-block free page into its individual pageblocks. */
+static void split_large_buddy(struct zone *zone, struct page *page,
+			      unsigned long pfn, int order, fpi_t fpi)
+{
+	unsigned long end = pfn + (1 << order);
+
+	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
+	/* Caller removed page from freelist, buddy info cleared! */
+	VM_WARN_ON_ONCE(PageBuddy(page));
+
+	if (order > pageblock_order)
+		order = pageblock_order;
+
+	while (pfn != end) {
+		int mt = get_pfnblock_migratetype(page, pfn);
+
+		__free_one_page(page, pfn, zone, order, mt, fpi);
+		pfn += 1 << order;
+		page = pfn_to_page(pfn);
+	}
+}

Hi,

stumbling over this while digging through the code ....

+
      static void free_one_page(struct zone *zone, struct page *page,
      			  unsigned long pfn, unsigned int order,
      			  fpi_t fpi_flags)
      {
      	unsigned long flags;
-	int migratetype;
       	spin_lock_irqsave(&zone->lock, flags);
-	migratetype = get_pfnblock_migratetype(page, pfn);
-	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);

This change is rather undesired:

via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.

Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
We do not have PMD level mTHP yet. Any other possible source?


What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.

Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?

Probably split_large_buddy() can check the migratetypes of the to-be-freed
page, if order > pageblock_order. If all migratetypes are the same, the page
can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.

Thinking about this: why do we care about the migratetype?

We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)

There are VM_WARN_ONCEs around *_free_list() operations to make sure
page migratetype matches the migratetype of the list it is on. Ignoring
migratetype would trigger these WARNs. Overwriting it would work but
free page migratetype accounting needs to be taken care of.

An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
and gigantic hugetlb folios are freed via free_one_page(), where
split_large_buddy() is used to work around the limitation.

Yes, I saw that change. But that can be easily identified cased by
unlikely(order > MAX_PAGE_ORDER) to handle that rare case in a special way.

   > > For the two memory init cases you mentioned in the other email,
maybe a new
fpi flag to make free_one_page() use __free_one_page() for them,
since migratetypes should be the same across MAX_PAGE_ORDER range there, right?

In the context of alloc_frozen_range()/free_frozen_range() I want to
free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
the freeing path undo some of that effort.

Adding a pointer to that discussion:

https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@xxxxxxxxxxxxxxxxxxxx

Thanks.

So you are thinking about something like this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..3d3341dc1ad1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1254,7 +1254,12 @@ static void free_one_page(struct zone *zone, struct page *page,
  	unsigned long flags;

  	spin_lock_irqsave(&zone->lock, flags);
-	split_large_buddy(zone, page, pfn, order, fpi_flags);
+	if (unlikely(order > MAX_PAGE_ORDER))
+		split_large_buddy(zone, page, pfn, order, fpi_flags);> +	else {
+		int migratetype = get_pfnblock_migratetype(page, pfn);
+		__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
+	}
  	spin_unlock_irqrestore(&zone->lock, flags);

  	__count_vm_events(PGFREE, 1 << order);


Something in that direction, but likely something like:

if (likely(order <= pageblock_order)) {
	int migratetype = get_pfnblock_migratetype(page, pfn);
	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
} else if (order <= MAX_PAGE_ORDER) {
	/* We might have to split when some pageblocks differ. */
	maybe_split_large_buddy(zone, page, pfn, order, fpi_flags);
} else {
	/* The buddy works in max MAX_PAGE_ORDER chunks. */
	for /* each MAX_PAGE_ORDER chunk */
		maybe_split_large_buddy(zone, page, pfn, MAX_PAGE_ORDER, fpi_flags);
		pfn += MAX_ORDER_NR_PAGES;
		page = ...
	}	
}

Whereby maybe_split_large_buddy would check if it has to do anything with the
pageblocks, or whether it can just call __free_one_page(order). It would only
accept order <=

In the future with free_frozen_pages(), the last else case would vanish.


Is it possible to have a MAX_PAGE_ORDER hugetlb folio? If no, we are good.

Hm, I wouldn't rule it out. To be future proof, we'd likely should add a
way to handle it. Shouldn't be too hard and expensive: simply read all
pageblock migratetypes.

I wonder if there are configs (no page isolation?) where we don't have to
check anything.

If yes, alloc_contig_range() can change the migratetype of one of half that folio
and during migration phase, that folio will be freed via __free_one_page()
and causing migratetype mismatch.

Can you remind me how that happens?

Likely, we isolate a single pageblock, and shortly after we free a larger
page that covers multiple pageblocks? So this could affect other
alloc_contig_range() users as well, I assume?

Thanks!

--
Cheers,

David / dhildenb