Re: [PATCH v1 1/2] mm/page_alloc: conditionally split > pageblock_order pages in free_one_page() and move_freepages_block_isolate()

David Hildenbrand <david@xxxxxxxxxx> · Mon, 9 Dec 2024 23:10:01 +0100

On 09.12.24 22:42, Zi Yan wrote:
On 9 Dec 2024, at 16:35, David Hildenbrand wrote:

On 09.12.24 20:23, Zi Yan wrote:
On 9 Dec 2024, at 14:01, Vlastimil Babka wrote:

On 12/6/24 10:59, David Hildenbrand wrote:
Let's special-case for the common scenarios that:

(a) We are freeing pages <= pageblock_order
(b) We are freeing a page <= MAX_PAGE_ORDER and all pageblocks match
      (especially, no mixture of isolated and non-isolated pageblocks)

Well in many of those cases we could also just adjust the pageblocks... But
perhaps they indeed shouldn't differ in the first place, unless there's an
isolation attempt.

When we encounter a > MAX_PAGE_ORDER page, it can only come from
alloc_contig_range(), and we can process MAX_PAGE_ORDER chunks.

When we encounter a page with order > pageblock_order and <= MAX_PAGE_ORDER,
check whether all pageblocks match, and if so (common case), don't
split them up just for the buddy to merge them back.

This makes sure that when we free MAX_PAGE_ORDER chunks to the buddy,
for example during system startups, memory onlining, or when isolating
consecutive pageblocks via alloc_contig_range()/memory offlining, we
don't unnecessarily split up what we'll immediately merge again,
because the migratetypes match.
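
As a rough userspace illustration of the decision described above (not the kernel code), the sketch below counts how many free calls a range would need. Every name in it (SK_BLOCK_PAGES, sk_blocks_match(), sk_free_calls(), and so on) is hypothetical and only stands in for pageblock_nr_pages, pfnblock_migratetype_equal() and the free path; the sizes are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; names and sizes are hypothetical, not kernel definitions. */
#define SK_BLOCK_PAGES 4UL	/* stand-in for pageblock_nr_pages */
#define SK_NR_BLOCKS   8UL

enum sk_migratetype { SK_MOVABLE, SK_ISOLATE };

/* One migratetype per pageblock; zero-initialized to SK_MOVABLE. */
static enum sk_migratetype sk_block_mt[SK_NR_BLOCKS];

static enum sk_migratetype sk_get_block_mt(unsigned long pfn)
{
	return sk_block_mt[pfn / SK_BLOCK_PAGES];
}

/* True if every pageblock in [start_pfn, end_pfn) has migratetype mt. */
static bool sk_blocks_match(unsigned long start_pfn, unsigned long end_pfn,
			    enum sk_migratetype mt)
{
	assert(start_pfn % SK_BLOCK_PAGES == 0);
	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn += SK_BLOCK_PAGES)
		if (sk_get_block_mt(pfn) != mt)
			return false;
	return true;
}

/* Returns how many free calls the range [pfn, pfn + nr_pages) would need. */
static unsigned long sk_free_calls(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long end_pfn = pfn + nr_pages;
	enum sk_migratetype mt = sk_get_block_mt(pfn);

	/* Common case: small page, or all remaining pageblocks match the first. */
	if (nr_pages <= SK_BLOCK_PAGES ||
	    sk_blocks_match(pfn + SK_BLOCK_PAGES, end_pfn, mt))
		return 1;

	/* Mixed migratetypes: fall back to one call per pageblock. */
	return nr_pages / SK_BLOCK_PAGES;
}
```

With all blocks of one type, a 16-page range is freed in one call; marking a single block in the middle as isolated makes the same range fall back to four per-pageblock calls.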

Rename split_large_buddy() to __free_one_page_maybe_split(), to make it
clearer what's happening, and handle in it only natural buddy orders,
not the alloc_contig_range(__GFP_COMP) special case: handle that in
free_one_page() only.

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

Acked-by: Vlastimil Babka <vbabka@xxxxxxx>

Hm but noticed something:

+static void __free_one_page_maybe_split(struct zone *zone, struct page *page,
+		unsigned long pfn, int order, fpi_t fpi_flags)
+{
+	const unsigned long end_pfn = pfn + (1 << order);
+	int mt = get_pfnblock_migratetype(page, pfn);
+
+	VM_WARN_ON_ONCE(order > MAX_PAGE_ORDER);
   	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
   	/* Caller removed page from freelist, buddy info cleared! */
   	VM_WARN_ON_ONCE(PageBuddy(page));

-	if (order > pageblock_order)
-		order = pageblock_order;
-
-	while (pfn != end) {
-		int mt = get_pfnblock_migratetype(page, pfn);
+	/*
+	 * With CONFIG_MEMORY_ISOLATION, we might be freeing MAX_ORDER_NR_PAGES
+	 * pages that cover pageblocks with different migratetypes; for example
+	 * only some migratetypes might be MIGRATE_ISOLATE. In that (unlikely)
+	 * case, fallback to freeing individual pageblocks so they get put
+	 * onto the right lists.
+	 */
+	if (!IS_ENABLED(CONFIG_MEMORY_ISOLATION) ||
+	    likely(order <= pageblock_order) ||
+	    pfnblock_migratetype_equal(pfn + pageblock_nr_pages, end_pfn, mt)) {
+		__free_one_page(page, pfn, zone, order, mt, fpi_flags);
+		return;
+	}

-		__free_one_page(page, pfn, zone, order, mt, fpi);
-		pfn += 1 << order;
+	while (pfn != end_pfn) {
+		mt = get_pfnblock_migratetype(page, pfn);
+		__free_one_page(page, pfn, zone, pageblock_order, mt, fpi_flags);
+		pfn += pageblock_nr_pages;
   		page = pfn_to_page(pfn);

This predates your patch, but seems potentially dangerous to attempt
pfn_to_page(end_pfn) with SPARSEMEM and no vmemmap and the end_pfn perhaps
being just outside of the valid range? Should we change that?

But seems this code was initially introduced as part of Johannes'
migratetype hygiene series.

It starts as split_free_page() from commit b2c9e2fbba32 ("mm: make
alloc_contig_range work at pageblock granularity"), but harmless since
it is only used to split a buddy page. Then commit fd919a85cd55 ("mm:
page_isolation: prepare for hygienic freelists") refactored it, which
should be fine, since it is still used for the same purpose in page
isolation. Then commit e98337d11bbd ("mm/contig_alloc: support __GFP_COMP")
used it for gigantic hugetlb.

For SPARSEMEM && !SPARSEMEM_VMEMMAP, PFNs are contiguous, vmemmap might not
be. The code above using pfn in the loop might be fine. And since order
is provided, unless the caller is providing a falsely large order, pfn
should be valid. Or am I missing anything?

I think the question is, what happens when we call pfn_to_page() on a PFN that falls into a memory section that is either offline, doesn't have a memmap, or does not exist.

With CONFIG_SPARSEMEM, we do a

struct mem_section *__sec = __pfn_to_section(__pfn);
__section_mem_map_addr(__sec) + __pfn;

__pfn_to_section() can return NULL, in which case __section_mem_map_addr() would dereference NULL.

I assume it could happen in corner cases, if we'd exceed NR_SECTION_ROOTS. (IOW, large memory, and we free a page that is at the very end of physical memory).
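
To make that corner case concrete, here is a small userspace model of a section-based lookup. It is only a sketch: all toy_* names and sizes are hypothetical, and unlike the kernel (which adds the full PFN to an encoded section memmap address) it indexes the memmap by the offset within the section. The point is just that a PFN past the last section yields no section to look into.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; sizes and names are hypothetical, not the kernel's. */
#define TOY_PAGES_PER_SECTION 8UL
#define TOY_NR_SECTIONS       4UL	/* PFNs 0..31 have a section */

struct toy_page { int dummy; };
struct toy_section { struct toy_page *memmap; /* indexed by offset in section */ };

static struct toy_page toy_pages[TOY_NR_SECTIONS * TOY_PAGES_PER_SECTION];
static struct toy_section toy_sections[TOY_NR_SECTIONS];

static void toy_init(void)
{
	for (unsigned long i = 0; i < TOY_NR_SECTIONS; i++)
		toy_sections[i].memmap = &toy_pages[i * TOY_PAGES_PER_SECTION];
}

/* Stand-in for __pfn_to_section(): NULL when the PFN is past the table. */
static struct toy_section *toy_pfn_to_section(unsigned long pfn)
{
	unsigned long nr = pfn / TOY_PAGES_PER_SECTION;

	if (nr >= TOY_NR_SECTIONS)
		return NULL;
	return &toy_sections[nr];
}

/*
 * Unchecked shape, mirroring the two lines quoted above. The assert marks
 * the spot where a NULL section would otherwise be dereferenced.
 */
static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
	struct toy_section *sec = toy_pfn_to_section(pfn);

	assert(sec != NULL);
	return sec->memmap + (pfn % TOY_PAGES_PER_SECTION);
}

/* Checked variant: returns NULL instead of touching a missing section. */
static struct toy_page *toy_pfn_to_page_checked(unsigned long pfn)
{
	struct toy_section *sec = toy_pfn_to_section(pfn);

	if (!sec || !sec->memmap)
		return NULL;
	return sec->memmap + (pfn % TOY_PAGES_PER_SECTION);
}
```

In this model, PFN 31 is the last one with a section, and PFN 32 (one past the end) has none, which is the situation an end_pfn at the very end of memory would hit.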

Likely, we should do the pfn_to_page() before the __free_one_page() call.
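
A minimal userspace sketch of that reordering, assuming a toy memmap: the model_* names, MODEL_NR_PAGES and MODEL_BLOCK_PAGES are hypothetical stand-ins for pfn_to_page(), __free_one_page() and pageblock_nr_pages, and this is not the actual v2 change. It contrasts a loop that looks up the next page after freeing with one that looks up the page at the top of each iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code: all names below are hypothetical. */
#define MODEL_NR_PAGES    16UL	/* valid PFNs are 0..15 */
#define MODEL_BLOCK_PAGES 4UL	/* stand-in for pageblock_nr_pages */

struct model_page { int freed; };

static struct model_page model_memmap[MODEL_NR_PAGES];
static unsigned long model_lookups_past_end;	/* out-of-range lookups seen */

/* Stand-in for pfn_to_page(): records lookups of PFNs without a memmap. */
static struct model_page *model_pfn_to_page(unsigned long pfn)
{
	if (pfn >= MODEL_NR_PAGES) {
		model_lookups_past_end++;
		return NULL;
	}
	return &model_memmap[pfn];
}

/* Stand-in for __free_one_page(): marks the head page of the chunk. */
static void model_free_chunk(struct model_page *page)
{
	assert(page != NULL);
	page->freed = 1;
}

/* Old shape: look up the next page after freeing, even on the last pass. */
static void model_free_range_lookup_after(unsigned long pfn,
					  unsigned long end_pfn)
{
	struct model_page *page = model_pfn_to_page(pfn);

	while (pfn != end_pfn) {
		model_free_chunk(page);
		pfn += MODEL_BLOCK_PAGES;
		page = model_pfn_to_page(pfn);	/* may be called with end_pfn */
	}
}

/* Suggested shape: look up the page first, inside the loop. */
static void model_free_range_lookup_before(unsigned long pfn,
					   unsigned long end_pfn)
{
	while (pfn != end_pfn) {
		struct model_page *page = model_pfn_to_page(pfn);

		model_free_chunk(page);
		pfn += MODEL_BLOCK_PAGES;
	}
}
```

Freeing PFNs 8..15 with the first variant performs one lookup of PFN 16, which is outside the toy memmap; the second variant never looks up end_pfn.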

Got it. Both you and Vlastimil gave the same corner case issue.
I agree that doing pfn_to_page() before the __free_one_page() could get rid of
the concern.

Thank you both for the review. I'll resend a v2 tomorrow, including a
patch to fix that up first.

--
Cheers,

David / dhildenb