[merged mm-stable] mm-khugepaged-dont-carry-huge-page-to-the-next-loop-for-config_numa.patch removed from -mm tree

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Sun, 11 Sep 2022 20:27:54 -0700

The quilt patch titled
     Subject: mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
has been removed from the -mm tree.  Its filename was
     mm-khugepaged-dont-carry-huge-page-to-the-next-loop-for-config_numa.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Yang Shi <shy828301@xxxxxxxxx>
Subject: mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
Date: Wed, 6 Jul 2022 16:59:20 -0700

Patch series "mm: userspace hugepage collapse", v7.

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[5].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	This operation is independent of the system THP sysfs settings,
	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.

	THP allocation may enter direct reclaim and/or compaction.

	When a range spans multiple VMAs, the semantics of the collapse
	over of each VMA is independent from the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	â??successâ?? indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	â??successâ??.

Current Use-Cases
--------------------------------

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system which might impair services from serving at their full rated
	load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevents
	page sharing and demand paging, both of which increase steady state
	memory footprint.  With MADV_COLLAPSE, we get the best of both
	worlds: Peak upfront performance and lower RAM footprints.  Note
	that subsequent support for file-backed memory is required here.

(2)	malloc() implementations that manage memory in hugepage-sized
	chunks, but sometimes subrelease memory back to the system in
	native-sized chunks via MADV_DONTNEED; zapping the pmd.  Later,
	when the memory is hot, the implementation could
	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
	hugepage coverage and dTLB performance.  TCMalloc is such an
	implementation that could benefit from this[6].  A prior study of
	Google internal workloads during evaluation of Temeraire, a
	hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
	all cpu cycles were spent in dTLB stalls, and that increasing
	hugepage coverage by even small amount can help with that[7].

(3)	userfaultfd-based live migration of virtual machines satisfy UFFD
	faults by fetching native-sized pages over the network (to avoid
	latency of transferring an entire hugepage).  However, after guest
	memory has been fully copied to the new host, MADV_COLLAPSE can
	be used to immediately increase guest performance.  Note that
	subsequent support for file/shmem-backed memory is required here.

(4)	HugeTLB high-granularity mapping allows HugeTLB a HugeTLB page to
	be mapped at different levels in the page tables[8].  As it's not
	"transparent" like THP, HugeTLB high-granularity mappings require
	an explicit user API. It is intended that MADV_COLLAPSE be co-opted
	for this use case[9].  Note that subsequent support for HugeTLB
	memory is required here.

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[10].


This patch (of 17):

The khugepaged has optimization to reduce huge page allocation calls for
!CONFIG_NUMA by carrying the allocated but failed to collapse huge page to
the next loop.  CONFIG_NUMA doesn't do so since the next loop may try to
collapse huge page from a different node, so it doesn't make too much
sense to carry it.

But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, so it means huge page may be allocated
even though there is no suitable range for collapsing.  Then the page
would be just freed if khugepaged already made enough progress.  This
could make NUMA=n run have 5 times as much thp_collapse_alloc as NUMA=y
run.  This problem actually makes things worse due to the way more
pointless THP allocations and makes the optimization pointless.

This could be fixed by carrying the huge page across scans, but it will
complicate the code further and the huge page may be carried indefinitely.
But if we take one step back, the optimization itself seems not worth
keeping nowadays since:

  * Not too many users build NUMA=n kernel nowadays even though the kernel is
    actually running on a non-NUMA machine. Some small devices may run NUMA=n
    kernel, but I don't think they actually use THP.
  * Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
    stored on the per-cpu lists"), THP could be cached by pcp.  This actually
    somehow does the job done by the optimization.

Link: https://lkml.kernel.org/r/20220706235936.2197195-1-zokeefe@xxxxxxxxxx
Link: https://lkml.kernel.org/r/20220706235936.2197195-3-zokeefe@xxxxxxxxxx
Signed-off-by: Yang Shi <shy828301@xxxxxxxxx>
Signed-off-by: Zach O'Keefe <zokeefe@xxxxxxxxxx>
Co-developed-by: Peter Xu <peterx@xxxxxxxxxx>
Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Alex Shi <alex.shi@xxxxxxxxxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Arnd Bergmann <arnd@xxxxxxxx>
Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
Cc: Chris Kennelly <ckennelly@xxxxxxxxxx>
Cc: Chris Zankel <chris@xxxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Helge Deller <deller@xxxxxx>
Cc: Ivan Kokshaysky <ink@xxxxxxxxxxxxxxxxxxxx>
Cc: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
Cc: Jens Axboe <axboe@xxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: Matt Turner <mattst88@xxxxxxxxx>
Cc: Max Filippov <jcmvbkbc@xxxxxxxxx>
Cc: Miaohe Lin <linmiaohe@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
Cc: Pavel Begunkov <asml.silence@xxxxxxxxx>
Cc: Rongwei Wang <rongwei.wang@xxxxxxxxxxxxxxxxx>
Cc: SeongJae Park <sj@xxxxxxxxxx>
Cc: Song Liu <songliubraving@xxxxxx>
Cc: Thomas Bogendoerfer <tsbogend@xxxxxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Zi Yan <ziy@xxxxxxxxxx>
Cc: Dan Carpenter <dan.carpenter@xxxxxxxxxx>
Cc: "Souptick Joarder (HPE)" <jrdr.linux@xxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/khugepaged.c |  120 +++++++++-------------------------------------
 1 file changed, 26 insertions(+), 94 deletions(-)

--- a/mm/khugepaged.c~mm-khugepaged-dont-carry-huge-page-to-the-next-loop-for-config_numa
+++ a/mm/khugepaged.c
@@ -796,29 +796,16 @@ static int khugepaged_find_target_node(v
 	last_khugepaged_target_node = target_node;
 	return target_node;
 }
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+#else
+static int khugepaged_find_target_node(void)
 {
-	if (IS_ERR(*hpage)) {
-		if (!*wait)
-			return false;
-
-		*wait = false;
-		*hpage = NULL;
-		khugepaged_alloc_sleep();
-	} else if (*hpage) {
-		put_page(*hpage);
-		*hpage = NULL;
-	}
-
-	return true;
+	return 0;
 }
+#endif
 
 static struct page *
 khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
-
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -830,74 +817,6 @@ khugepaged_alloc_page(struct page **hpag
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
-#else
-static int khugepaged_find_target_node(void)
-{
-	return 0;
-}
-
-static inline struct page *alloc_khugepaged_hugepage(void)
-{
-	struct page *page;
-
-	page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
-			   HPAGE_PMD_ORDER);
-	if (page)
-		prep_transhuge_page(page);
-	return page;
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
-	struct page *hpage;
-
-	do {
-		hpage = alloc_khugepaged_hugepage();
-		if (!hpage) {
-			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-			if (!*wait)
-				return NULL;
-
-			*wait = false;
-			khugepaged_alloc_sleep();
-		} else
-			count_vm_event(THP_COLLAPSE_ALLOC);
-	} while (unlikely(!hpage) && likely(hugepage_flags_enabled()));
-
-	return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
-	/*
-	 * If the hpage allocated earlier was briefly exposed in page cache
-	 * before collapse_file() failed, it is possible that racing lookups
-	 * have not yet completed, and would then be unpleasantly surprised by
-	 * finding the hpage reused for the same mapping at a different offset.
-	 * Just release the previous allocation if there is any danger of that.
-	 */
-	if (*hpage && page_count(*hpage) > 1) {
-		put_page(*hpage);
-		*hpage = NULL;
-	}
-
-	if (!*hpage)
-		*hpage = khugepaged_alloc_hugepage(wait);
-
-	if (unlikely(!*hpage))
-		return false;
-
-	return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
-{
-	VM_BUG_ON(!*hpage);
-
-	return  *hpage;
-}
-#endif
 
 /*
  * If mmap_lock temporarily dropped, revalidate vma
@@ -1150,8 +1069,10 @@ static void collapse_huge_page(struct mm
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage))
+	if (!IS_ERR_OR_NULL(*hpage)) {
 		mem_cgroup_uncharge(page_folio(*hpage));
+		put_page(*hpage);
+	}
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
 }
@@ -1953,8 +1874,10 @@ xa_unlocked:
 	unlock_page(new_page);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage))
+	if (!IS_ERR_OR_NULL(*hpage)) {
 		mem_cgroup_uncharge(page_folio(*hpage));
+		put_page(*hpage);
+	}
 	/* TODO: tracepoints */
 }
 
@@ -2194,10 +2117,7 @@ static void khugepaged_do_scan(void)
 
 	lru_add_drain_all();
 
-	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
-			break;
-
+	while (true) {
 		cond_resched();
 
 		if (unlikely(kthread_should_stop() || try_to_freeze()))
@@ -2213,10 +2133,22 @@ static void khugepaged_do_scan(void)
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
-	}
 
-	if (!IS_ERR_OR_NULL(hpage))
-		put_page(hpage);
+		if (progress >= pages)
+			break;
+
+		if (IS_ERR(hpage)) {
+			/*
+			 * If fail to allocate the first time, try to sleep for
+			 * a while.  When hit again, cancel the scan.
+			 */
+			if (!wait)
+				break;
+			wait = false;
+			hpage = NULL;
+			khugepaged_alloc_sleep();
+		}
+	}
 }
 
 static bool khugepaged_should_wakeup(void)
_

Patches currently in -mm which might be from shy828301@xxxxxxxxx are

mm-gup-fix-the-fast-gup-race-against-thp-collapse.patch
powerpc-64s-radix-dont-need-to-broadcast-ipi-for-radix-pmd-collapse-flush.patch