Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages

On 2025/1/14 19:21, Vlastimil Babka wrote:
> On 1/8/25 12:30, yangge1116@xxxxxxx wrote:
>> From: yangge <yangge1116@xxxxxxx>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long-term GUP cannot allocate memory from the CMA area, so at most
>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>> However, if there aren't enough migratable pages available, performing
>> memory compaction is also meaningless. Besides checking whether
>> the order-0 watermark is met, __compaction_suitable() also needs
>> to determine whether there are sufficient migratable pages available
>> for memory compaction.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16GB of non-CMA memory on a single node is exhausted, we will
>> fall back to allocating memory on other nodes. To fall back to remote
>> nodes quickly, we should skip memory compaction when migratable pages
>> are insufficient. After this fix, it takes only a few tens of seconds
>> to start a 32GB virtual machine with device passthrough.

>> Signed-off-by: yangge <yangge1116@xxxxxxx>
>> ---
>>
>> V3:
>> - fix build error
>>
>> V2:
>> - consider unevictable folios
>>
>>   mm/compaction.c | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..a9f1261 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>   				  int highest_zoneidx,
>>   				  unsigned long wmark_target)
>>   {
>> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>> +	unsigned long sum, nr_pinned;
>>   	unsigned long watermark;
>> +
>> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
>> +		node_page_state(pgdat, NR_UNEVICTABLE);

> In addition to what Johannes pointed out, these are whole-node numbers and
> compaction works on a zone level.
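
As a side note on that point, zone-level LRU counters do exist and could be
read via zone_page_state() — a hypothetical zone-level sum, assuming the
NR_ZONE_* items from enum zone_stat_item:

	unsigned long zone_lru = zone_page_state(zone, NR_ZONE_INACTIVE_FILE) +
				 zone_page_state(zone, NR_ZONE_INACTIVE_ANON) +
				 zone_page_state(zone, NR_ZONE_ACTIVE_FILE) +
				 zone_page_state(zone, NR_ZONE_ACTIVE_ANON) +
				 zone_page_state(zone, NR_ZONE_UNEVICTABLE);

AFAIK the FOLL_PIN counters have no zone-level counterpart, though.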

>> +
>> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);

> Statistics of *events* used to derive current *state*... I don't think we do
> that anywhere else? I'm not sure we guarantee that vmstat events are never
> missed, as they are only for statistics. IIUC we tolerate some rare races to
> keep the synchronization less expensive?
>
> But anyway let's try looking for a different solution.
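
As an illustration of that concern, a minimal sketch (not proposed code;
node_page_state() reads node totals that per-CPU deltas are only
periodically folded into):

long pinned_estimate(pg_data_t *pgdat)
{
	long acquired = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED);
	long released = node_page_state(pgdat, NR_FOLL_PIN_RELEASED);

	/*
	 * The two reads are not synchronized against concurrent pin/unpin
	 * activity, and per-CPU deltas may not be folded in yet, so the
	 * result can be stale or even transiently negative.
	 */
	return acquired - released;
}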

> Assuming this is a THP allocation attempt (__GFP_THISNODE even?) and we are
> in the "For costly allocations, try direct compaction first" part of
> __alloc_pages_slowpath() right?

Yes, Transparent Huge Pages are allocated with the __GFP_THISNODE flag, and
memory is being allocated using the following strategy:

static struct page *alloc_pages_mpol()
{
	/* 1. Try to allocate the THP only on the local node. */
	page = __alloc_frozen_pages_noprof(__GFP_THISNODE, ...);

	if (page || !(gfp & __GFP_DIRECT_RECLAIM))
		return page;

	/* 2. Fall back to remote NUMA nodes. */
	page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
}

> Hopefully also when done from the pin_user_pages_remote(..., FOLL_LONGTERM,
> ...) context the allocation gfp_mask correctly lacks __GFP_MOVABLE? I guess
> it has to, otherwise it would allocate from the CMA pageblocks.

Yes.

> Then I wonder if we could use the real allocation context to determine
> watermarks, as __compaction_suitable() is passing ALLOC_CMA instead because
> it's checking only for migration targets, which have to be CMA compatible by
> definition. But we could use the real unmovable allocation context to have
> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass
> the order-0 check anymore once the non-CMA part is exhausted.
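
For reference, the CMA handling in __zone_watermark_unusable_free() in
mm/page_alloc.c looks roughly like this (a paraphrased sketch; the
highatomic-reserve handling is elided):

static inline long __zone_watermark_unusable_free(struct zone *z,
				unsigned int order, unsigned int alloc_flags)
{
	long unusable_free = (1 << order) - 1;

	/* ... highatomic reserves handling elided ... */
#ifdef CONFIG_CMA
	/* Without ALLOC_CMA, free CMA pages cannot serve this allocation. */
	if (!(alloc_flags & ALLOC_CMA))
		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
#endif
	return unusable_free;
}

So an order-0 watermark check done with the caller's real alloc_flags
(without ALLOC_CMA) would indeed start failing once only CMA free pages
remain.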

> There's some risk that in some different scenario the compaction could in
> fact migrate pages from the exhausted non-CMA part of the zone to the CMA
> part and succeed, and we'll skip it instead. But that should be rare?

> Anyway given that concern I'm not sure about changing
> __compaction_suitable() for every caller like this. We could (at least
> initially) target this heuristic only for COMPACT_PRIO_ASYNC which is being
> used for this THP opportunistic attempt.

> So for example:
> - add a new bool flag to compact_control that is true for COMPACT_PRIO_ASYNC
> - pass cc pointer to compaction_suit_allocation_order()
> - in that function, add another check if the new cc flag is true,
> between the current zone_watermark_ok() and compaction_suitable() checks,
> which works like __compaction_suitable() but uses alloc_flags (which should
> not be ALLOC_CMA in our pinned allocation case) instead of ALLOC_CMA, return
> COMPACT_SKIPPED if it fails.

Below is the previous discussion:
https://lore.kernel.org/lkml/1734436004-1212-1-git-send-email-yangge1116@xxxxxxx/

I will send a new version of the patch based on the suggestions here. Thank you.
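
If I understand the suggestion correctly, the check would look roughly like
the sketch below (untested; the compact_control flag name is provisional and
the cc parameter is the one suggested above):

static enum compact_result
compaction_suit_allocation_order(struct zone *zone, unsigned int order,
				 int highest_zoneidx, unsigned int alloc_flags,
				 struct compact_control *cc)
{
	unsigned long watermark;

	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
			      alloc_flags))
		return COMPACT_SUCCESS;

	/*
	 * Like __compaction_suitable(), but with the caller's real
	 * alloc_flags instead of ALLOC_CMA: for the pinned (unmovable)
	 * allocation ALLOC_CMA is not set, so free CMA pages do not count
	 * and compaction is skipped once the non-CMA part of the zone is
	 * exhausted.
	 */
	if (cc->check_alloc_flags &&	/* provisional: set for COMPACT_PRIO_ASYNC */
	    !zone_watermark_ok(zone, 0,
			       low_wmark_pages(zone) + compact_gap(order),
			       highest_zoneidx, alloc_flags))
		return COMPACT_SKIPPED;

	if (!compaction_suitable(zone, order, highest_zoneidx))
		return COMPACT_SKIPPED;

	return COMPACT_CONTINUE;
}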
>> +	/*
>> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
>> +	 * we need to check if the remaining pages are sufficient for memory
>> +	 * compaction.
>> +	 */
>> +	if ((sum - nr_pinned) < (1 << order))
>> +		return false;
>> +
>>   	/*
>>   	 * Watermarks for order-0 must be met for compaction to be able to
>>   	 * isolate free pages for migration targets. This means that the




