On 2025/1/4 16:58, yangge1116@xxxxxxx wrote:
From: yangge <yangge1116@xxxxxxx>

There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.

During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long term GUP cannot allocate memory from CMA area, so a maximum of
16 GB of no-CMA memory on a NUMA node can be used as virtual machine
memory. There is 16GB of free CMA memory on a NUMA node, which is
sufficient to pass the order-0 watermark check, causing the
__compaction_suitable() function to consistently return true.
However, if there aren't enough migratable pages available, performing
memory compaction is also meaningless. Besides checking whether
the order-0 watermark is met, __compaction_suitable() also needs
to determine whether there are sufficient migratable pages available
for memory compaction.

For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here

When the 16G of non-CMA memory on a single node is exhausted, we will
fallback to allocating memory on other nodes. In order to quickly
fallback to remote nodes, we should skip memory compaction when
migratable pages are insufficient. After this fix, it only takes a
few tens of seconds to start a 32GB virtual machine with device
passthrough functionality.

Signed-off-by: yangge <yangge1116@xxxxxxx>
---
 mm/compaction.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..1c469b3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2383,7 +2383,26 @@ static bool __compaction_suitable(struct zone *zone, int order,
 				  int highest_zoneidx,
 				  unsigned long wmark_target)
 {
+	pg_data_t *pgdat = zone->zone_pgdat;
+	unsigned long sum, nr_pinned;
 	unsigned long watermark;
+
+	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
+		node_page_state(pgdat, NR_INACTIVE_ANON) +
+		node_page_state(pgdat, NR_ACTIVE_FILE) +
+		node_page_state(pgdat, NR_ACTIVE_ANON);
+
+	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
+		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
+
+	/*
+	 * Gup-pinned pages are non-migratable. After subtracting these pages,
+	 * we need to check if the remaining pages are sufficient for memory
+	 * compaction.
+	 */
+	if ((sum - nr_pinned) < (1 << order))
+		return false;
+

IMO, using the node's statistics to determine whether the zone is suitable for compaction doesn't make sense. It is possible that even though the normal zone has long-term pinned pages, the movable zone can still be suitable for compaction.