Hello Yangge,

On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@xxxxxxx wrote:
> From: yangge <yangge1116@xxxxxxx>
>
> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
> in __compaction_suitable()") allow compaction to proceed when free
> pages required for compaction reside in the CMA pageblocks, it's
> possible that __compaction_suitable() always returns true, and in
> some cases, it's not acceptable.
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
>
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum
> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
> machine memory. Since there is 16G of free CMA memory on the NUMA
> node, watermark for order-0 always be met for compaction, so
> __compaction_suitable() always returns true, even if the node is
> unable to allocate non-CMA memory for the virtual machine.
>
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> Other unmovable alloctions, like dma_buf, which can be large in a
> Linux system, are also unable to allocate memory from CMA, and these
> allocations suffer from the same problems described above. In order
> to quickly fall back to remote node, we should remove ALLOC_CMA both
> in __compaction_suitable() and __isolate_free_page() for unmovable
> alloctions. After this fix, starting a 32GB virtual machine with
> device passthrough takes only a few seconds.

The symptom is obviously bad, but I don't understand this fix.

The reason we do ALLOC_CMA is that, even for unmovable allocations,
you can create space in non-CMA space by moving migratable pages over
to CMA space. This is not a property we want to lose. But I also don't
see how it would interfere with your scenario.

There is the compaction_suitable() check in should_compact_retry(),
but that only applies when COMPACT_SKIPPED. IOW, it should only happen
when compaction_suitable() just now returned false. IOW, a race
condition. Which is why it's also not subject to limited retries.

What's the exact condition that traps the allocator inside the loop?
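
To make the two positions above concrete, here is a toy userspace model,
not kernel code. Every name in it (struct toy_zone, toy_watermark_ok(),
toy_migration_can_help()) is hypothetical, and the numbers assume 4 KiB
pages. It only illustrates why an order-0 watermark check that counts
free CMA pages passes on the node described in the report, and why
counting them can still be useful when non-CMA blocks hold migratable
pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical types and helpers, not kernel APIs. */
struct toy_zone {
	unsigned long free_pages;           /* all free pages, CMA included */
	unsigned long free_cma_pages;       /* free pages in CMA pageblocks */
	unsigned long movable_noncma_pages; /* migratable pages outside CMA */
	unsigned long watermark;            /* order-0 watermark, in pages */
};

/*
 * Order-0 watermark check. Free CMA pages count toward the total only
 * when allow_cma is set, loosely mirroring the role of ALLOC_CMA.
 */
static bool toy_watermark_ok(const struct toy_zone *z, bool allow_cma)
{
	unsigned long usable = z->free_pages;

	assert(z->free_cma_pages <= z->free_pages);
	if (!allow_cma)
		usable -= z->free_cma_pages;
	return usable >= z->watermark;
}

/*
 * Migrating pages into CMA frees non-CMA space only if there is
 * something migratable outside CMA and free room inside CMA.
 */
static bool toy_migration_can_help(const struct toy_zone *z)
{
	return z->movable_noncma_pages > 0 && z->free_cma_pages > 0;
}
```

With 16 GiB of free CMA (4194304 pages) and no free or migratable
non-CMA pages, toy_watermark_ok() is true with allow_cma and false
without it, while toy_migration_can_help() is false. Give the same node
some migratable non-CMA pages and toy_migration_can_help() becomes true,
which is the case where counting CMA pages is worth keeping.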