On 14 Mar 2025, at 16:50, Johannes Weiner wrote:

> On Fri, Mar 14, 2025 at 02:54:03PM -0400, Zi Yan wrote:
>> On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
>>
>>> The page allocator groups requests by migratetype to stave off
>>> fragmentation. However, in practice this is routinely defeated by the
>>> fact that it gives up *before* invoking reclaim and compaction - which
>>> may well produce suitable pages. As a result, fragmentation of
>>> physical memory is a common ongoing process in many load scenarios.
>>>
>>> Fragmentation deteriorates compaction's ability to produce huge
>>> pages. Depending on the lifetime of the fragmenting allocations, those
>>> effects can be long-lasting or even permanent, requiring drastic
>>> measures like forcible idle states or even reboots as the only
>>> reliable ways to recover the address space for THP production.
>>>
>>> In a kernel build test with supplemental THP pressure, the THP
>>> allocation rate steadily declines over 15 runs:
>>>
>>> thp_fault_alloc
>>>   61988
>>>   56474
>>>   57258
>>>   50187
>>>   52388
>>>   55409
>>>   52925
>>>   47648
>>>   43669
>>>   40621
>>>   36077
>>>   41721
>>>   36685
>>>   34641
>>>   33215
>>>
>>> This is a hurdle in adopting THP in any environment where hosts are
>>> shared between multiple overlapping workloads (cloud environments),
>>> and rarely experience true idle periods. To make THP a reliable and
>>> predictable optimization, there needs to be a stronger guarantee to
>>> avoid such fragmentation.
>>>
>>> Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
>>> its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
>>> is enforced on the allocator fastpath and the reclaiming slowpath.
>>>
>>> For now, fallbacks are permitted to avert OOMs. There is a plan to add
>>> defrag_mode=2 to prefer OOMs over fragmentation, but this requires
>>> additional prep work in compaction and the reserve management to make
>>> it ready for all possible allocation contexts.
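[For anyone wanting to try this: per the diffstat's Documentation/admin-guide/sysctl/vm.rst change, the knob appears to be exposed as a vm sysctl. The exact path below is an assumption based on that documentation file; verify it in a tree with the patch applied.]

```shell
# Assumed sysctl path for the new knob (inferred from the vm.rst change
# in the diffstat; check the patched tree's documentation):
echo 1 > /proc/sys/vm/defrag_mode   # invoke reclaim/compaction before falling back
echo 0 > /proc/sys/vm/defrag_mode   # default: vanilla fallback behavior
cat /proc/sys/vm/defrag_mode        # read back the current setting
```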
>>>
>>> The following test results are from a kernel build with periodic
>>> bursts of THP allocations, over 15 runs:
>>>
>>>                                        vanilla  defrag_mode=1
>>> @claimer[unmovable]:                       189            103
>>> @claimer[movable]:                          92            103
>>> @claimer[reclaimable]:                     207             61
>>> @pollute[unmovable from movable]:           25              0
>>> @pollute[unmovable from reclaimable]:       28              0
>>> @pollute[movable from unmovable]:        38835              0
>>> @pollute[movable from reclaimable]:     147136              0
>>> @pollute[reclaimable from unmovable]:      178              0
>>> @pollute[reclaimable from movable]:         33              0
>>> @steal[unmovable from movable]:             11              0
>>> @steal[unmovable from reclaimable]:          5              0
>>> @steal[reclaimable from unmovable]:        107              0
>>> @steal[reclaimable from movable]:           90              0
>>> @steal[movable from reclaimable]:          354              0
>>> @steal[movable from unmovable]:            130              0
>>>
>>> Both types of polluting fallbacks are eliminated in this workload.
>>>
>>> Interestingly, whole block conversions are reduced as well. This is
>>> because once a block is claimed for a type, its empty space remains
>>> available for future allocations, instead of being padded with
>>> fallbacks; this allows the native type to group up instead of
>>> spreading out to new blocks. The assumption in the allocator has been
>>> that pollution from movable allocations is less harmful than from
>>> other types, since they can be reclaimed or migrated out should the
>>> space be needed. However, since fallbacks occur *before*
>>> reclaim/compaction is invoked, movable pollution will still cause
>>> non-movable allocations to spread out and claim more blocks.
>>>
>>> Without fragmentation, THP rates hold steady with defrag_mode=1:
>>>
>>> thp_fault_alloc
>>>   32478
>>>   20725
>>>   45045
>>>   32130
>>>   14018
>>>   21711
>>>   40791
>>>   29134
>>>   34458
>>>   45381
>>>   28305
>>>   17265
>>>   22584
>>>   28454
>>>   30850
>>>
>>> While the downward trend is eliminated, the keen reader will of course
>>> notice that the baseline rate is much smaller than the vanilla
>>> kernel's to begin with.
>>> This is due to deficiencies in how reclaim and
>>> compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
>>> to which smaller allocations are competing with THPs for pageblocks,
>>> while making no effort themselves to reclaim or compact beyond their
>>> own request size. This effect already exists with the current usage of
>>> ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
>>> block stealing much more strongly.
>>>
>>> Subsequent patches will address defrag_mode reclaim strategy to raise
>>> the THP success baseline above the vanilla kernel.
>>
>> All makes sense to me. But is there a better name than defrag_mode?
>> It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
>> Or does it actually mean the THP defrag mode?
>
> Thanks for taking a look!
>
> I'm not set on defrag_mode, but I also couldn't think of anything
> better.
>
> The proximity to the THP flag name strikes me as beneficial, since
> it's an established term for "try harder to make huge pages".
>
> Suggestions welcome :)
>
>>> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>>> ---
>>>  Documentation/admin-guide/sysctl/vm.rst |  9 +++++++++
>>>  mm/page_alloc.c                         | 27 +++++++++++++++++++++++--
>>>  2 files changed, 34 insertions(+), 2 deletions(-)
>>>
>>
>> When I am checking ALLOC_NOFRAGMENT, I find that in get_page_from_freelist(),
>> ALLOC_NOFRAGMENT is removed when allocation goes into a remote node. I wonder
>> if this could reduce the anti-fragmentation effort on NUMA systems. Basically,
>> falling back to a remote node for allocation would fragment the remote node,
>> even if the remote node is trying hard to not fragment itself. Have you tested
>> on a NUMA system?
>
> There is this hunk in the patch:
>
> @@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> 			continue;
> 		}
>
> -		if (no_fallback && nr_online_nodes > 1 &&
> +		if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
> 		    zone != zonelist_zone(ac->preferred_zoneref)) {
> 			int local_nid;
>
> So it shouldn't clear the flag when spilling into the next node.
>
> Am I missing something?

Oh, I missed that part. Thank you for pointing it out.

Best Regards,
Yan, Zi