This is the short term solution "swap cluster order" listed on slide 8
of my "Swap Abstraction" discussion at the recent LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP orders"
was introduced, it only allocated mTHP swap entries from the new empty
cluster list. That works well for PMD size THP, but it has a serious
fragmentation issue reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/

The mTHP allocation failure rate rises to almost 100% after a few
hours in Barry's test run.

The reason is that all the empty clusters have been exhausted while
there are still plenty of free swap entries in clusters that are not
100% free.

Address this by remembering the swap allocation order in the cluster.
Keep track of a per order nonfull cluster list for later allocation.

This greatly improves the success rate of the mTHP swap allocation.
While I am still waiting for Barry's test results, here are Kairui's
test results:

I'm able to reproduce such an issue with a simple script (enabling all
orders of mTHP):

modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0

rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &

/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
        -P memcache_binary -n allkeys --key-minimum=1 \
        --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
        --ratio 1:0 --pipeline 8 -d 1024

Before:
Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
After:
Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74

And the fallback ratio dropped by a lot:

Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596

(Still collecting more data. The successful swpouts mostly happened
early; after that the fallbacks began to increase, approaching a 100%
failure rate.)

After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52

Reported-by: Barry Song <21cnbao@xxxxxxxxx>
Tested-by: Kairui Song <kasong@xxxxxxxxxxx>
Signed-off-by: Chris Li <chrisl@xxxxxxxxxx>
---
Chris Li (2):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list

 include/linux/swap.h |  18 ++--
 mm/swapfile.c        | 252 +++++++++++++++++----------------------------------
 2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
-- 
Chris Li <chrisl@xxxxxxxxxx>
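
Illustration (not part of the series): the idea of remembering each
cluster's allocation order and keeping per order nonfull lists can be
sketched as a small userspace model. Everything here is an assumption
made for illustration: the names swap_model, model_alloc and
model_free, the constants, the singly linked index lists (the series
itself switches clusters to a doubly linked list), and the simplification
that a cluster serves only one order at a time so a usage count is enough
to tell whether another entry fits. It is not the kernel code.

```c
#include <assert.h>

#define NR_ORDERS      10   /* hypothetical: orders 0..9 */
#define CLUSTER_SLOTS  512  /* hypothetical: swap entries per cluster */
#define NR_CLUSTERS    4    /* tiny device so exhaustion is easy to hit */
#define NO_CLUSTER     (-1)

struct cluster {
	int order;  /* order this cluster serves, -1 while empty */
	int used;   /* slots currently allocated */
	int next;   /* index of next cluster on the same list */
};

struct swap_model {
	struct cluster c[NR_CLUSTERS];
	int free_head;                /* completely empty clusters */
	int nonfull_head[NR_ORDERS];  /* partly used clusters, per order */
};

void model_init(struct swap_model *m)
{
	for (int i = 0; i < NR_ORDERS; i++)
		m->nonfull_head[i] = NO_CLUSTER;
	m->free_head = NO_CLUSTER;
	/* Push in reverse so the free list reads 0, 1, 2, ... */
	for (int i = NR_CLUSTERS - 1; i >= 0; i--) {
		m->c[i].order = -1;
		m->c[i].used = 0;
		m->c[i].next = m->free_head;
		m->free_head = i;
	}
}

/* Unlink cluster idx from the list starting at *head, if present. */
void list_remove(struct swap_model *m, int *head, int idx)
{
	int *pp = head;

	while (*pp != NO_CLUSTER && *pp != idx)
		pp = &m->c[*pp].next;
	if (*pp == idx)
		*pp = m->c[idx].next;
	m->c[idx].next = NO_CLUSTER;
}

/* Returns the cluster that served the request, or NO_CLUSTER. */
int model_alloc(struct swap_model *m, int order)
{
	assert(order >= 0 && order < NR_ORDERS);
	int idx = m->nonfull_head[order];

	if (idx == NO_CLUSTER) {
		/* No partly used cluster of this order: take an empty one. */
		idx = m->free_head;
		if (idx == NO_CLUSTER)
			return NO_CLUSTER;  /* a real caller would fall back */
		m->free_head = m->c[idx].next;
		m->c[idx].order = order;  /* remember the order */
		m->c[idx].next = NO_CLUSTER;
		m->nonfull_head[order] = idx;
	}
	m->c[idx].used += 1 << order;
	if (m->c[idx].used == CLUSTER_SLOTS) {
		/* idx is the list head here; a full cluster leaves the list. */
		m->nonfull_head[order] = m->c[idx].next;
		m->c[idx].next = NO_CLUSTER;
	}
	return idx;
}

/* Release one entry of the cluster's remembered order. */
void model_free(struct swap_model *m, int idx)
{
	struct cluster *cl = &m->c[idx];
	int order = cl->order;
	int was_full = (cl->used == CLUSTER_SLOTS);

	assert(order >= 0 && cl->used >= (1 << order));
	cl->used -= 1 << order;
	if (cl->used == 0) {
		/* Empty again: back to the free list, order forgotten. */
		if (!was_full)
			list_remove(m, &m->nonfull_head[order], idx);
		cl->order = -1;
		cl->next = m->free_head;
		m->free_head = idx;
	} else if (was_full) {
		/* Was full, now has room: rejoin its order's nonfull list. */
		cl->next = m->nonfull_head[order];
		m->nonfull_head[order] = idx;
	}
}
```

In this model, once the free list is empty, an order that already owns
a partly used cluster keeps succeeding, while an order with no cluster
of its own still fails until some cluster drains completely.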