Hi All,

Chris has been doing great work at [1] to clean up my mess in the mTHP
swap entry allocator. But Barry posted a test program and results at [2]
showing that even with Chris's changes, there are still some fallbacks
(around 5% - 25% in some cases). I was interested in why that might be
and ended up putting this PoC patch set together to try to get a better
understanding. This series ends up achieving 0% fallback, even with
small folios ("-s") enabled. I haven't done much testing beyond that
(yet) but thought it was worth posting on the strength of that result
alone.

At a high level this works in a similar way to Chris's series; it marks
a cluster as being for a particular order and if a new cluster cannot be
allocated then it scans through the existing non-full clusters. But it
does it by scanning through the clusters rather than assembling them
into a list. Cluster flags are used to mark clusters that have been
scanned and are known not to have enough contiguous space, so the
efficiency should be similar in practice.

Because it's not based around a linked list, there is less churn and I'm
wondering if this is perhaps easier to review and potentially even get
into v6.10-rcX to fix up what's already there, rather than having to
wait until v6.11 for Chris's series? I know Chris has a larger roadmap
of improvements, so at best I see this as a tactical fix that will
ultimately be superseded by Chris's work.

There are a few differences to note vs Chris's series:

- order-0 fallback scanning is still allowed in any cluster; the
  argument in the past was that swap should always use all the swap
  space, so I've left this mechanism in. It is only a fallback though;
  first the new per-order scanner is invoked, even for order-0, so if
  there are free slots in clusters already assigned for order-0, then
  the allocation will go there.
- CPUs can steal slots from other CPUs' current clusters; those clusters
  remain scannable while they are current for a CPU and are only made
  unscannable when no more CPUs are scanning that particular cluster.

- I'm preferring to allocate a free cluster ahead of per-order scanning,
  since, as I understand it, the original intent of a per-cpu current
  cluster was to get pages for an application adjacent in the swap to
  speed up IO.

I'd be keen to hear if you think we could get something like this into
v6.10 to fix the mess - I'm willing to work quickly to address comments
and do more testing. If not, then this is probably just a distraction
and we should concentrate on Chris's series.

This applies on top of v6.10-rc4.

[1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@xxxxxxxxx/

Thanks,
Ryan

Ryan Roberts (5):
  mm: swap: Simplify end-of-cluster calculation
  mm: swap: Change SWAP_NEXT_INVALID to highest value
  mm: swap: Track allocation order for clusters
  mm: swap: Scan for free swap entries in allocated clusters
  mm: swap: Optimize per-order cluster scanning

 include/linux/swap.h |  18 +++--
 mm/swapfile.c        | 164 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 157 insertions(+), 25 deletions(-)

--
2.43.0
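
For readers following along, the allocation order described in the cover
letter can be sketched as a user-space toy in C. This is not code from
the patches: the structure, the names (struct cluster, no_space,
cur_cluster, take_run, alloc_entry, free_entry), the tiny sizes and the
single-CPU model are all hypothetical, and clearing the flag on free is
an assumption rather than something the letter states.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CLUSTERS   2     /* toy sizes, chosen for illustration only */
#define CLUSTER_SLOTS 4
#define MAX_ORDER     2     /* largest run is 1 << 2 == CLUSTER_SLOTS */
#define ORDER_NONE    (-1)

struct cluster {
	int order;                /* order this cluster serves, or ORDER_NONE if free */
	bool no_space;            /* hypothetical flag: scanned, no run of `order` fits */
	bool used[CLUSTER_SLOTS];
};

static struct cluster clusters[NR_CLUSTERS];
static int cur_cluster[MAX_ORDER + 1];  /* single CPU: current cluster per order */

static void init_clusters(void)
{
	for (int c = 0; c < NR_CLUSTERS; c++) {
		clusters[c].order = ORDER_NONE;
		clusters[c].no_space = false;
		for (int i = 0; i < CLUSTER_SLOTS; i++)
			clusters[c].used[i] = false;
	}
	for (int o = 0; o <= MAX_ORDER; o++)
		cur_cluster[o] = -1;
}

/* Claim a naturally aligned run of 1 << order free slots in cluster c. */
static int take_run(int c, int order)
{
	int n = 1 << order;

	for (int base = 0; base + n <= CLUSTER_SLOTS; base += n) {
		int i = 0;

		while (i < n && !clusters[c].used[base + i])
			i++;
		if (i < n)
			continue;         /* run is partly used, try the next one */
		for (i = 0; i < n; i++)
			clusters[c].used[base + i] = true;
		return c * CLUSTER_SLOTS + base;
	}
	return -1;
}

/* Returns the first slot of the allocated run, or -1 (caller falls back). */
static int alloc_entry(int order)
{
	int slot;

	/* 1. Keep filling the current cluster for this order. */
	if (cur_cluster[order] >= 0) {
		slot = take_run(cur_cluster[order], order);
		if (slot >= 0)
			return slot;
	}
	/* 2. Prefer a completely free cluster and assign it to this order. */
	for (int c = 0; c < NR_CLUSTERS; c++) {
		if (clusters[c].order != ORDER_NONE)
			continue;
		clusters[c].order = order;
		cur_cluster[order] = c;
		return take_run(c, order);
	}
	/* 3. Scan clusters already assigned to this order, skipping flagged ones. */
	for (int c = 0; c < NR_CLUSTERS; c++) {
		if (clusters[c].order != order || clusters[c].no_space)
			continue;
		slot = take_run(c, order);
		if (slot >= 0) {
			cur_cluster[order] = c;
			return slot;
		}
		clusters[c].no_space = true;  /* do not rescan until something is freed */
	}
	/* 4. Order-0 only: last-resort scan of every cluster. */
	if (order == 0) {
		for (int c = 0; c < NR_CLUSTERS; c++) {
			slot = take_run(c, 0);
			if (slot >= 0)
				return slot;
		}
	}
	return -1;
}

/* Free a run; assumption: freeing makes the cluster worth scanning again. */
static void free_entry(int slot, int order)
{
	int c = slot / CLUSTER_SLOTS;

	for (int i = 0; i < (1 << order); i++)
		clusters[c].used[slot % CLUSTER_SLOTS + i] = false;
	clusters[c].no_space = false;
}
```

With these toy sizes, once one cluster is assigned to order-2 and the
other to order-0, a further order-2 request fails and flags the full
order-2 cluster, while an order-0 request that finds its own cluster
full can still take a slot from the order-2 cluster through step 4.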