This is the short-term solution "swap cluster order" listed on slide 8
of my "Swap Abstraction" discussion at the recent LSF/MM conference.

When commit 845982eb264bc ("mm: swap: allow storage of all mTHP orders")
was introduced, it only allocated mTHP swap entries from the new empty
cluster list. This has a fragmentation issue reported by Barry:

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/

The mTHP allocation failure rate rises to almost 100% after a few hours
in Barry's test run.

The reason is that all the empty clusters have been exhausted, while
there are still plenty of free swap entries in clusters that are not
100% free.

Remember the swap allocation order in the cluster. Keep track of a
per-order nonfull cluster list for later allocation.

This greatly improves the success rate of mTHP swap allocation.

There are some test numbers in the V1 thread of this series:
https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@xxxxxxxxxx

Reported-by: Barry Song <21cnbao@xxxxxxxxx>
Signed-off-by: Chris Li <chrisl@xxxxxxxxxx>
---
Changes in v2:
- Add the cluster state field to track the different phases of cluster
  allocations.
- Rename "next" to "list" for the list field, suggested by Ying.
- Update the comment on the locking rules for the cluster fields and
  list, suggested by Ying.
- The nonfull list avoids clusters that are on the per-CPU active
  cluster.
- Allocate from the nonfull list before attempting the free list,
  suggested by Kairui.
- Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@xxxxxxxxxx

---
Chris Li (2):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list

 include/linux/swap.h |  31 +++---
 mm/swapfile.c        | 270 ++++++++++++++++++---------------------------------
 2 files changed, 107 insertions(+), 194 deletions(-)
---
base-commit: 19b8422c5bd56fb5e7085995801c6543a98bda1f
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
-- 
Chris Li <chrisl@xxxxxxxxxx>
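
The per-order nonfull list idea described above can be illustrated with
a small userspace sketch. This is not code from the patches: every
sketch_* name, the 8-slot cluster size and the 4-order limit are
assumptions made for illustration only. It also uses a singly linked
list and leaves a cluster on its order's list when it becomes empty
again, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_ORDERS 4     /* hypothetical: orders 0..3 */
#define SKETCH_SLOTS     8     /* hypothetical: swap slots per cluster */
#define SKETCH_FULL      0xffu /* bitmap value when all 8 slots are used */

struct sketch_cluster {
	struct sketch_cluster *next; /* link on the free list or a nonfull list */
	int order;                   /* order this cluster currently serves */
	unsigned int used;           /* bitmap of allocated slots */
};

struct sketch_swap {
	struct sketch_cluster *free_list;                 /* empty clusters */
	struct sketch_cluster *nonfull[SKETCH_NR_ORDERS]; /* partly used, per order */
};

static void sketch_push(struct sketch_cluster **head, struct sketch_cluster *c)
{
	c->next = *head;
	*head = c;
}

static struct sketch_cluster *sketch_pop(struct sketch_cluster **head)
{
	struct sketch_cluster *c = *head;

	if (c) {
		*head = c->next;
		c->next = NULL;
	}
	return c;
}

/* Allocate 1 << order slots; returns the slot offset, or -1 on failure. */
static int sketch_alloc(struct sketch_swap *s, int order,
			struct sketch_cluster **out)
{
	unsigned int nr = 1u << order;
	unsigned int run = (1u << nr) - 1; /* nr consecutive bits */
	struct sketch_cluster *c = s->nonfull[order];
	unsigned int off;

	if (!c) {
		/* No nonfull cluster of this order: fall back to an empty one. */
		c = sketch_pop(&s->free_list);
		if (!c)
			return -1;
		c->order = order; /* remember the order in the cluster */
		c->used = 0;
		sketch_push(&s->nonfull[order], c);
	}
	/* c is the head of nonfull[order]; find an aligned free run. */
	for (off = 0; off < SKETCH_SLOTS; off += nr) {
		if (c->used & (run << off))
			continue;
		c->used |= run << off;
		if (c->used == SKETCH_FULL)
			sketch_pop(&s->nonfull[order]); /* full: drop from the list */
		*out = c;
		return (int)off;
	}
	assert(0 && "a cluster on a nonfull list always has a free run");
	return -1;
}

/* Free the entry at offset off; a full cluster rejoins its order's list. */
static void sketch_free(struct sketch_swap *s, struct sketch_cluster *c,
			int off)
{
	unsigned int run = ((1u << (1u << c->order)) - 1) << off;
	int was_full = (c->used == SKETCH_FULL);

	assert((c->used & run) == run);
	c->used &= ~run;
	if (was_full)
		sketch_push(&s->nonfull[c->order], c);
}
```

In this sketch, once the free list is empty, an order-2 request still
succeeds as long as some order-2 cluster has had an entry freed, which
is the case the cover letter describes as failing when only empty
clusters are considered.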