Chris Li <chrisl@xxxxxxxxxx> writes:

> I am spinning a new version for this series to address two issues
> found in this series:
>
> 1) Oppo discovered a bug in the following line:
> +        ci = si->cluster_info + tmp;
> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> That is a serious bug but trivial to fix.
>
> 2) order 0 allocation currently blindly scans swap_map disregarding
> the cluster->order.

IIUC, we now scan swap_map[] only if
!list_empty(&si->free_clusters) &&
!list_empty(&si->nonfull_clusters[order]).  That is, if you don't run
low on free swap space, you will not do that.

> Given enough order 0 swap allocations(close to the
> swap file size) the order 0 allocation head will eventually sweep
> across the whole swapfile and destroy other cluster order allocations.
>
> The short term fix is just skipping clusters that are already assigned
> to higher orders.

Better to do any further optimization on top of the simpler one.  We
need to evaluate whether it's necessary to add more complexity.

> In the long term, I want to unify the non-SSD to use clusters for
> locking and allocations as well, just try to follow the last
> allocation (less seeking) as much as possible.

I have thought about that too.  Personally, I think that it's good to
remove swap_map[] scanning.  The implementation can be simplified too.
I don't know whether we need to consider the performance of HDD swap
now.

--
Best Regards,
Huang, Ying

> On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
>>
>> This is the short term solution "swap cluster order" listed
>> in my "Swap Abstraction" discussion slide 8 in the recent
>> LSF/MM conference.
>>
>> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> orders" was introduced, it only allocated the mTHP swap entries
>> from the new empty cluster list. That works well for PMD size THP,
>> but it has a serious fragmentation issue reported by Barry.
>>
>> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
>>
>> The mTHP allocation failure rate rises to almost 100% after a few
>> hours in Barry's test run.
>>
>> The reason is that all the empty clusters have been exhausted while
>> there are plenty of free swap entries in the clusters that are
>> not 100% free.
>>
>> Address this by remembering the swap allocation order in the cluster.
>> Keep track of the per order non full cluster list for later allocation.
>>
>> This greatly improves the success rate of the mTHP swap allocation.
>> While I am still waiting for Barry's test result, I paste Kairui's test
>> result here:
>>
>> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
>>
>> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
>> swapoff -a
>> mkswap /dev/ram0
>> swapon /dev/ram0
>>
>> rmdir /sys/fs/cgroup/benchmark
>> mkdir -p /sys/fs/cgroup/benchmark
>> cd /sys/fs/cgroup/benchmark
>> echo 8G > memory.max
>> echo $$ > cgroup.procs
>>
>> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>>
>> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
>>   -P memcache_binary -n allkeys --key-minimum=1 \
>>   --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
>>   --ratio 1:0 --pipeline 8 -d 1024
>>
>> Before:
>> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
>> After:
>> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
>>
>> And the fallback ratio dropped by a lot:
>> Before:
>> hugepages-32kB/stats/anon_swpout_fallback:15997
>> hugepages-32kB/stats/anon_swpout:18712
>> hugepages-512kB/stats/anon_swpout_fallback:192
>> hugepages-512kB/stats/anon_swpout:0
>> hugepages-2048kB/stats/anon_swpout_fallback:2
>> hugepages-2048kB/stats/anon_swpout:0
>> hugepages-1024kB/stats/anon_swpout_fallback:0
>> hugepages-1024kB/stats/anon_swpout:0
>> hugepages-64kB/stats/anon_swpout_fallback:18246
>> hugepages-64kB/stats/anon_swpout:17644
>> hugepages-16kB/stats/anon_swpout_fallback:13701
>> hugepages-16kB/stats/anon_swpout:18234
>> hugepages-256kB/stats/anon_swpout_fallback:8642
>> hugepages-256kB/stats/anon_swpout:93
>> hugepages-128kB/stats/anon_swpout_fallback:21497
>> hugepages-128kB/stats/anon_swpout:7596
>>
>> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
>>
>> After:
>> hugepages-32kB/stats/swpout:34445
>> hugepages-32kB/stats/swpout_fallback:0
>> hugepages-512kB/stats/swpout:1
>> hugepages-512kB/stats/swpout_fallback:134
>> hugepages-2048kB/stats/swpout:1
>> hugepages-2048kB/stats/swpout_fallback:1
>> hugepages-1024kB/stats/swpout:6
>> hugepages-1024kB/stats/swpout_fallback:0
>> hugepages-64kB/stats/swpout:35495
>> hugepages-64kB/stats/swpout_fallback:0
>> hugepages-16kB/stats/swpout:32441
>> hugepages-16kB/stats/swpout_fallback:0
>> hugepages-256kB/stats/swpout:2223
>> hugepages-256kB/stats/swpout_fallback:6278
>> hugepages-128kB/stats/swpout:29136
>> hugepages-128kB/stats/swpout_fallback:52
>>
>> Reported-by: Barry Song <21cnbao@xxxxxxxxx>
>> Tested-by: Kairui Song <kasong@xxxxxxxxxxx>
>> Signed-off-by: Chris Li <chrisl@xxxxxxxxxx>
>> ---
>> Chris Li (2):
>>       mm: swap: swap cluster switch to double link list
>>       mm: swap: mTHP allocate swap entries from nonfull list
>>
>>  include/linux/swap.h |  18 ++--
>>  mm/swapfile.c        | 252 +++++++++++++++++---------------------------------
>>  2 files changed, 93 insertions(+), 177 deletions(-)
>> ---
>> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
>> change-id: 20240523-swap-allocator-1534c480ece4
>>
>> Best regards,
>> --
>> Chris Li <chrisl@xxxxxxxxxx>
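
For readers following item 1) and the short term fix in item 2) of this
thread, here is a minimal user-space C sketch of the two ideas. It is not
the kernel code: the struct, every sketch_ name, and the value 512 for the
cluster size are assumptions made for illustration only. Treating a
cluster with no entries in use as "not yet assigned to any order" is
likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 512 swap entries per cluster, for illustration only. */
#define SKETCH_SWAPFILE_CLUSTER 512

/* Hypothetical, simplified stand-in for per-cluster metadata. */
struct sketch_cluster {
	unsigned int order;	/* allocation order this cluster serves */
	unsigned int count;	/* number of entries currently in use */
};

/*
 * Map a swap_map[] offset to the metadata of the cluster containing it.
 * The index is offset / SKETCH_SWAPFILE_CLUSTER; using the raw offset as
 * the index (the bug described in item 1) would point far past the
 * intended element for any offset beyond the first cluster.
 */
static struct sketch_cluster *
sketch_offset_to_cluster(struct sketch_cluster *cluster_info,
			 size_t nr_clusters, size_t offset)
{
	size_t idx = offset / SKETCH_SWAPFILE_CLUSTER;

	assert(idx < nr_clusters);
	return &cluster_info[idx];
}

/*
 * Short term fix described in item 2: an order 0 scan skips clusters
 * that are already assigned to a higher order. Assumption: a cluster
 * with no entries in use is treated as unassigned and may be taken.
 */
static int sketch_order0_may_use(const struct sketch_cluster *ci)
{
	return ci->count == 0 || ci->order == 0;
}
```

With a 512-entry cluster, offsets 0..511 map to cluster 0 and offset 512
maps to cluster 1, so an order 0 scan can look up the cluster for each
candidate offset and step over the ones that sketch_order0_may_use()
rejects.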