On Tue, Oct 22, 2024 at 12:29 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> This series greatly improves swap allocator performance by reworking
> the locking design and simplifying many code paths.
>
> This is a follow-up to the previous swap cluster allocator series:
> https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@xxxxxxxxxx/
>
> And this series is based on a follow-up fix for the swap cluster
> allocator:
> https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@xxxxxxxxx/
>
> This is part of the new swap allocator work item discussed in
> Chris's "Swap Abstraction" discussion at LSF/MM 2024, and the
> "mTHP and swap allocator" discussion at LPC 2024.
>
> The previous series introduced a fully cluster based allocation
> algorithm; this series completely removes the old allocation path and
> makes the allocator avoid grabbing si->lock unless needed. This brings
> a huge performance gain and removes the slot cache on the freeing path.
>
> Currently, swap locking is mainly composed of two locks: the cluster
> lock (ci->lock) and the device lock (si->lock). The device lock is
> widely used to protect many things, making it the main bottleneck
> for SWAP.
>
> The cluster lock is much more fine-grained, so it is best to use
> ci->lock instead of si->lock as much as possible.
>
> `perf lock` shows this issue clearly.
> Doing a Linux kernel build using tmpfs and ZRAM with limited memory
> (make -j64 with a 1G memcg and 4k pages), "perf lock contention -ab
> sleep 3" reports:
>
> contended  total wait   max wait   avg wait   type   caller
>
>     34948    53.63 s    7.11 ms    1.53 ms  spinlock free_swap_and_cache_nr+0x350
>     16569    40.05 s    6.45 ms    2.42 ms  spinlock get_swap_pages+0x231
>     11191    28.41 s    7.03 ms    2.54 ms  spinlock swapcache_free_entries+0x59
>      4147    22.78 s  122.66 ms    5.49 ms  spinlock page_vma_mapped_walk+0x6f3
>      4595     7.17 s    6.79 ms    1.56 ms  spinlock swapcache_free_entries+0x59
>    406027     2.74 s    2.59 ms    6.74 us  spinlock list_lru_add+0x39
> ...snip...
>
> The top 5 callers are all users of si->lock; their total wait time
> sums to several minutes within the 3 second window.
>
> Following the new allocator design, many operations don't need to
> touch si->lock at all. We only need to take si->lock when doing
> operations across multiple clusters (e.g. changing the cluster
> lists); other operations only need to take ci->lock. So ideally the
> allocator should always take ci->lock first and then, if needed,
> take si->lock. But for historical reasons, ci->lock is used inside
> si->lock by design, causing a lock inversion if we simply try to
> acquire si->lock after acquiring ci->lock.
>
> This series audits all si->lock usage, simplifies legacy code, and
> eliminates si->lock usage as much as possible by introducing new
> designs based on the new cluster allocator.
>
> The old HDD allocation code is removed, and the cluster allocator is
> adapted with small changes for HDD usage; testing looks OK.
>
> This also removes the slot cache on the freeing path.
> The performance is better without it, and this enables further
> cleanups and optimizations, as discussed before:
> https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@xxxxxxxxxxxxxx/
>
> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` in the same test as above:
>
> contended  total wait   max wait   avg wait   type   caller
> ... snip ...
>        91  204.62 us    4.51 us    2.25 us  spinlock cluster_move+0x2e
> ... snip ...
>        47  125.62 us    4.47 us    2.67 us  spinlock cluster_move+0x2e
> ... snip ...
>        23   63.15 us    3.95 us    2.74 us  spinlock cluster_move+0x2e
> ... snip ...
>        17   41.26 us    4.58 us    2.43 us  spinlock cluster_isolate_lock+0x1d
> ... snip ...
>
> cluster_move and cluster_isolate_lock are now basically the only
> users of si->lock; the performance gain is huge, with reduced LOC.
>
> Tests
> ===
>
> Build kernel with defconfig on tmpfs with ZRAM as swap:
> ---
>
> Running a test matrix which is scaled up progressively for an
> intuitive result. The tests are run on top of tmpfs, using a memory
> cgroup for memory limitation, on a 48c96t system.
>
> 12 test runs for each case. It can be seen clearly that as the
> concurrent job number goes higher, the performance gain grows, and
> performance is better even with low concurrency.
>
> make -j<NR>       | System Time (seconds)    | Total Time (seconds)
> (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> With 4k pages only:
> 6 / 192M / 3G     | 5258 / 5235 / -0.3%      | 1420 / 1414 / -0.3%
> 12 / 256M / 4G    | 5518 / 5337 / -3.3%      | 758 / 742 / -2.1%
> 24 / 384M / 5G    | 7091 / 5766 / -18.7%     | 476 / 422 / -11.3%
> 48 / 768M / 7G    | 11139 / 5831 / -47.7%    | 330 / 221 / -33.0%
> 96 / 1.5G / 10G   | 21303 / 11353 / -46.7%   | 283 / 180 / -36.4%
> With 64k mTHP:
> 24 / 512M / 5G    | 5104 / 4641 / -18.7%     | 376 / 358 / -4.8%
> 48 / 1G / 7G      | 8693 / 4662 / -18.7%     | 257 / 176 / -31.5%
> 96 / 2G / 10G     | 17056 / 10263 / -39.8%   | 234 / 169 / -27.8%
>
> With a more aggressive setup, it shows clearly that both performance
> and fragmentation are better:
>
> time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 73578.30, Real time: 864.05
>
> time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> After: (-54.7% sys time, -49.3% real time)
> Sys time: 33314.76, Real time: 437.67
>
> time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 74044.85, Real time: 846.51
> hugepages-64kB/stats/swpout: 1735216
> hugepages-64kB/stats/swpout_fallback: 430333
> After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> Sys time: 35958.87, Real time: 442.69
> hugepages-64kB/stats/swpout: 1866267
> hugepages-64kB/stats/swpout_fallback: 158330
>
> There is up to a 54.7% improvement for the kernel build test, and a
> lower fragmentation rate.
> The performance improvement should be even larger for micro
> benchmarks.
>
> Build kernel with tinyconfig on tmpfs with HDD as swap:
> ---
>
> This test is similar to the above, but the HDD test is very noisy and
> slow and the deviation is huge, so tinyconfig is used instead, taking
> the median of 3 test runs, which looks OK:
>
> Before this series:
> 114.44user 29.11system 39:42.90elapsed 6%CPU
> 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>
> After this commit:
> 113.90user 23.81system 38:11.77elapsed 6%CPU
> 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>
> Single thread SWAP:
> ---
>
> Sequential SWAP should also be slightly faster, as a lot of
> unnecessary parts were removed. Tested using a micro benchmark that
> swaps out/in 4G of zero-filled memory using ZRAM, 10 test runs:
>
> Swapout Before (avg. 3359304):
> 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>
> Swapin Before (avg. 1928698):
> 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>
> Swapout After (avg. 3347511, -0.4%):
> 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>
> Swapin After (avg. 1922290, -0.3%):
> 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

Unfortunately I don't have the time to go through this series, but I
just wanted to say that this is awesome work, Kairui. Selfishly, I
especially like cleaning up the swap slot freeing path, and having a
centralized freeing path with a single call to zswap_invalidate().

Thanks for doing this :)