On Tue, Oct 22, 2024 at 12:29 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> This series greatly improves swap allocator performance by reworking
> the locking design and simplifying many code paths.
>
> This is a follow-up to the previous swap cluster allocator series:
> https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@xxxxxxxxxx/
>
> And this series is based on a follow-up fix for the swap cluster
> allocator:
> https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@xxxxxxxxx/
>
> This is part of the new swap allocator work item discussed in
> Chris's "Swap Abstraction" discussion at LSF/MM 2024, and the
> "mTHP and swap allocator" discussion at LPC 2024.
>
> The previous series introduced a fully cluster based allocation
> algorithm; this series completely removes the old allocation path and
> makes the allocator avoid grabbing si->lock unless needed. This brings
> a huge performance gain and removes the slot cache on the freeing path.
>
> Currently, swap locking is mainly composed of two locks: the cluster
> lock (ci->lock) and the device lock (si->lock). The device lock is
> widely used to protect many things, making it the main bottleneck
> for SWAP.
>
> The cluster lock is much more fine-grained, so it is best to use
> ci->lock instead of si->lock as much as possible.
>
> `perf lock` shows this issue clearly.
> Doing a Linux kernel build using tmpfs and ZRAM with limited memory
> (make -j64 with a 1G memcg and 4k pages), "perf lock contention -ab
> sleep 3" reports:
>
> contended  total wait   max wait   avg wait   type   caller
>
>     34948    53.63 s    7.11 ms    1.53 ms  spinlock free_swap_and_cache_nr+0x350
>     16569    40.05 s    6.45 ms    2.42 ms  spinlock get_swap_pages+0x231
>     11191    28.41 s    7.03 ms    2.54 ms  spinlock swapcache_free_entries+0x59
>      4147    22.78 s  122.66 ms    5.49 ms  spinlock page_vma_mapped_walk+0x6f3
>      4595     7.17 s    6.79 ms    1.56 ms  spinlock swapcache_free_entries+0x59
>    406027     2.74 s    2.59 ms    6.74 us  spinlock list_lru_add+0x39
> ...snip...
>
> The top 5 callers are all users of si->lock; their total wait time
> sums to several minutes within the 3 second window.
>
> Following the new allocator design, many operations don't need to
> touch si->lock at all. We only need to take si->lock when doing
> operations across multiple clusters (e.g. changing the cluster
> lists); other operations only need to take ci->lock. So ideally the
> allocator should always take ci->lock first and then, if needed,
> take si->lock. But for historical reasons, ci->lock is used inside
> si->lock by design, causing a lock inversion if we simply try to
> acquire si->lock after acquiring ci->lock.
>
> This series audits all si->lock usage, simplifies legacy code, and
> eliminates si->lock usage as much as possible by introducing new
> designs based on the new cluster allocator.
>
> The old HDD allocation code is removed, and the cluster allocator is
> adapted with small changes for HDD usage; testing looks OK.
>
> This also removes the slot cache on the freeing path.
> The performance is better without it, and this enables further
> cleanups and optimizations, as discussed before:
> https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@xxxxxxxxxxxxxx/
>
> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` in the same test as above:
>
> contended  total wait   max wait   avg wait   type   caller
> ... snip ...
>        91  204.62 us    4.51 us    2.25 us  spinlock cluster_move+0x2e
> ... snip ...
>        47  125.62 us    4.47 us    2.67 us  spinlock cluster_move+0x2e
> ... snip ...
>        23   63.15 us    3.95 us    2.74 us  spinlock cluster_move+0x2e
> ... snip ...
>        17   41.26 us    4.58 us    2.43 us  spinlock cluster_isolate_lock+0x1d
> ... snip ...
>
> cluster_move and cluster_isolate_lock are now basically the only
> users of si->lock; the performance gain is huge, with reduced LOC.
>
> Tests
> ===
>
> Build kernel with defconfig on tmpfs with ZRAM as swap:
> ---
>
> Running a test matrix which is scaled up progressively for an
> intuitive result. The tests are run on top of tmpfs, using a memory
> cgroup for memory limitation, on a 48c96t system.
>
> 12 test runs for each case. It can be seen clearly that as the
> concurrent job number goes higher, the performance gain grows, and
> performance is better even with low concurrency.
>
> make -j<NR>       | System Time (seconds)    | Total Time (seconds)
> (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> With 4k pages only:
> 6 / 192M / 3G     | 5258 / 5235 / -0.3%      | 1420 / 1414 / -0.3%
> 12 / 256M / 4G    | 5518 / 5337 / -3.3%      | 758 / 742 / -2.1%
> 24 / 384M / 5G    | 7091 / 5766 / -18.7%     | 476 / 422 / -11.3%
> 48 / 768M / 7G    | 11139 / 5831 / -47.7%    | 330 / 221 / -33.0%
> 96 / 1.5G / 10G   | 21303 / 11353 / -46.7%   | 283 / 180 / -36.4%
> With 64k mTHP:
> 24 / 512M / 5G    | 5104 / 4641 / -18.7%     | 376 / 358 / -4.8%
> 48 / 1G / 7G      | 8693 / 4662 / -18.7%     | 257 / 176 / -31.5%
> 96 / 2G / 10G     | 17056 / 10263 / -39.8%   | 234 / 169 / -27.8%
>
> With a more aggressive setup, it shows clearly that both performance
> and fragmentation are better:
>
> time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 73578.30, Real time: 864.05
>
> time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> After: (-54.7% sys time, -49.3% real time)
> Sys time: 33314.76, Real time: 437.67
>
> time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 74044.85, Real time: 846.51
> hugepages-64kB/stats/swpout: 1735216
> hugepages-64kB/stats/swpout_fallback: 430333
> After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> Sys time: 35958.87, Real time: 442.69
> hugepages-64kB/stats/swpout: 1866267
> hugepages-64kB/stats/swpout_fallback: 158330
>
> There is up to a 54.7% improvement for the kernel build test, and a
> lower fragmentation rate.
> The performance improvement should be even larger for micro
> benchmarks.
>
> Build kernel with tinyconfig on tmpfs with HDD as swap:
> ---
>
> This test is similar to the above, but the HDD test is very noisy and
> slow and the deviation is huge, so tinyconfig is used instead, taking
> the median of 3 test runs, which looks OK:
>
> Before this series:
> 114.44user 29.11system 39:42.90elapsed 6%CPU
> 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>
> After this commit:
> 113.90user 23.81system 38:11.77elapsed 6%CPU
> 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>
> Single thread SWAP:
> ---
>
> Sequential SWAP should also be slightly faster, as a lot of
> unnecessary parts were removed. Tested using a micro benchmark that
> swaps out/in 4G of zero-filled memory using ZRAM, 10 test runs:
>
> Swapout Before (avg. 3359304):
> 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>
> Swapin Before (avg. 1928698):
> 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>
> Swapout After (avg. 3347511, -0.4%):
> 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>
> Swapin After (avg. 1922290, -0.3%):
> 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

Unfortunately I don't have the time to go through this series, but I
just wanted to say that this is awesome work, Kairui. Selfishly, I
especially like cleaning up the swap slot freeing path, and having a
centralized freeing path with a single call to zswap_invalidate().

Thanks for doing this :)