Hi, Kairui,

Kairui Song <ryncsn@xxxxxxxxx> writes:

> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
> bypass swapin):
> https://lore.kernel.org/linux-mm/20240219082040.7495-1-ryncsn@xxxxxxxxx/
>
> Because we have to spin on the swap map on race, and swap map is too small
> to contain more usable info, an ugly schedule_timeout_uninterruptible(1)
> is added. It's not the first time a hackish workaround was added for cache
> bypass swapin and not the last time. I did many experiments locally to
> see if the swap cache bypass path can be dropped while keeping the
> performance still comparable. And it seems doable.
>

In general, I think that it's a good idea to unify cache bypass swapin
and normal swapin.  But I haven't dived into the implementation yet.

> This series does the following things:
> 1. Remove swap cache bypass completely.
> 2. Apply multiple optimizations after that, these optimizations are
>    either undoable or very difficult to do without dropping the cache
>    bypass swapin path.
> 3. Use swap cache as a synchronization layer, also unify some code
>    with page cache (filemap).
>
> As a result, we have:
> 1. A comparable performance, some tests are even faster.
> 2. Multi-index support for swap cache.
> 3. Removed many hackish workarounds including above long tailing
>    issue is gone.
>
> Sending this as RFC to collect some discussion, suggestion, or rejection
> early, this seems need to be split into multiple series, but the
> performance is not good until the last patch so I think start by
> separating them may make this approach not very convincing. And there
> are still some (maybe further) TODO items and optimization space
> if we are OK with this approach.
>
> This is based on my another series, for reusing filemap code for swapcache:
> [PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
> https://lore.kernel.org/linux-mm/20240325171405.99971-1-ryncsn@xxxxxxxxx/
>
> Patch 1/10, introduce a helper from filemap side to be used later.
> Patch 2/10, 3/10 are clean up and prepare for removing the swap cache
>   bypass swapin path.
> Patch 4/10, removed the swap cache bypass swapin path, and the
>   performance drop heavily (-28%).
> Patch 5/10, apply the first optimization after the removal, since all
>   folios goes through swap cache now, there is no need to explicit shadow
>   clearing any more.
> Patch 6/10, apply another optimization after clean up shadow clearing
>   routines. Now swapcache is very alike page cache, so just reuse page
>   cache code and we will have multi-index support. Shadow memory usage
>   dropped a lot.
> Patch 7/10, just rename __read_swap_cache_async, it will be refactored
>   and a key part of this series, and the naming is very confusing to me.
> Patch 8/10, make swap cache as a synchronization layer, introduce two
>   helpers for adding folios to swap cache, caller will either succeed or
>   get a folio to wait on.
> Patch 9/10, apply another optimization. With above two helpers, looking
>   up of swapcache can be optimized and avoid false looking up, which
>   helped improve the performance.
> Patch 10/10, apply a major optimization for SWP_SYNCHRONOUS_IO devices,
>   after this commit, performance for simple swapin/swapout is basically
>   same as before.
>
> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>
>                Before (us)        After (us)
> Swapout:       33619409           33886008
> Swapin:        32393771           32465441 (- 0.2%)
> Swapout (THP): 7817909            6899938  (+11.8%)
> Swapin (THP) : 32452387           33193479 (- 2.2%)

If my understanding is correct, we don't have swapin (THP) support
yet.  Right?

> And after swapping out 30G with THP, the radix node usage dropped by a
> lot:
>
> Before: radix_tree_node 73728K
> After:  radix_tree_node 7056K (-94%)

Good!

> Test 2:
> Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=300 --report-interval=10 run
>
> Before: transactions: 4849.25 per sec
> After:  transactions: 4849.40 per sec
>
> Test 3:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
>   echo never > /sys/kernel/mm/transparent_hugepage/enabled
>   echo 100 > /sys/module/zswap/parameters/max_pool_percent
>   echo 1 > /sys/module/zswap/parameters/enabled
>   echo y > /sys/module/zswap/parameters/shrinker_enabled
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 1662.90 per sec
> After:  transactions: 1726.52 per sec

3.8% improvement.  Good!

> Test 4:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
>   echo always > /sys/kernel/mm/transparent_hugepage/enabled
>   echo 100 > /sys/module/zswap/parameters/max_pool_percent
>   echo 1 > /sys/module/zswap/parameters/enabled
>   echo y > /sys/module/zswap/parameters/shrinker_enabled
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 2860.90 per sec.
> After:  transactions: 2802.55 per sec.
>
> Test 5:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP never):
>
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
>   memtier_benchmark -S /tmp/memcached.socket \
>   -P memcache_binary -n allkeys --key-minimum=1 \
>   --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
>   --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 106730.31 Ops/sec
> After:  106360.11 Ops/sec
>
> Test 6:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP always):
>
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
>   memtier_benchmark -S /tmp/memcached.socket \
>   -P memcache_binary -n allkeys --key-minimum=1 \
>   --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
>   --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 83193.11 Ops/sec
> After:  82504.89 Ops/sec
>
> These tests are tested under heavy memory stress, and the performance
> seems basically same as before, very slightly better/worse for certain
> cases, the benefits of multi-index are basically erased by
> fragmentation and workingset nodes usage is slightly lower.
>
> Some (maybe further) TODO items if we are OK with this approach:
>
> - I see a slight performance regression for THP tests,
>   could identify a clear hotspot with perf, my guess is the
>   contention on the xa_lock is an issue (we have a xa_lock for
>   every 64M swap cache space), THP handling needs to take the lock
>   longer than usual. Splitting the xa_lock to be more
>   fine-grained seems a good solution. We have
>   SWAP_ADDRESS_SPACE_SHIFT = 14 which is not an optimal value.
>   Considering XA_CHUNK_SHIFT is 6, we will have three layer of Xarray
>   just for 2 extra bits. 12 should be better to always make use of
>   the whole XA chunk and having two layers at most. But duplicated
>   address_space struct also wastes more memory and cacheline.
>   I see an observable performance drop (~3%) after change
>   SWAP_ADDRESS_SPACE_SHIFT to 12. Might be a good idea to
>   decouple swap cache xarray from address_space (there are
>   too many user for swapcache, shouldn't come too dirty).
>
> - Actually after Patch 4/10, the performance is much better for
>   tests limited with memory cgroup, until 10/10 applied the direct swap
>   cache freeing logic for SWP_SYNCHRONOUS_IO swapin. Because if the swap
>   device is not near full, swapin doesn't clear up the swapcache, so
>   repeated swapout doesn't need to re-alloc a swap entry, make things
>   faster. This may indicate that lazy freeing of swap cache could benefit
>   certain workloads and may worth looking into later.
>
> - Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
>   swap cache after swapin is done, which can be cleaned up and optimized
>   further after this patch. Device type will only determine the
>   readahead logic, and swap cache drop check can be based purely on swap
>   count.
>
> - Recent mTHP swapin/swapout series should have no fundamental
>   conflict with this.
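
As a quick sanity check of the numbers in the first TODO item above
(one xa_lock per 64M of swap cache space, three XArray layers for
SWAP_ADDRESS_SPACE_SHIFT = 14 versus two for 12), here is a minimal
sketch.  It is not kernel code: it assumes 4 KiB pages, which the
thread does not state, and it models the tree depth simply as
ceil(index_bits / XA_CHUNK_SHIFT).  The function names are
hypothetical and exist only for this illustration.

```python
import math

PAGE_SIZE = 4096      # assumption: 4 KiB pages (not stated in the thread)
XA_CHUNK_SHIFT = 6    # from the cover letter: each XArray node indexes 6 bits


def xarray_levels(index_bits: int, chunk_shift: int = XA_CHUNK_SHIFT) -> int:
    """Levels needed to cover index_bits when each level consumes chunk_shift bits."""
    return math.ceil(index_bits / chunk_shift)


def space_covered_mib(shift: int, page_size: int = PAGE_SIZE) -> int:
    """Swap space (in MiB) covered by one address_space holding 2**shift pages."""
    return (1 << shift) * page_size // (1 << 20)


if __name__ == "__main__":
    for shift in (14, 12):
        print(f"shift={shift}: {space_covered_mib(shift)} MiB per address_space, "
              f"{xarray_levels(shift)} XArray levels")
```

Under these assumptions, a shift of 14 gives 64 MiB per address_space
and three levels, while a shift of 12 gives 16 MiB and two levels, so
it also means four times as many address_space structs for the same
swap size, which is the memory and cacheline trade-off mentioned in
the quoted text.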
>
> Kairui Song (10):
>   mm/filemap: split filemap storing logic into a standalone helper
>   mm/swap: move no readahead swapin code to a stand-alone helper
>   mm/swap: convert swapin_readahead to return a folio
>   mm/swap: remove cache bypass swapin
>   mm/swap: clean shadow only in unmap path
>   mm/swap: switch to use multi index entries
>   mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
>   mm/swap: use swap cache as a synchronization layer
>   mm/swap: delay the swap cache look up for swapin
>   mm/swap: optimize synchronous swapin
>
>  include/linux/swapops.h |   5 +-
>  mm/filemap.c            | 161 +++++++++-----
>  mm/huge_memory.c        |  78 +++----
>  mm/internal.h           |   2 +
>  mm/memory.c             | 133 ++++-------
>  mm/shmem.c              |  44 ++--
>  mm/swap.h               |  71 ++++--
>  mm/swap_state.c         | 478 +++++++++++++++++++++-------------------
>  mm/swapfile.c           |  64 +++---
>  mm/vmscan.c             |   8 +-
>  mm/workingset.c         |   2 +-
>  mm/zswap.c              |   4 +-
>  12 files changed, 540 insertions(+), 510 deletions(-)

--
Best Regards,
Huang, Ying