From: Kairui Song <kasong@xxxxxxxxxxx>

This series unifies and cleans up the swapin path, introduces a minor
optimization, and makes both shmem and swapoff use the SWP_SYNCHRONOUS_IO
flag to skip readahead and the swap cache for better performance.

Test results:

- swap out 10G of zero-filled data to ZRAM, then read it back in:
  Before: 11143285 us
  After:  10692644 us (+4.1%)

- swapping off a 10G ZRAM (lzo-rle) after the same workload:
  Before:
  time swapoff /dev/zram0
  real    0m12.337s
  user    0m0.001s
  sys     0m12.329s

  After:
  time swapoff /dev/zram0
  real    0m9.728s
  user    0m0.001s
  sys     0m9.719s

- shmem FIO test 1 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:0.5 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before:
  bw (  MiB/s): min= 1167, max= 1732, per=100.00%, avg=1460.82, stdev= 4.38, samples=9536
  iops        : min=298938, max=443557, avg=373964.41, stdev=1121.27, samples=9536

  After (+3.5%):
  bw (  MiB/s): min= 1285, max= 1738, per=100.00%, avg=1512.88, stdev= 4.34, samples=9456
  iops        : min=328957, max=445105, avg=387294.21, stdev=1111.15, samples=9456

- shmem FIO test 2 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before:
  bw (  MiB/s): min= 5296, max= 7112, per=100.00%, avg=6131.93, stdev=17.09, samples=9536
  iops        : min=1355934, max=1820833, avg=1569769.11, stdev=4375.93, samples=9536

  After (+3.1%):
  bw (  MiB/s): min= 5466, max= 7173, per=100.00%, avg=6324.51, stdev=16.66, samples=9521
  iops        : min=1399355, max=1836435, avg=1619068.90, stdev=4263.94, samples=9521

- Some built objects are very slightly smaller (gcc 13.2.1):

  ./scripts/bloat-o-meter ./vmlinux ./vmlinux.new
  add/remove: 4/2 grow/shrink: 1/10 up/down: 818/-983 (-165)
  Function                                     old     new   delta
  swapin_entry                                   -     482    +482
  mm_counter                                     -     248    +248
  shmem_swapin_folio                          1412    1468     +56
  __pfx_swapin_entry                             -      16     +16
  __pfx_mm_counter                               -      16     +16
  __read_swap_cache_async                      738     736      -2
  copy_present_pte                            1258    1249      -9
  mem_cgroup_swapin_charge_folio               297     285     -12
  __pfx_swapin_readahead                        16       -     -16
  swap_cache_get_folio                         364     345     -19
  do_anonymous_page                           1488    1458     -30
  unuse_pte_range                              889     833     -56
  free_p4d_range                               524     446     -78
  restore_exclusive_pte                        937     822    -115
  do_swap_page                                2969    2817    -152
  swapin_readahead                             239       -    -239
  Total: Before=26056243, After=26056078, chg -0.00%

V2: https://lore.kernel.org/linux-mm/20240102175338.62012-1-ryncsn@xxxxxxxxx/

Update from V2:
- Many code path clean-ups (merge swapin_entry with swapin_entry_mpol,
  drop the second param of mem_cgroup_swapin_charge_folio, have
  swapin_entry take a pointer to folio as the return value instead of a
  pointer to boolean) to reduce LOC and logic, thanks to Huang, Ying.
- Don't use cluster readahead for swapoff; its performance is worse
  than VMA readahead on NVMe.
- Add a refactor patch for swap_cache_get_folio.

V1: https://lore.kernel.org/linux-mm/20231119194740.94101-1-ryncsn@xxxxxxxxx/T/

Update from V1:
- Rebased on mm-unstable.
- Removed behaviour-changing patches; they will be submitted in a
  separate series later.
- Code style, naming, and comment updates.
- Thanks to Chris Li for a very detailed and helpful review of V1.
  Thanks to Matthew Wilcox and Huang Ying for helpful suggestions.
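For readers skimming the series, a rough sketch of the policy the
unified swapin helper implements (an illustration only, not the exact
code from these patches; swapin_direct() is a stand-in name for the
no-readahead path and the signatures are approximate):

static struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
				  struct vm_fault *vmf)
{
	struct swap_info_struct *si = swp_swap_info(entry);

	/*
	 * For synchronous devices (e.g. ZRAM) with a single swap
	 * reference, readahead buys nothing and the swap cache round
	 * trip only adds overhead, so read the folio in directly.
	 */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1)
		return swapin_direct(entry, gfp_mask, vmf);

	/* Otherwise go through the swap cache with VMA readahead. */
	return swap_vma_readahead(entry, gfp_mask, vmf);
}

The bloat-o-meter output above reflects this shape: swapin_entry is
added as the single entry point while the old swapin_readahead symbol
goes away.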
Kairui Song (7):
  mm/swapfile.c: add back some comment
  mm/swap: move no readahead swapin code to a stand-alone helper
  mm/swap: always account swapped in page into current memcg
  mm/swap: introduce swapin_entry for unified readahead policy
  mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  mm/swap, shmem: use unified swapin helper for shmem
  mm/swap: refactor swap_cache_get_folio

 include/linux/memcontrol.h |   4 +-
 mm/memcontrol.c            |   5 +-
 mm/memory.c                |  45 ++--------
 mm/shmem.c                 |  50 +++++++----
 mm/swap.h                  |  23 ++---
 mm/swap_state.c            | 176 ++++++++++++++++++++++++++-----------
 mm/swapfile.c              |  20 +++--
 7 files changed, 190 insertions(+), 133 deletions(-)

--
2.43.0