On Thu, Oct 24, 2024 at 11:08 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Kairui Song <ryncsn@xxxxxxxxx> writes:
>
> > On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Hi, Kairui,
> >
> > Hi Ying,
> >
> >> Kairui Song <ryncsn@xxxxxxxxx> writes:
> >>
> >> > From: Kairui Song <kasong@xxxxxxxxxxx>
> >> >
> >> > This series greatly improves swap allocator performance by reworking
> >> > the locking design and simplifying a lot of code paths.
> >> >
> >> > This is a follow-up to the previous swap cluster allocator series:
> >> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@xxxxxxxxxx/
> >> >
> >> > And this series is based on a follow-up fix for the swap cluster
> >> > allocator:
> >> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@xxxxxxxxx/
> >> >
> >> > This is part of the new swap allocator work item discussed in
> >> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and the
> >> > "mTHP and swap allocator" discussion at LPC 2024.
> >> >
> >> > The previous series introduced a fully cluster-based allocation
> >> > algorithm; this series completely gets rid of the old allocation path
> >> > and makes the allocator avoid grabbing si->lock unless needed. This
> >> > brings a huge performance gain and removes the slot cache on the
> >> > freeing path.
> >>
> >> Great!
> >>
> >> > Currently, swap locking is mainly composed of two locks: the cluster
> >> > lock (ci->lock) and the device lock (si->lock). The device lock is
> >> > widely used to protect many things, making it the main bottleneck
> >> > for SWAP.
> >>
> >> "Device lock" can be confused with the other device lock for struct
> >> device. Better to call it the swap device lock?
> >
> > Good idea, I'll use the term swap device lock then.
> >
> >> > The cluster lock is much more fine-grained, so it is best to use
> >> > ci->lock instead of si->lock as much as possible.
> >> >
> >> > `perf lock` indicates this issue clearly. Doing a Linux kernel build
> >> > on tmpfs with ZRAM as swap and limited memory (make -j64 with a 1G
> >> > memcg and 4k pages), the result of "perf lock contention -ab sleep 3" is:
> >> >
> >> >  contended   total wait     max wait     avg wait         type   caller
> >> >
> >> >      34948     53.63 s      7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
> >> >      16569     40.05 s      6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
> >> >      11191     28.41 s      7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
> >> >       4147     22.78 s    122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
> >> >       4595      7.17 s      6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
> >> >     406027      2.74 s      2.59 ms      6.74 us     spinlock   list_lru_add+0x39
> >> >  ...snip...
> >> >
> >> > The top 5 callers are all users of si->lock; their total wait time
> >> > sums up to several minutes within the 3-second window.
> >>
> >> Can you show the results of `perf record -g` / `perf report -g` too?
> >> I'm interested in checking the hot spot shifting too.
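
(For context, the memcg-limited build test quoted above can be set up roughly
as follows; the sizes, mount point and cgroup name here are illustrative
assumptions, not the exact harness that produced the numbers:

$ modprobe zram num_devices=1
$ echo 10G > /sys/block/zram0/disksize          # ZRAM swap device; size is an assumption
$ mkswap /dev/zram0
$ swapon /dev/zram0
$ mount -t tmpfs -o size=20G tmpfs /mnt/build   # kernel tree lives on tmpfs
$ mkdir /sys/fs/cgroup/swaptest
$ echo 1G > /sys/fs/cgroup/swaptest/memory.max  # cgroup v2 memory limit
$ echo $$ > /sys/fs/cgroup/swaptest/cgroup.procs
$ make -C /mnt/build -j64                       # assumes a configured defconfig tree
$ perf lock contention -ab sleep 3              # run alongside the build
$ perf record -a -g -- sleep 3 && perf report -g

)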
> > Sure. I think the `perf lock` result is already good enough and cleaner.
> > My test environment is mostly VM based, so the spinlock slow path may get
> > offloaded to the host and can't be seen by perf record. I collected the
> > following data after disabling paravirt spinlocks:
> >
> > The time consumption and stack trace of a page fault before:
> >
> > - 78.45%  0.17%  cc1  [kernel.kallsyms]  [k] asm_exc_page_fault
> >   - 78.28% asm_exc_page_fault
> >     - 78.18% exc_page_fault
> >       - 78.17% do_user_addr_fault
> >         - 78.09% handle_mm_fault
> >           - 78.06% __handle_mm_fault
> >             - 69.69% do_swap_page
> >               - 55.87% alloc_swap_folio
> >                 - 55.60% mem_cgroup_swapin_charge_folio
> >                   - 55.48% charge_memcg
> >                     - 55.45% try_charge_memcg
> >                       - 55.36% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 55.35% shrink_node
> >                             - 55.27% shrink_lruvec
> >                               - 55.13% try_to_shrink_lruvec
> >                                 - 54.79% evict_folios
> >                                   - 54.35% shrink_folio_list
> >                                     - 30.01% add_to_swap
> >                                       - 29.77% folio_alloc_swap
> >                                         - 29.50% get_swap_pages
> >                                              25.03% queued_spin_lock_slowpath
> >                                           - 2.71% alloc_swap_scan_cluster
> >                                                1.80% queued_spin_lock_slowpath
> >                                           + 0.89% __try_to_reclaim_swap
> >                                       - 1.74% swap_reclaim_full_clusters
> >                                            1.74% queued_spin_lock_slowpath
> >                                     - 10.88% try_to_unmap_flush_dirty
> >                                       - 10.87% arch_tlbbatch_flush
> >                                         - 10.85% on_each_cpu_cond_mask
> >                                              smp_call_function_many_cond
> >                                     + 7.45% pageout
> >                                     + 2.71% try_to_unmap_flush
> >                                     + 1.90% try_to_unmap
> >                                     + 0.78% folio_referenced
> >               - 9.41% cluster_swap_free_nr
> >                 - 9.39% free_swap_slot
> >                   - 9.35% swapcache_free_entries
> >                        8.40% queued_spin_lock_slowpath
> >                        0.93% swap_entry_range_free
> >               - 3.61% swap_read_folio_bdev_sync
> >                 - 3.55% submit_bio_wait
> >                   - 3.51% submit_bio_noacct_nocheck
> >                     + 3.46% __submit_bio
> >             + 7.71% do_pte_missing
> >             + 0.61% wp_page_copy
> >
> > The queued_spin_lock_slowpath above is the si->lock, and there are
> > multiple users of it, so the total overhead is higher than shown.
> >
> > After:
> >
> > - 75.05%  0.43%  cc1  [kernel.kallsyms]  [k] asm_exc_page_fault
> >   - 74.62% asm_exc_page_fault
> >     - 74.36% exc_page_fault
> >       - 74.34% do_user_addr_fault
> >         - 74.10% handle_mm_fault
> >           - 73.96% __handle_mm_fault
> >             - 67.55% do_swap_page
> >               - 45.92% alloc_swap_folio
> >                 - 45.03% mem_cgroup_swapin_charge_folio
> >                   - 44.58% charge_memcg
> >                     - 44.44% try_charge_memcg
> >                       - 44.12% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 44.10% shrink_node
> >                             - 43.86% shrink_lruvec
> >                               - 41.92% try_to_shrink_lruvec
> >                                 - 40.67% evict_folios
> >                                   - 37.12% shrink_folio_list
> >                                     - 20.88% pageout
> >                                       + 20.02% swap_writepage
> >                                       + 0.72% shmem_writepage
> >                                     - 4.08% add_to_swap
> >                                       - 2.48% folio_alloc_swap
> >                                         - 2.12% __mem_cgroup_try_charge_swap
> >                                           - 1.47% swap_cgroup_record
> >                                             + 1.32% _raw_spin_lock_irqsave
> >                                       - 1.56% add_to_swap_cache
> >                                         - 1.04% xas_store
> >                                           + 1.01% workingset_update_node
> >                                     + 3.97% try_to_unmap_flush_dirty
> >                                     + 3.51% folio_referenced
> >                                     + 2.24% __remove_mapping
> >                                     + 1.16% try_to_unmap
> >                                     + 0.52% try_to_unmap_flush
> >                                       2.50% queued_spin_lock_slowpath
> >                                       0.79% scan_folios
> >                                 + 1.20% try_to_inc_max_seq
> >                               + 1.92% lru_add_drain
> >                 + 0.73% vma_alloc_folio_noprof
> >               - 9.81% swap_read_folio_bdev_sync
> >                 - 9.61% submit_bio_wait
> >                   + 9.49% submit_bio_noacct_nocheck
> >               - 8.06% cluster_swap_free_nr
> >                 - 8.02% swap_entry_range_free
> >                   + 3.92% __mem_cgroup_uncharge_swap
> >                   + 2.90% zram_slot_free_notify
> >                     0.58% clear_shadow_from_swap_cache
> >               - 1.32% __folio_batch_add_and_move
> >                 - 1.30% folio_batch_move_lru
> >                   + 1.10% folio_lruvec_lock_irqsave
>
> Thanks for the data.
>
> It seems that the cycles shift from spinning to memory compression.
> That is expected.
>
> > spin_lock usage is much lower.
> >
> > I prefer the perf lock output as it shows the exact time and users of
> > the locks.
>
> The perf cycles data is more complete. You can find which parts become
> the new hot spots.
>
> >>
> >> > Following the new allocator design, many operations don't need to touch
> >> > si->lock at all. We only need to take si->lock when doing operations
> >> > across multiple clusters (e.g. changing the cluster lists); other
> >> > operations only need to take ci->lock. So ideally the allocator should
> >> > always take ci->lock first and then, if needed, take si->lock. But due
> >> > to historical reasons, ci->lock is used inside si->lock by design,
> >> > causing lock inversion if we simply try to acquire si->lock after
> >> > acquiring ci->lock.
> >> >
> >> > This series audits all si->lock usage, simplifies legacy code, and
> >> > eliminates usage of si->lock as much as possible by introducing new
> >> > designs based on the new cluster allocator.
> >> >
> >> > The old HDD allocation code is removed, the cluster allocator is adapted
> >> > with small changes for HDD usage, and testing looks OK.
> >>
> >> I think that it's a good idea to remove the HDD-specific allocation code.
> >> Can you check the performance of swapping to HDD? However, I understand
> >> that many people have no HDD at hand.
> >
> > It's not hard to make the cluster allocator work well with HDD in theory;
> > see the commit "mm, swap: use a global swap cluster for non-rotation
> > device".
> > The testing is not very reliable though: I found HDD swap performance
> > is very unstable because of the IO pattern of HDD, so it's just a best
> > effort try.
>
> Just to check whether the code change causes something too bad for HDD.
> No measurable difference is good news.
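
(Side note: whether a swap device ends up on the rotating-disk path can be
checked from sysfs; the device name below is only an illustration:

$ cat /sys/block/sda/queue/rotational
1

1 means the device is treated as rotational, 0 means non-rotational.)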
> >> > And this also removed the slot cache for the freeing path. The
> >> > performance is better without it, and this enables other cleanups and
> >> > optimizations as discussed before:
> >> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@xxxxxxxxxxxxxx/
> >> >
> >> > After this series, lock contention on si->lock is nearly unobservable
> >> > with `perf lock` in the same test as above:
> >> >
> >> >  contended   total wait     max wait     avg wait         type   caller
> >> >  ... snip ...
> >> >         91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
> >> >  ... snip ...
> >> >         47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
> >> >  ... snip ...
> >> >         23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
> >> >  ... snip ...
> >> >         17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
> >> >  ... snip ...
> >> >
> >> > cluster_move and cluster_isolate_lock are basically the only users of
> >> > si->lock now; the performance gain is huge, with reduced LOC.
> >> >
> >> > Tests
> >> > ===
> >> >
> >> > Build kernel with defconfig on tmpfs with ZRAM as swap:
> >> > ---
> >> >
> >> > Running a test matrix which is scaled up progressively for an intuitive
> >> > result. The tests are run on top of tmpfs, using a memory cgroup for
> >> > memory limitation, on a 48c96t system.
> >> >
> >> > 12 test runs for each case. It can be seen clearly that the performance
> >> > gain grows as the concurrent job number goes higher, and the performance
> >> > is better even with low concurrency.
> >> >
> >> > make -j<NR>       | System Time (seconds)    | Total Time (seconds)
> >> > (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> >> > With 4k pages only:
> >> > 6 / 192M / 3G     | 5258 / 5235 / -0.3%      | 1420 / 1414 / -0.3%
> >> > 12 / 256M / 4G    | 5518 / 5337 / -3.3%      | 758 / 742 / -2.1%
> >> > 24 / 384M / 5G    | 7091 / 5766 / -18.7%     | 476 / 422 / -11.3%
> >> > 48 / 768M / 7G    | 11139 / 5831 / -47.7%    | 330 / 221 / -33.0%
> >> > 96 / 1.5G / 10G   | 21303 / 11353 / -46.7%   | 283 / 180 / -36.4%
> >> > With 64k mTHP:
> >> > 24 / 512M / 5G    | 5104 / 4641 / -18.7%     | 376 / 358 / -4.8%
> >> > 48 / 1G / 7G      | 8693 / 4662 / -18.7%     | 257 / 176 / -31.5%
> >> > 96 / 2G / 10G     | 17056 / 10263 / -39.8%   | 234 / 169 / -27.8%
> >>
> >> How much is the swap in/out throughput before/after the change?
> >
> > This may not be too beneficial for a typical throughput measurement:
> > - For example, doing the same test with brd will only show a ~20%
> > performance improvement, still a big gain though. I think the si->lock
> > spinlock wasting CPU cycles may affect CPU-sensitive things like ZRAM
> > even more.
>
> 20% is good data. You don't need to guess; perf cycles profiling can
> show the hot spots.
>
> > - And simple benchmarks which just do sequential swap in/out in
> > multiple threads hardly stress the allocator.
> >
> > I haven't found a good benchmark to simulate random parallel IO on
> > SWAP yet; I can write one later.
>
> I have used the anon-w-rand test case of vm-scalability to simulate
> random parallel swap-out:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-rand
>
> > Benchmarks closer to real-world workloads, like the kernel build test
> > or mysql/sysbench, all showed great improvement.
>
> Yes. Real workloads are good. We can use micro-benchmarks to find out
> some performance limits, for example, the max possible throughput.
>
> >>
> >> When I worked on swap in/out performance before, the hot spots shifted
> >> from swap-related code to the LRU lock and zone lock. Things may change
> >> a lot now.
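
(The vm-scalability case mentioned above is usually driven by the run script
shipped in that repository; the invocation below is an assumption about its
typical usage rather than a verified command line:

$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git
$ cd vm-scalability
$ ./run case-anon-w-rand

)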
> >>
> >> If zram is used as the swap device, the hot spot may become
> >> compression/decompression after solving the swap lock contention. To
> >> stress the swap subsystem further, we may use a ram disk as swap.
> >> Previously, we have used a simulated pmem device (backed by DRAM). That
> >> can be set up as described in:
> >>
> >> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
> >>
> >> After creating the raw block device /dev/pmem0, we can do
> >>
> >> $ mkswap /dev/pmem0
> >> $ swapon /dev/pmem0
> >>
> >> Can you use something similar if necessary?
> >
> > I used to test with brd, as described above,
>
> brd will allocate memory while running; pmem can avoid that. A perf
> profile is your friend to root cause the possible issue.
>
> > I think using ZRAM with a test simulating a real workload is more useful.
>
> Yes. And, as I said before, micro-benchmarks have their own value.

Hi Ying,

Thank you very much for the suggestion. I didn't mean I'm against
micro-benchmarks in any way; it's just that a lot of effort was spent on
other tests, so I skipped that part for V1.

Since you mentioned vm-scalability, I think it's definitely a good idea to
include that test with the pmem simulation.

There are still some other bottlenecks for SWAP besides compression and
page fault / TLB handling, mostly the cgroup lock and the list_lru locks.
I have some ideas to optimize these too; they could be the next steps.

> > And I did include a sequential SWAP test; the result is looking OK (no
> > regression, minor to no improvement).
>
> Good. At least we have no regression here.
>
> --
> Best Regards,
> Huang, Ying