On Wed, Apr 14, 2021 at 7:36 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote: > > On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote: > > > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > Profiles would be interesting, because it sounds to me like reclaim > > > > > *might* be batching page cache removal better (e.g. fewer, larger > > > > > batches) and so spending less time contending on the mapping tree > > > > > lock... > > > > > > > > > > IOWs, I suspect this result might actually be a result of less lock > > > > > contention due to a change in batch processing characteristics of > > > > > the new algorithm rather than it being a "better" algorithm... > > > > > > > > I appreciate the profile. But there is no batching in > > > > __remove_mapping() -- it locks the mapping for each page, and > > > > therefore the lock contention penalizes the mainline and this patchset > > > > equally. It looks worse on your system because the four kswapd threads > > > > from different nodes were working on the same file. > > > > > > I think you misunderstand exactly what I mean by "batching" here. > > > I'm not talking about doing multiple pieces of work under a single > > > lock. What I mean is that the overall amount of work done in a > > > single reclaim scan (i.e a "reclaim batch") is packaged differently. > > > > > > We already batch up page reclaim via building a page list and then > > > passing it to shrink_page_list() to process the batch of pages in a > > > single pass. Each page in this page list batch then calls > > > remove_mapping() to pull the page form the LRU, we have a run of > > > contention between the foreground read() thread and the background > > > kswapd. > > > > > > If the size or nature of the pages in the batch passed to > > > shrink_page_list() changes, then the amount of time a reclaim batch > > > is going to put pressure on the mapping tree lock will also change. > > > That's the "change in batching behaviour" I'm referring to here. I > > > haven't read through the patchset to determine if you change the > > > shrink_page_list() algorithm, but it likely changes what is passed > > > to be reclaimed and that in turn changes the locking patterns that > > > fall out of shrink_page_list... > > > > Ok, if we are talking about the size of the batch passed to > > shrink_page_list(), both the mainline and this patchset cap it at > > SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when > > running fio/io_uring, it's safe to say both use 32. > > You're still looking at micro-scale behaviour, not the larger-scale > batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to > shrinker_page_list() at a different rate? > > When I say "batch of work" when talking about the page cache cycling > *500 thousand pages a second* through the cache, I'm not talking > about batches of 32 pages. I'm talking about the entire batch of > work kswapd does in an invocation cycle. > > Is it scanning 100k pages 10 times a second? or 10k pages a hundred > times a second? How long does a batch take to run? how long does is > sleep between processing batches? Is there any change in these > metrics as a result of the multi-gen LRU patches? Hi Dave, Sorry for the late reply. Still catching up on emails. Well, it doesn't really work that way. Yes, I agree that batching theoretically can have effects on the performance but the patchset doesn't change anything in this respect. The number of pages to reclaim is determined by a common code path shared between the existing implementation and this patchset. Specifically, ksawpd sets "sc->nr_to_reclaim" based on the high watermark, and passes "sc" down to both code path: static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) { <snipped> + if (lru_gen_enabled()) { + shrink_lru_gens(lruvec, sc); + return; + } <snipped> And there isn't really any new algorithm. It's just the old plain LRU. The improvement is purely from a feedback loop that helps avoid unnecessary activations and deactivations. By activations, I mean the following work done in the buffered io read path: generic_file_read_iter() filemap_read() mark_page_accessed() activate_page() Simply put, on the second accessed, the current implementation unconditionally moves a page from the inactive lru to the active lru, if it's not already there. And by deactivations, I mean the following work done in kswapd: kswapd_shrink_node() shrink_node() shrink_node_memcgs() shrink_lruvec() shrink_active_list() i.e., kswapd moves activated pages back to the inactive list before it can evict them. For random accesses, every page is equal and none is active. But how can we tell? The refault rates, i.e., evicted and then accessed again, for all pages are the same, no matter how many times they have been accessed. This is exactly how the feedback loop works. > Basically, we're looking at how access to the mapping lock is > changing the contention profile, and whether that is signficant or > not. I suspect it is, because when you have highly contended locks > and you do something external that reduces unrelated lock > contention, it's because that external thing is taking more time to > do and so there's less time to spend hitting locks hard... > > As such, I don't think this test is a good measure of the multi-gen > LRU patches at all - performance is dominated by the severity of > lock contention external to the LRU scanning algorithm, and it's > hard to infer anything through suck lock contention.... The lock contention you saw earlier is because of the four nodes system you used -- each node has a kswapd thread but there is only one file. The four kswapd threads keep banging on the same mapping lock. This can be easily fixed if we just run the same test with a single node *or* with multiple files. Here is what I got (it's basically what I've attached in the previous email). Before the patchset: kswapd # Children Self Symbol # ........ ........ ....................................... 100.00% 0.00% kswapd 99.90% 0.01% balance_pgdat 99.86% 0.05% shrink_node 98.00% 0.19% shrink_lruvec 91.13% 0.10% shrink_inactive_list 75.67% 5.97% shrink_page_list 57.65% 2.59% __remove_mapping 45.99% 0.27% __delete_from_page_cache 42.88% 0.88% page_cache_delete 39.05% 1.04% xas_store 37.71% 37.67% xas_create 12.62% 11.79% isolate_lru_pages 8.52% 0.98% free_unref_page_list 7.38% 0.61% free_unref_page_commit 6.68% 1.78% free_pcppages_bulk 6.53% 0.83% ***shrink_active_list*** 6.45% 3.21% _raw_spin_lock_irqsave 4.82% 4.60% __free_one_page 4.58% 4.21% unlock_page 3.62% 3.55% native_queued_spin_lock_slowpath 2.49% 0.71% unaccount_page_cache_page 2.46% 0.81% workingset_eviction 2.14% 0.33% __mod_lruvec_state 1.97% 1.88% xas_clear_mark 1.73% 0.26% __mod_lruvec_page_state 1.71% 1.06% move_pages_to_lru 1.66% 1.62% workingset_age_nonresident 1.60% 0.85% __mod_memcg_lruvec_state 1.58% 0.02% shrink_slab 1.49% 0.13% mem_cgroup_uncharge_list 1.45% 0.06% do_shrink_slab 1.37% 1.32% page_mapping 1.06% 0.76% count_shadow_nodes io_uring # Children Self Symbol # ........ ........ ........................................ 99.22% 0.00% do_syscall_64 94.48% 0.22% __io_queue_sqe 94.09% 0.31% io_issue_sqe 93.33% 0.62% io_read 88.57% 0.93% blkdev_read_iter 87.80% 0.15% io_iter_do_read 87.57% 0.16% generic_file_read_iter 87.35% 1.08% filemap_read 82.44% 0.01% __x64_sys_io_uring_enter 82.34% 0.01% __do_sys_io_uring_enter 82.09% 0.47% io_submit_sqes 79.50% 0.08% io_queue_sqe 71.47% 1.08% filemap_get_pages 53.08% 0.35% ondemand_readahead 51.74% 1.49% page_cache_ra_unbounded 49.92% 0.13% page_cache_sync_ra 21.26% 0.63% add_to_page_cache_lru 17.08% 0.02% task_work_run 16.98% 0.03% exit_to_user_mode_prepare 16.94% 10.52% __add_to_page_cache_locked 16.94% 0.06% tctx_task_work 16.79% 0.19% syscall_exit_to_user_mode 16.41% 2.61% filemap_get_read_batch 16.08% 15.72% xas_load 15.99% 0.15% read_pages 15.87% 0.13% blkdev_readahead 15.72% 0.56% mpage_readahead 15.58% 0.09% io_req_task_submit 15.37% 0.07% __io_req_task_submit 12.14% 0.15% copy_page_to_iter 12.03% 0.03% ***__page_cache_alloc*** 11.98% 0.14% alloc_pages_current 11.66% 0.30% __alloc_pages_nodemask 11.28% 11.05% _copy_to_iter 11.06% 3.16% get_page_from_freelist 10.53% 0.10% submit_bio 10.27% 0.21% submit_bio_noacct 8.30% 0.51% blk_mq_submit_bio 7.73% 7.62% clear_page_erms 3.68% 1.02% do_mpage_readpage 3.33% 0.01% page_cache_async_ra 3.25% 0.01% blk_flush_plug_list 3.24% 0.06% blk_mq_flush_plug_list 3.18% 0.01% blk_mq_sched_insert_requests 3.15% 0.13% blk_mq_try_issue_list_directly 2.84% 0.15% __blk_mq_try_issue_directly 2.69% 0.58% nvme_queue_rq 2.54% 1.05% blk_attempt_plug_merge 2.53% 0.44% ***mark_page_accessed*** 2.36% 0.12% rw_verify_area 2.16% 0.09% mpage_alloc 2.10% 0.27% lru_cache_add 1.81% 0.27% security_file_permission 1.75% 0.87% __pagevec_lru_add 1.70% 0.63% _raw_spin_lock_irq 1.53% 0.40% xa_get_order 1.52% 0.15% __blk_mq_alloc_request 1.50% 0.65% workingset_refault 1.50% 0.07% activate_page 1.48% 0.29% io_submit_flush_completions 1.46% 0.32% bio_alloc_bioset 1.42% 0.15% xa_load 1.40% 0.16% pagevec_lru_move_fn 1.39% 0.06% blk_finish_plug 1.35% 0.28% submit_bio_checks 1.26% 0.02% asm_common_interrupt 1.25% 0.01% common_interrupt 1.21% 1.21% native_queued_spin_lock_slowpath 1.18% 0.10% mempool_alloc 1.17% 0.00% __common_interrupt 1.14% 0.03% handle_edge_irq 1.08% 0.87% apparmor_file_permission 1.03% 0.19% __mod_lruvec_state 1.02% 0.65% blk_rq_merge_ok 1.01% 0.94% xas_start After the patchset: kswapd # Children Self Symbol # ........ ........ ........................................ 100.00% 0.00% kswapd 99.92% 0.11% balance_pgdat 99.32% 0.03% shrink_node 97.25% 0.32% shrink_lruvec 96.80% 0.09% evict_lru_gen_pages 77.82% 6.28% shrink_page_list 61.61% 2.76% __remove_mapping 50.28% 0.33% __delete_from_page_cache 46.63% 1.08% page_cache_delete 42.20% 1.16% xas_store 40.71% 40.67% xas_create 12.54% 7.76% isolate_lru_gen_pages 6.42% 3.19% _raw_spin_lock_irqsave 6.15% 0.91% free_unref_page_list 5.62% 5.45% unlock_page 5.05% 0.59% free_unref_page_commit 4.35% 2.04% lru_gen_update_size 4.31% 1.41% free_pcppages_bulk 3.43% 3.36% native_queued_spin_lock_slowpath 3.38% 0.59% __mod_lruvec_state 2.97% 0.78% unaccount_page_cache_page 2.82% 2.52% __free_one_page 2.33% 1.18% __mod_memcg_lruvec_state 2.28% 2.17% xas_clear_mark 2.13% 0.30% __mod_lruvec_page_state 1.88% 0.04% shrink_slab 1.82% 1.78% workingset_eviction 1.74% 0.06% do_shrink_slab 1.70% 0.15% mem_cgroup_uncharge_list 1.39% 1.01% count_shadow_nodes 1.22% 1.18% __mod_memcg_state.part.0 1.16% 1.11% page_mapping 1.02% 0.98% xas_init_marks io_uring # Children Self Symbol # ........ ........ ........................................ 99.19% 0.01% entry_SYSCALL_64_after_hwframe 99.16% 0.00% do_syscall_64 94.78% 0.18% __io_queue_sqe 94.41% 0.25% io_issue_sqe 93.60% 0.48% io_read 89.35% 0.96% blkdev_read_iter 88.44% 0.12% io_iter_do_read 88.25% 0.16% generic_file_read_iter 88.00% 1.20% filemap_read 84.01% 0.01% __x64_sys_io_uring_enter 83.91% 0.01% __do_sys_io_uring_enter 83.74% 0.37% io_submit_sqes 81.28% 0.07% io_queue_sqe 74.65% 0.96% filemap_get_pages 55.92% 0.35% ondemand_readahead 54.57% 1.34% page_cache_ra_unbounded 51.57% 0.12% page_cache_sync_ra 24.14% 0.51% add_to_page_cache_lru 19.04% 11.51% __add_to_page_cache_locked 18.48% 0.13% read_pages 18.42% 0.18% blkdev_readahead 18.20% 0.55% mpage_readahead 16.81% 2.31% filemap_get_read_batch 16.37% 14.83% xas_load 15.40% 0.02% task_work_run 15.38% 0.03% exit_to_user_mode_prepare 15.31% 0.05% tctx_task_work 15.14% 0.15% syscall_exit_to_user_mode 14.05% 0.04% io_req_task_submit 13.86% 0.05% __io_req_task_submit 12.92% 0.12% submit_bio 11.40% 0.13% copy_page_to_iter 10.65% 9.61% _copy_to_iter 9.45% 0.03% ***__page_cache_alloc*** 9.42% 0.16% submit_bio_noacct 9.40% 0.11% alloc_pages_current 9.11% 0.30% __alloc_pages_nodemask 8.53% 1.81% get_page_from_freelist 8.38% 0.10% asm_common_interrupt 8.26% 0.06% common_interrupt 7.75% 0.05% __common_interrupt 7.62% 0.44% blk_mq_submit_bio 7.56% 0.20% handle_edge_irq 6.45% 5.90% clear_page_erms 5.25% 0.10% handle_irq_event 4.88% 0.19% nvme_irq 4.83% 0.07% __handle_irq_event_percpu 4.73% 0.52% nvme_process_cq 4.52% 0.01% page_cache_async_ra 4.00% 0.04% nvme_pci_complete_rq 3.82% 0.04% nvme_complete_rq 3.76% 1.11% do_mpage_readpage 3.74% 0.06% blk_mq_end_request 3.03% 0.01% blk_flush_plug_list 3.02% 0.06% blk_mq_flush_plug_list 2.96% 0.01% blk_mq_sched_insert_requests 2.94% 0.10% blk_mq_try_issue_list_directly 2.89% 0.00% __irqentry_text_start 2.71% 0.41% psi_task_change 2.67% 0.21% lru_cache_add 2.65% 0.14% __blk_mq_try_issue_directly 2.53% 0.54% nvme_queue_rq 2.43% 0.17% blk_update_request 2.42% 0.58% __pagevec_lru_add 2.29% 1.42% psi_group_change 2.22% 0.85% blk_attempt_plug_merge 2.14% 0.04% bio_endio 2.13% 0.11% rw_verify_area 2.08% 0.18% mpage_end_io 2.01% 0.08% mpage_alloc 1.71% 0.56% _raw_spin_lock_irq 1.65% 0.98% workingset_refault 1.65% 0.09% psi_memstall_leave 1.64% 0.20% security_file_permission 1.61% 1.59% _raw_spin_lock 1.58% 0.08% psi_memstall_enter 1.44% 0.37% xa_get_order 1.43% 0.13% __blk_mq_alloc_request 1.37% 0.26% io_submit_flush_completions 1.36% 0.14% xa_load 1.34% 0.31% bio_alloc_bioset 1.31% 0.26% submit_bio_checks 1.29% 0.04% blk_finish_plug 1.28% 1.27% native_queued_spin_lock_slowpath 1.24% 0.19% page_endio 1.13% 0.10% unlock_page 1.09% 0.99% read_tsc 1.07% 0.74% lru_gen_addition 1.03% 0.09% mempool_alloc 1.02% 0.13% wake_up_page_bit 1.02% 0.92% xas_start 1.01% 0.78% apparmor_file_permission By comparing the two sets, we can clearly see what's changed: Before the patchset: 6.53% 0.83% ***shrink_active_list*** 12.03% 0.03% ***__page_cache_alloc*** 2.53% 0.44% ***mark_page_accessed*** After the patchset: 9.45% 0.03% ***__page_cache_alloc*** (There are shrink_active_list() or mark_page_accessed() since we don't activate and deactivate pages anymore, for this test case.) Hopefully this is clear enough. But I do see where your skepticism comes from and I don't want to dismiss it out of hand. So if you have any other benchmarks, I'd be happy to try them. What do you think? Thanks.