Hi Yu Zhao,

On Tue, Dec 24, 2024 at 12:04:44PM -0700, Yu Zhao wrote:
> On Mon, Dec 23, 2024 at 04:44:44PM +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 5.7% regression of will-it-scale.per_process_ops on:
>
> Thanks, Oliver!
>
> > commit: 3b7734aa8458b62ecbfd785ca7918e831565006e ("[PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection")
> > url: https://github.com/intel-lab-lkp/linux/commits/Yu-Zhao/mm-mglru-clean-up-workingset/20241208-061714
> > base: v6.13-rc1
> > patch link: https://lore.kernel.org/all/20241207221522.2250311-7-yuzhao@xxxxxxxxxx/
> > patch subject: [PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection
> >
> > testcase: will-it-scale
> > config: x86_64-rhel-9.4
> > compiler: gcc-12
> > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > parameters:
> >
> > 	nr_task: 100%
> > 	mode: process
> > 	test: pread2
> > 	cpufreq_governor: performance
>
> I think this is very likely caused by my change to folio_mark_accessed()
> that unnecessarily dirties cache lines shared between different cores.
>
> Could you try the following fix please?

Yes, this patch fully recovers the performance (see (1) below). Thanks!

Tested-by: kernel test robot <oliver.sang@xxxxxxxxx>

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-9.4/process/100%/debian-12-x86_64-20240206.cgz/lkp-skl-fpga01/pread2/will-it-scale

commit:
  4a202aca7c ("mm/mglru: rework refault detection")
  3b7734aa84 ("mm/mglru: rework workingset protection")
  c5346da9fe  <-- fix patch from you

4a202aca7c7d9f99 3b7734aa8458b62ecbfd785ca79 c5346da9fe00d3b303057d93fd9
---------------- --------------------------- ---------------------------
         %stddev    %change          %stddev    %change          %stddev
             \          |                \          |                \
      1.03 ±  3%       -0.1       0.92 ±  5%       -0.0       0.98 ±  6%  mpstat.cpu.all.usr%
    144371            -0.5%     143667 ±  2%      -2.0%     141486        vmstat.system.in
    335982           -60.7%     132060 ± 15%     -61.7%     128640 ± 14%  proc-vmstat.nr_active_anon
    335982           -60.7%     132060 ± 15%     -61.7%     128640 ± 14%  proc-vmstat.nr_zone_active_anon
   1343709           -60.7%     528460 ± 15%     -61.7%     514494 ± 14%  meminfo.Active
   1343709           -60.7%     528460 ± 15%     -61.7%     514494 ± 14%  meminfo.Active(anon)
    259.96        +3.2e+05%     821511 ± 11%  +3.2e+05%     829732 ±  9%  meminfo.Inactive
   1401961            -5.7%    1321692 ±  2%      -0.1%    1399905        will-it-scale.104.processes
     13479            -5.7%      12708 ±  2%      -0.1%      13460        will-it-scale.per_process_ops  <----- (1)
   1401961            -5.7%    1321692 ±  2%      -0.1%    1399905        will-it-scale.workload
    138691 ± 43%     -75.8%      33574 ± 55%     -54.9%      62588 ± 61%  numa-vmstat.node0.nr_active_anon
    138691 ± 43%     -75.8%      33574 ± 55%     -54.9%      62588 ± 61%  numa-vmstat.node0.nr_zone_active_anon
    197311 ± 30%     -50.1%      98494 ± 18%     -66.5%      66034 ± 50%  numa-vmstat.node1.nr_active_anon
    197311 ± 30%     -50.1%      98494 ± 18%     -66.5%      66034 ± 50%  numa-vmstat.node1.nr_zone_active_anon
      0.29 ± 14%     +20.8%       0.35 ±  7%     -14.6%       0.25 ± 31%  perf-sched.sch_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      1.02 ± 21%     +50.7%       1.54 ± 23%     -10.2%       0.92 ± 19%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    476.63 ± 10%     -12.7%     415.87 ± 28%     -31.2%     327.79 ± 35%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
    476.50 ± 10%     -12.7%     415.80 ± 28%     -31.2%     327.69 ± 35%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
    554600 ± 43%     -75.8%     134360 ± 55%     -54.8%     250416 ± 61%  numa-meminfo.node0.Active
    554600 ± 43%     -75.8%     134360 ± 55%     -54.8%     250416 ± 61%  numa-meminfo.node0.Active(anon)
    173.31 ± 70%  +1.4e+05%     247821 ± 50%  +1.9e+05%     338038 ± 45%  numa-meminfo.node0.Inactive
    789291 ± 30%     -50.1%     394029 ± 18%     -66.5%     264180 ± 50%  numa-meminfo.node1.Active
    789291 ± 30%     -50.1%     394029 ± 18%     -66.5%     264180 ± 50%  numa-meminfo.node1.Active(anon)
     86.66 ±141%  +6.6e+05%     573998 ± 27%  +5.7e+05%     491639 ± 33%  numa-meminfo.node1.Inactive
 2.657e+09            -2.2%  2.598e+09 ±  2%      -2.4%  2.592e+09 ±  2%  perf-stat.i.branch-instructions
 1.156e+10            -2.3%   1.13e+10 ±  2%      -2.5%  1.127e+10 ±  2%  perf-stat.i.instructions
      0.01 ± 50%     -66.9%       0.00 ± 82%     -72.9%       0.00 ±110%  perf-stat.i.major-faults
 2.648e+09           -18.7%  2.152e+09 ± 44%      -2.4%  2.584e+09 ±  2%  perf-stat.ps.branch-instructions
 1.152e+10           -18.8%  9.358e+09 ± 44%      -2.5%  1.123e+10 ±  2%  perf-stat.ps.instructions
      0.01 ± 50%     -73.6%       0.00 ±112%     -72.8%       0.00 ±110%  perf-stat.ps.major-faults
     38.95             -0.9      38.09             +0.0      38.96        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read
     38.83             -0.9      37.97             +0.0      38.84        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter
     39.70             -0.8      38.86             +0.0      39.71        perf-profile.calltrace.cycles-pp.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64
     41.03             -0.8      40.26             +0.0      41.04        perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64.do_syscall_64
      0.91             +0.0       0.95             -0.0       0.91 ±  2%  perf-profile.calltrace.cycles-pp.filemap_get_entry.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64
     53.14             +0.5      53.66             -0.0      53.13        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_wake_bit.shmem_file_read_iter.vfs_read
     53.24             +0.5      53.76             -0.0      53.23        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_wake_bit.shmem_file_read_iter.vfs_read.__x64_sys_pread64
     53.84             +0.5      54.38             -0.0      53.82        perf-profile.calltrace.cycles-pp.folio_wake_bit.shmem_file_read_iter.vfs_read.__x64_sys_pread64.do_syscall_64
     38.96             -0.9      38.09             +0.0      38.96        perf-profile.children.cycles-pp._raw_spin_lock_irq
     39.71             -0.8      38.87             +0.0      39.72        perf-profile.children.cycles-pp.folio_wait_bit_common
     41.04             -0.8      40.26             +0.0      41.05        perf-profile.children.cycles-pp.shmem_get_folio_gfp
     92.00             -0.3      91.67             -0.0      92.00        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.22             -0.0       0.18 ±  3%       -0.0       0.22 ±  3%  perf-profile.children.cycles-pp._copy_to_iter
      0.22 ±  2%       -0.0       0.19 ±  2%       -0.0       0.22 ±  2%  perf-profile.children.cycles-pp.copy_page_to_iter
      0.20 ±  2%       -0.0       0.16 ±  4%       -0.0       0.19 ±  2%  perf-profile.children.cycles-pp.rep_movs_alternative
      0.91             +0.0       0.96             -0.0       0.91 ±  2%  perf-profile.children.cycles-pp.filemap_get_entry
      0.00             +0.3       0.35             +0.0       0.01 ±299%  perf-profile.children.cycles-pp.folio_mark_accessed
     53.27             +0.5      53.80             -0.0      53.26        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     53.86             +0.5      54.40             -0.0      53.84        perf-profile.children.cycles-pp.folio_wake_bit
     92.00             -0.3      91.67             -0.0      92.00        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.19             -0.0       0.16 ±  3%       +0.0       0.19 ±  2%  perf-profile.self.cycles-pp.rep_movs_alternative
      0.41             +0.0       0.44             +0.0       0.41 ±  3%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.37 ±  2%       +0.0       0.40             +0.0       0.38 ±  2%  perf-profile.self.cycles-pp.folio_wait_bit_common
      0.90             +0.0       0.94             -0.0       0.90 ±  2%  perf-profile.self.cycles-pp.filemap_get_entry
      0.61             +0.1       0.68             +0.0       0.61 ±  2%  perf-profile.self.cycles-pp.shmem_file_read_iter
      0.00             +0.3       0.34 ±  2%       +0.0       0.00        perf-profile.self.cycles-pp.folio_mark_accessed

>
> diff --git a/mm/swap.c b/mm/swap.c
> index 062c8565b899..54bce14fef30 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -395,7 +395,8 @@ static void lru_gen_inc_refs(struct folio *folio)
>
>  	do {
>  		if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) {
> -			folio_set_workingset(folio);
> +			if (!folio_test_workingset(folio))
> +				folio_set_workingset(folio);
>  			return;
>  		}
>
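
For anyone reading along, the test-before-set pattern in the fix above can be illustrated with a small userspace sketch. This is only an analogy under stated assumptions: it uses C11 atomics rather than the kernel's folio flag helpers, and DEMO_WORKINGSET, demo_set_flag() and demo_set_flag_if_clear() are hypothetical names made up for this example. The idea is that an atomic read-modify-write generally needs exclusive ownership of the cache line even when the bit is already set, whereas a plain load can leave the line shared between cores.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit for illustration; not the kernel's PG_workingset. */
#define DEMO_WORKINGSET (1UL << 0)

/*
 * Unconditional set: always issues an atomic read-modify-write, so the
 * cache line is written even when the bit is already set.
 */
static inline void demo_set_flag(atomic_ulong *flags)
{
	atomic_fetch_or_explicit(flags, DEMO_WORKINGSET, memory_order_relaxed);
}

/*
 * Test before set: a read-only check first, and the read-modify-write
 * only when the bit is still clear. Returns true if a write was issued.
 */
static inline bool demo_set_flag_if_clear(atomic_ulong *flags)
{
	if (atomic_load_explicit(flags, memory_order_relaxed) & DEMO_WORKINGSET)
		return false;	/* already set: no store, line can stay shared */

	atomic_fetch_or_explicit(flags, DEMO_WORKINGSET, memory_order_relaxed);
	return true;
}
```

Calling demo_set_flag_if_clear() twice on a zero-initialized atomic_ulong writes only on the first call; the second call takes the read-only path. That corresponds to the steady state in the pread2 workload, where many processes repeatedly mark the same already-workingset folio as accessed.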