Re: [PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection

Oliver Sang <oliver.sang@xxxxxxxxx> · Thu, 26 Dec 2024 10:51:16 +0800

Hi Yu Zhao,

On Tue, Dec 24, 2024 at 12:04:44PM -0700, Yu Zhao wrote:
> On Mon, Dec 23, 2024 at 04:44:44PM +0800, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed a 5.7% regression of will-it-scale.per_process_ops on:
> 
> Thanks, Oliver!
> 
> > commit: 3b7734aa8458b62ecbfd785ca7918e831565006e ("[PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection")
> > url: https://github.com/intel-lab-lkp/linux/commits/Yu-Zhao/mm-mglru-clean-up-workingset/20241208-061714
> > base: v6.13-rc1
> > patch link: https://lore.kernel.org/all/20241207221522.2250311-7-yuzhao@xxxxxxxxxx/
> > patch subject: [PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection
> > 
> > testcase: will-it-scale
> > config: x86_64-rhel-9.4
> > compiler: gcc-12
> > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > parameters:
> > 
> > 	nr_task: 100%
> > 	mode: process
> > 	test: pread2
> > 	cpufreq_governor: performance
> 
> I think this is very likely caused by my change to folio_mark_accessed()
> that unnecessarily dirties cache lines shared between different cores.
> 
> Could you try the following fix please?

Yes, this patch fully recovers the performance; see (1) below. Thanks!

Tested-by: kernel test robot <oliver.sang@xxxxxxxxx>

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-9.4/process/100%/debian-12-x86_64-20240206.cgz/lkp-skl-fpga01/pread2/will-it-scale

commit:
  4a202aca7c ("mm/mglru: rework refault detection")
  3b7734aa84 ("mm/mglru: rework workingset protection")
  c5346da9fe  <-- your fix patch

4a202aca7c7d9f99 3b7734aa8458b62ecbfd785ca79 c5346da9fe00d3b303057d93fd9
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
      1.03 ±  3%      -0.1        0.92 ±  5%      -0.0        0.98 ±  6%  mpstat.cpu.all.usr%
    144371            -0.5%     143667 ±  2%      -2.0%     141486        vmstat.system.in
    335982           -60.7%     132060 ± 15%     -61.7%     128640 ± 14%  proc-vmstat.nr_active_anon
    335982           -60.7%     132060 ± 15%     -61.7%     128640 ± 14%  proc-vmstat.nr_zone_active_anon
   1343709           -60.7%     528460 ± 15%     -61.7%     514494 ± 14%  meminfo.Active
   1343709           -60.7%     528460 ± 15%     -61.7%     514494 ± 14%  meminfo.Active(anon)
    259.96        +3.2e+05%     821511 ± 11%  +3.2e+05%     829732 ±  9%  meminfo.Inactive
   1401961            -5.7%    1321692 ±  2%      -0.1%    1399905        will-it-scale.104.processes
     13479            -5.7%      12708 ±  2%      -0.1%      13460        will-it-scale.per_process_ops    <----- (1)
   1401961            -5.7%    1321692 ±  2%      -0.1%    1399905        will-it-scale.workload
    138691 ± 43%     -75.8%      33574 ± 55%     -54.9%      62588 ± 61%  numa-vmstat.node0.nr_active_anon
    138691 ± 43%     -75.8%      33574 ± 55%     -54.9%      62588 ± 61%  numa-vmstat.node0.nr_zone_active_anon
    197311 ± 30%     -50.1%      98494 ± 18%     -66.5%      66034 ± 50%  numa-vmstat.node1.nr_active_anon
    197311 ± 30%     -50.1%      98494 ± 18%     -66.5%      66034 ± 50%  numa-vmstat.node1.nr_zone_active_anon
      0.29 ± 14%     +20.8%       0.35 ±  7%     -14.6%       0.25 ± 31%  perf-sched.sch_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      1.02 ± 21%     +50.7%       1.54 ± 23%     -10.2%       0.92 ± 19%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    476.63 ± 10%     -12.7%     415.87 ± 28%     -31.2%     327.79 ± 35%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
    476.50 ± 10%     -12.7%     415.80 ± 28%     -31.2%     327.69 ± 35%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
    554600 ± 43%     -75.8%     134360 ± 55%     -54.8%     250416 ± 61%  numa-meminfo.node0.Active
    554600 ± 43%     -75.8%     134360 ± 55%     -54.8%     250416 ± 61%  numa-meminfo.node0.Active(anon)
    173.31 ± 70%  +1.4e+05%     247821 ± 50%  +1.9e+05%     338038 ± 45%  numa-meminfo.node0.Inactive
    789291 ± 30%     -50.1%     394029 ± 18%     -66.5%     264180 ± 50%  numa-meminfo.node1.Active
    789291 ± 30%     -50.1%     394029 ± 18%     -66.5%     264180 ± 50%  numa-meminfo.node1.Active(anon)
     86.66 ±141%  +6.6e+05%     573998 ± 27%  +5.7e+05%     491639 ± 33%  numa-meminfo.node1.Inactive
 2.657e+09            -2.2%  2.598e+09 ±  2%      -2.4%  2.592e+09 ±  2%  perf-stat.i.branch-instructions
 1.156e+10            -2.3%   1.13e+10 ±  2%      -2.5%  1.127e+10 ±  2%  perf-stat.i.instructions
      0.01 ± 50%     -66.9%       0.00 ± 82%     -72.9%       0.00 ±110%  perf-stat.i.major-faults
 2.648e+09           -18.7%  2.152e+09 ± 44%      -2.4%  2.584e+09 ±  2%  perf-stat.ps.branch-instructions
 1.152e+10           -18.8%  9.358e+09 ± 44%      -2.5%  1.123e+10 ±  2%  perf-stat.ps.instructions
      0.01 ± 50%     -73.6%       0.00 ±112%     -72.8%       0.00 ±110%  perf-stat.ps.major-faults
     38.95            -0.9       38.09            +0.0       38.96        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read
     38.83            -0.9       37.97            +0.0       38.84        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter
     39.70            -0.8       38.86            +0.0       39.71        perf-profile.calltrace.cycles-pp.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64
     41.03            -0.8       40.26            +0.0       41.04        perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64.do_syscall_64
      0.91            +0.0        0.95            -0.0        0.91 ±  2%  perf-profile.calltrace.cycles-pp.filemap_get_entry.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64
     53.14            +0.5       53.66            -0.0       53.13        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_wake_bit.shmem_file_read_iter.vfs_read
     53.24            +0.5       53.76            -0.0       53.23        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_wake_bit.shmem_file_read_iter.vfs_read.__x64_sys_pread64
     53.84            +0.5       54.38            -0.0       53.82        perf-profile.calltrace.cycles-pp.folio_wake_bit.shmem_file_read_iter.vfs_read.__x64_sys_pread64.do_syscall_64
     38.96            -0.9       38.09            +0.0       38.96        perf-profile.children.cycles-pp._raw_spin_lock_irq
     39.71            -0.8       38.87            +0.0       39.72        perf-profile.children.cycles-pp.folio_wait_bit_common
     41.04            -0.8       40.26            +0.0       41.05        perf-profile.children.cycles-pp.shmem_get_folio_gfp
     92.00            -0.3       91.67            -0.0       92.00        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.22            -0.0        0.18 ±  3%      -0.0        0.22 ±  3%  perf-profile.children.cycles-pp._copy_to_iter
      0.22 ±  2%      -0.0        0.19 ±  2%      -0.0        0.22 ±  2%  perf-profile.children.cycles-pp.copy_page_to_iter
      0.20 ±  2%      -0.0        0.16 ±  4%      -0.0        0.19 ±  2%  perf-profile.children.cycles-pp.rep_movs_alternative
      0.91            +0.0        0.96            -0.0        0.91 ±  2%  perf-profile.children.cycles-pp.filemap_get_entry
      0.00            +0.3        0.35            +0.0        0.01 ±299%  perf-profile.children.cycles-pp.folio_mark_accessed
     53.27            +0.5       53.80            -0.0       53.26        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     53.86            +0.5       54.40            -0.0       53.84        perf-profile.children.cycles-pp.folio_wake_bit
     92.00            -0.3       91.67            -0.0       92.00        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.19            -0.0        0.16 ±  3%      +0.0        0.19 ±  2%  perf-profile.self.cycles-pp.rep_movs_alternative
      0.41            +0.0        0.44            +0.0        0.41 ±  3%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.37 ±  2%      +0.0        0.40            +0.0        0.38 ±  2%  perf-profile.self.cycles-pp.folio_wait_bit_common
      0.90            +0.0        0.94            -0.0        0.90 ±  2%  perf-profile.self.cycles-pp.filemap_get_entry
      0.61            +0.1        0.68            +0.0        0.61 ±  2%  perf-profile.self.cycles-pp.shmem_file_read_iter
      0.00            +0.3        0.34 ±  2%      +0.0        0.00        perf-profile.self.cycles-pp.folio_mark_accessed

> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 062c8565b899..54bce14fef30 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -395,7 +395,8 @@ static void lru_gen_inc_refs(struct folio *folio)
>  
>  	do {
>  		if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) {
> -			folio_set_workingset(folio);
> +			if (!folio_test_workingset(folio))
> +				folio_set_workingset(folio);
>  			return;
>  		}
>
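
For readers following along, the idea behind the quoted fix can be sketched in
userspace as a test-before-set pattern: read the flag first and only issue the
atomic read-modify-write when the bit is still clear, so that a cache line
shared between cores is not dirtied on every call. This is only an
illustration under stated assumptions. The names fake_folio, FLAG_WORKINGSET
and nr_writes are hypothetical, and C11 atomics stand in for the kernel's
folio flag helpers; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; not the kernel's PG_workingset. */
#define FLAG_WORKINGSET (1UL << 0)

/* Hypothetical stand-in for a folio; only models a flags word. */
struct fake_folio {
	_Atomic unsigned long flags;
	unsigned long nr_writes;	/* counts RMW operations, illustration only */
};

/* Plain load: the cache line can stay in shared state across cores. */
static bool test_flag(struct fake_folio *f)
{
	return (atomic_load_explicit(&f->flags, memory_order_relaxed) &
		FLAG_WORKINGSET) != 0;
}

/* Atomic RMW: needs exclusive ownership of the line even if the bit is set. */
static void set_flag(struct fake_folio *f)
{
	assert(f != NULL);
	atomic_fetch_or_explicit(&f->flags, FLAG_WORKINGSET,
				 memory_order_relaxed);
	f->nr_writes++;
}

/* Always writes, as in the code before the fix. */
static void mark_unconditional(struct fake_folio *f)
{
	set_flag(f);
}

/* Writes only when the bit is still clear, as in the fix. */
static void mark_test_first(struct fake_folio *f)
{
	if (!test_flag(f))
		set_flag(f);
}
```

Calling mark_test_first() repeatedly on the same object performs the write only
once, while mark_unconditional() performs it on every call. The check and the
set are not atomic together, but for an idempotent flag a rare duplicate set
is harmless.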