Hi, Lorenzo Stoakes, sorry for the late reply; we were on holiday last week.

On Mon, Sep 30, 2024 at 09:21:52AM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 30, 2024 at 10:21:27AM GMT, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -5.0% regression of aim9.brk_test.ops_per_sec on:
> >
> >
> > commit: cacded5e42b9609b07b22d80c10f0076d439f7d1 ("mm: avoid using vma_merge() for new VMAs")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > testcase: aim9
> > test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 64G memory
>
> Hm, quite an old microarchitecture no?
>
> Would it be possible to try this on a range of uarch's, especially more
> recent ones, with some repeated runs to rule out statistical noise? Much
> appreciated!

We ran this test on the platforms below and observed a similar regression.
One thing I want to mention: for performance tests, we run each commit at
least 6 times. For this aim9 test the data is quite stable, so there is no
%stddev value in our table.
We won't show this value if it's <2%.

(1) model: Granite Rapids
    nr_node: 1
    nr_cpu: 240
    memory: 192G

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-gnr-1ap1/brk_test/aim9/300s

fc21959f74bc1138            cacded5e42b9609b07b22d80c10
----------------            ---------------------------
         %stddev     %change         %stddev
             \          |                \
   3220697            -6.0%    3028867        aim9.brk_test.ops_per_sec

(2) model: Emerald Rapids
    nr_node: 4
    nr_cpu: 256
    memory: 256G
    brand: INTEL(R) XEON(R) PLATINUM 8592+

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-emr-2sp1/brk_test/aim9/300s

fc21959f74bc1138            cacded5e42b9609b07b22d80c10
----------------            ---------------------------
         %stddev     %change         %stddev
             \          |                \
   3669298            -6.5%    3430070        aim9.brk_test.ops_per_sec

(3) model: Sapphire Rapids
    nr_node: 2
    nr_cpu: 224
    memory: 512G
    brand: Intel(R) Xeon(R) Platinum 8480CTDX

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/brk_test/aim9/300s

fc21959f74bc1138            cacded5e42b9609b07b22d80c10
----------------            ---------------------------
         %stddev     %change         %stddev
             \          |                \
   3540976            -6.4%    3314159        aim9.brk_test.ops_per_sec

(4) model: Ice Lake
    nr_node: 2
    nr_cpu: 64
    memory: 256G
    brand: Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-icl-2sp9/brk_test/aim9/300s

fc21959f74bc1138            cacded5e42b9609b07b22d80c10
----------------            ---------------------------
         %stddev     %change         %stddev
             \          |                \
   2667734            -5.6%    2518021        aim9.brk_test.ops_per_sec

> > parameters:
> >
> > 	testtime: 300s
> > 	test: brk_test
> > 	cpufreq_governor: performance
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > | Closes: https://lore.kernel.org/oe-lkp/202409301043.629bea78-oliver.sang@xxxxxxxxx
> >
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> >
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20240930/202409301043.629bea78-oliver.sang@xxxxxxxxx
> >
> > =========================================================================================
> > compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> >   gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-ivb-2ep2/brk_test/aim9/300s
> >
> > commit:
> >   fc21959f74 ("mm: abstract vma_expand() to use vma_merge_struct")
> >   cacded5e42 ("mm: avoid using vma_merge() for new VMAs")
>
> Yup this results in a different code path for brk(), but local testing
> indicated no regression (a prior revision of the series had encountered
> one, so I carefully assessed this, found the bug, and noted no clear
> regression after this - but a lot of variance in the numbers).
>
> > fc21959f74bc1138            cacded5e42b9609b07b22d80c10
> > ----------------            ---------------------------
> >          %stddev     %change         %stddev
> >              \          |                \
> >    1322908            -5.0%    1256536        aim9.brk_test.ops_per_sec
>
> Unfortunate there's no stddev figure here, and 5% feels borderline on noise
> - as above it'd be great to get some multiple runs going to rule out
> noise. Thanks!
As mentioned above, the reason there is no %stddev here is that it's <2%.
Just listing the raw data FYI.

for cacded5e42b9609b07b22d80c10
  "aim9.brk_test.ops_per_sec": [
    1268030.0,
    1277110.76,
    1226452.45,
    1275850.0,
    1249628.35,
    1242148.6
  ],

for fc21959f74bc1138
  "aim9.brk_test.ops_per_sec": [
    1351624.95,
    1316322.79,
    1330363.33,
    1289563.33,
    1314100.0,
    1335475.48
  ],

> >     201.54            +2.9%     207.44        aim9.time.system_time
> >      97.58            -6.0%      91.75        aim9.time.user_time
> >       0.04 ± 82%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> >       0.10 ± 60%    -100.0%       0.00        perf-sched.sch_delay.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> >       0.04 ± 82%    -100.0%       0.00        perf-sched.wait_time.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> >       0.10 ± 60%    -100.0%       0.00        perf-sched.wait_time.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> >   8.33e+08            +3.9%  8.654e+08        perf-stat.i.branch-instructions
> >       1.15            -0.1        1.09        perf-stat.i.branch-miss-rate%
> >   12964626            -1.9%   12711922        perf-stat.i.branch-misses
> >       1.11            -7.4%       1.03        perf-stat.i.cpi
> >  3.943e+09            +6.0%   4.18e+09        perf-stat.i.instructions
> >       0.91            +7.9%       0.98        perf-stat.i.ipc
> >       0.29 ±  2%      -9.1%       0.27 ±  4%  perf-stat.overall.MPKI
> >       1.56            -0.1        1.47        perf-stat.overall.branch-miss-rate%
> >       1.08            -6.8%       1.01        perf-stat.overall.cpi
> >       0.92            +7.2%       0.99        perf-stat.overall.ipc
> >  8.303e+08            +3.9%  8.627e+08        perf-stat.ps.branch-instructions
> >   12931205            -2.0%   12678170        perf-stat.ps.branch-misses
> >   3.93e+09            +6.0%  4.167e+09        perf-stat.ps.instructions
> >  1.184e+12            +6.1%  1.256e+12        perf-stat.total.instructions
> >       7.16 ±  2%      -0.4        6.76 ±  4%  perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.brk
> >       5.72 ±  2%      -0.4        5.35 ±  3%  perf-profile.calltrace.cycles-pp.perf_event_mmap_event.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64
> >       6.13 ±  2%      -0.3        5.84 ±  3%  perf-profile.calltrace.cycles-pp.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       0.83 ± 11%      -0.1        0.71 ±  5%  perf-profile.calltrace.cycles-pp.__vm_enough_memory.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       0.00            +0.6        0.58 ±  5%  perf-profile.calltrace.cycles-pp.mas_leaf_max_gap.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range
> >      16.73 ±  2%      +0.6       17.34        perf-profile.calltrace.cycles-pp.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> >       0.00            +0.7        0.66 ±  6%  perf-profile.calltrace.cycles-pp.mas_wr_store_type.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags
> >      24.21            +0.7       24.90        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> >      23.33            +0.7       24.05 ±  2%  perf-profile.calltrace.cycles-pp.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> >       0.00            +0.8        0.82 ±  4%  perf-profile.calltrace.cycles-pp.vma_complete.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> >       0.00            +0.9        0.87 ±  5%  perf-profile.calltrace.cycles-pp.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags
> >       0.00            +1.1        1.07 ±  9%  perf-profile.calltrace.cycles-pp.vma_prepare.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> >       0.00            +1.1        1.10 ±  6%  perf-profile.calltrace.cycles-pp.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> >       0.00            +2.3        2.26 ±  5%  perf-profile.calltrace.cycles-pp.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> >       0.00            +7.6        7.56 ±  3%  perf-profile.calltrace.cycles-pp.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64
> >       0.00            +8.6        8.62 ±  4%  perf-profile.calltrace.cycles-pp.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       7.74 ±  2%      -0.4        7.30 ±  4%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
> >       5.81 ±  2%      -0.4        5.43 ±  3%  perf-profile.children.cycles-pp.perf_event_mmap_event
> >       6.18 ±  2%      -0.3        5.88 ±  3%  perf-profile.children.cycles-pp.perf_event_mmap
> >       3.93            -0.2        3.73 ±  3%  perf-profile.children.cycles-pp.perf_iterate_sb
> >       0.22 ± 29%      -0.1        0.08 ± 17%  perf-profile.children.cycles-pp.may_expand_vm
> >       0.96 ±  3%      -0.1        0.83 ±  4%  perf-profile.children.cycles-pp.vma_complete
> >       0.61 ± 14%      -0.1        0.52 ±  7%  perf-profile.children.cycles-pp.percpu_counter_add_batch
> >       0.15 ±  7%      -0.1        0.08 ± 20%  perf-profile.children.cycles-pp.brk_test
> >       0.08 ± 11%      +0.0        0.12 ± 14%  perf-profile.children.cycles-pp.mas_prev_setup
> >       0.17 ± 12%      +0.1        0.27 ± 10%  perf-profile.children.cycles-pp.mas_wr_store_entry
> >       0.00            +0.2        0.15 ± 11%  perf-profile.children.cycles-pp.mas_next_range
> >       0.19 ±  8%      +0.2        0.38 ± 10%  perf-profile.children.cycles-pp.mas_next_slot
> >       0.34 ± 17%      +0.3        0.64 ±  6%  perf-profile.children.cycles-pp.mas_prev_slot
> >      23.40            +0.7       24.12 ±  2%  perf-profile.children.cycles-pp.__do_sys_brk
> >       0.00            +7.6        7.59 ±  3%  perf-profile.children.cycles-pp.vma_expand
> >       0.00            +8.7        8.66 ±  4%  perf-profile.children.cycles-pp.vma_merge_new_range
> >       1.61 ± 10%      -0.9        0.69 ±  8%  perf-profile.self.cycles-pp.do_brk_flags
> >       7.64 ±  2%      -0.4        7.20 ±  4%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
> >       0.22 ± 30%      -0.1        0.08 ± 17%  perf-profile.self.cycles-pp.may_expand_vm
> >       0.57 ± 15%      -0.1        0.46 ±  6%  perf-profile.self.cycles-pp.percpu_counter_add_batch
> >       0.15 ±  7%      -0.1        0.08 ± 20%  perf-profile.self.cycles-pp.brk_test
> >       0.20 ±  5%      -0.0        0.18 ±  4%  perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
> >       0.07 ± 18%      +0.0        0.10 ± 18%  perf-profile.self.cycles-pp.mas_prev_setup
> >       0.00            +0.1        0.09 ± 12%  perf-profile.self.cycles-pp.mas_next_range
> >       0.36 ±  8%      +0.1        0.45 ±  6%  perf-profile.self.cycles-pp.perf_event_mmap
> >       0.15 ± 13%      +0.1        0.25 ± 14%  perf-profile.self.cycles-pp.mas_wr_store_entry
> >       0.17 ± 11%      +0.2        0.37 ± 11%  perf-profile.self.cycles-pp.mas_next_slot
> >       0.34 ± 17%      +0.3        0.64 ±  6%  perf-profile.self.cycles-pp.mas_prev_slot
> >       0.00            +0.3        0.33 ±  5%  perf-profile.self.cycles-pp.vma_merge_new_range
> >       0.00            +0.8        0.81 ±  9%  perf-profile.self.cycles-pp.vma_expand
> >
> >
> > Disclaimer:
> > Results have been estimated based on internal Intel analysis and are provided
> > for informational purposes only. Any difference in system hardware or software
> > design or configuration may affect actual performance.
> >
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
>
> Overall, previously we special-cased brk() to avoid regression, but the
> special-casing is horribly duplicative and bug-prone so, while we can
> revert to doing that again, I'd really, really like to avoid it if we
> possibly can :)
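BTW, as a quick cross-check of the <2% threshold, the means and %stddev
can be recomputed from the raw runs listed above. This is just a small
sketch (the variable names "parent"/"patched" are labels I picked, not
anything from the lkp tooling; it uses the sample stddev, so the exact
figure may differ slightly from what our tooling reports):

```python
import statistics

# Raw aim9.brk_test.ops_per_sec runs, copied from the lists above.
parent = [1351624.95, 1316322.79, 1330363.33,   # fc21959f74bc1138
          1289563.33, 1314100.0, 1335475.48]
patched = [1268030.0, 1277110.76, 1226452.45,   # cacded5e42b9609b07b22d80c10
           1275850.0, 1249628.35, 1242148.6]

for name, runs in (("fc21959f74bc1138", parent),
                   ("cacded5e42b9609b07b22d80c10", patched)):
    mean = statistics.mean(runs)
    pct_stddev = 100 * statistics.stdev(runs) / mean
    # Both come out well under the 2% reporting threshold.
    print(f"{name}: mean={mean:.0f} %stddev={pct_stddev:.2f}%")

change = 100 * (statistics.mean(patched) / statistics.mean(parent) - 1)
print(f"change: {change:.1f}%")
```

The means reproduce the 1322908 / 1256536 figures in the table, the
per-commit %stddev is around 1.5-1.6% for both commits, and the change
works out to -5.0%, so the regression is outside the run-to-run noise.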