On Mon, Sep 30, 2024 at 10:21:27AM GMT, kernel test robot wrote:
>
> Hello,
>
> kernel test robot noticed a -5.0% regression of aim9.brk_test.ops_per_sec on:
>
> commit: cacded5e42b9609b07b22d80c10f0076d439f7d1 ("mm: avoid using vma_merge() for new VMAs")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> testcase: aim9
> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 64G memory

Hm, quite an old microarchitecture, no? Would it be possible to try this on a
range of uarchs, especially more recent ones, with some repeated runs to rule
out statistical noise?

Much appreciated!

> parameters:
>
>	testtime: 300s
>	test: brk_test
>	cpufreq_governor: performance
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> | Closes: https://lore.kernel.org/oe-lkp/202409301043.629bea78-oliver.sang@xxxxxxxxx
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240930/202409301043.629bea78-oliver.sang@xxxxxxxxx
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
>   gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-ivb-2ep2/brk_test/aim9/300s
>
> commit:
>   fc21959f74 ("mm: abstract vma_expand() to use vma_merge_struct")
>   cacded5e42 ("mm: avoid using vma_merge() for new VMAs")

Yup, this results in a different code path for brk(), but local testing
indicated no regression (a prior revision of the series had encountered one,
so I carefully assessed it, found the bug, and observed no clear regression
after the fix - though there was a lot of variance in the numbers).

>
> fc21959f74bc1138 cacded5e42b9609b07b22d80c10
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>    1322908            -5.0%    1256536        aim9.brk_test.ops_per_sec

Unfortunately there's no stddev figure here, and 5% feels borderline on noise -
as above, it'd be great to get some repeated runs going to rule this out (I've
put a rough sketch of the kind of loop I have in mind at the bottom of this
mail).

Thanks!
>     201.54            +2.9%     207.44        aim9.time.system_time
>      97.58            -6.0%      91.75        aim9.time.user_time
>       0.04 ± 82%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
>       0.10 ± 60%    -100.0%       0.00        perf-sched.sch_delay.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
>       0.04 ± 82%    -100.0%       0.00        perf-sched.wait_time.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
>       0.10 ± 60%    -100.0%       0.00        perf-sched.wait_time.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
>   8.33e+08            +3.9%  8.654e+08        perf-stat.i.branch-instructions
>       1.15            -0.1        1.09        perf-stat.i.branch-miss-rate%
>   12964626            -1.9%   12711922        perf-stat.i.branch-misses
>       1.11            -7.4%       1.03        perf-stat.i.cpi
>  3.943e+09            +6.0%   4.18e+09        perf-stat.i.instructions
>       0.91            +7.9%       0.98        perf-stat.i.ipc
>       0.29 ±  2%      -9.1%       0.27 ±  4%  perf-stat.overall.MPKI
>       1.56            -0.1        1.47        perf-stat.overall.branch-miss-rate%
>       1.08            -6.8%       1.01        perf-stat.overall.cpi
>       0.92            +7.2%       0.99        perf-stat.overall.ipc
>  8.303e+08            +3.9%  8.627e+08        perf-stat.ps.branch-instructions
>   12931205            -2.0%   12678170        perf-stat.ps.branch-misses
>   3.93e+09            +6.0%  4.167e+09        perf-stat.ps.instructions
>  1.184e+12            +6.1%  1.256e+12        perf-stat.total.instructions
>       7.16 ±  2%      -0.4        6.76 ±  4%  perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.brk
>       5.72 ±  2%      -0.4        5.35 ±  3%  perf-profile.calltrace.cycles-pp.perf_event_mmap_event.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64
>       6.13 ±  2%      -0.3        5.84 ±  3%  perf-profile.calltrace.cycles-pp.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.83 ± 11%      -0.1        0.71 ±  5%  perf-profile.calltrace.cycles-pp.__vm_enough_memory.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.6        0.58 ±  5%  perf-profile.calltrace.cycles-pp.mas_leaf_max_gap.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range
>      16.73 ±  2%      +0.6       17.34        perf-profile.calltrace.cycles-pp.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
>       0.00            +0.7        0.66 ±  6%  perf-profile.calltrace.cycles-pp.mas_wr_store_type.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags
>      24.21            +0.7       24.90        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
>      23.33            +0.7       24.05 ±  2%  perf-profile.calltrace.cycles-pp.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
>       0.00            +0.8        0.82 ±  4%  perf-profile.calltrace.cycles-pp.vma_complete.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
>       0.00            +0.9        0.87 ±  5%  perf-profile.calltrace.cycles-pp.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags
>       0.00            +1.1        1.07 ±  9%  perf-profile.calltrace.cycles-pp.vma_prepare.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
>       0.00            +1.1        1.10 ±  6%  perf-profile.calltrace.cycles-pp.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
>       0.00            +2.3        2.26 ±  5%  perf-profile.calltrace.cycles-pp.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
>       0.00            +7.6        7.56 ±  3%  perf-profile.calltrace.cycles-pp.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64
>       0.00            +8.6        8.62 ±  4%  perf-profile.calltrace.cycles-pp.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       7.74 ±  2%      -0.4        7.30 ±  4%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
>       5.81 ±  2%      -0.4        5.43 ±  3%  perf-profile.children.cycles-pp.perf_event_mmap_event
>       6.18 ±  2%      -0.3        5.88 ±  3%  perf-profile.children.cycles-pp.perf_event_mmap
>       3.93            -0.2        3.73 ±  3%  perf-profile.children.cycles-pp.perf_iterate_sb
>       0.22 ± 29%      -0.1        0.08 ± 17%  perf-profile.children.cycles-pp.may_expand_vm
>       0.96 ±  3%      -0.1        0.83 ±  4%  perf-profile.children.cycles-pp.vma_complete
>       0.61 ± 14%      -0.1        0.52 ±  7%  perf-profile.children.cycles-pp.percpu_counter_add_batch
>       0.15 ±  7%      -0.1        0.08 ± 20%  perf-profile.children.cycles-pp.brk_test
>       0.08 ± 11%      +0.0        0.12 ± 14%  perf-profile.children.cycles-pp.mas_prev_setup
>       0.17 ± 12%      +0.1        0.27 ± 10%  perf-profile.children.cycles-pp.mas_wr_store_entry
>       0.00            +0.2        0.15 ± 11%  perf-profile.children.cycles-pp.mas_next_range
>       0.19 ±  8%      +0.2        0.38 ± 10%  perf-profile.children.cycles-pp.mas_next_slot
>       0.34 ± 17%      +0.3        0.64 ±  6%  perf-profile.children.cycles-pp.mas_prev_slot
>      23.40            +0.7       24.12 ±  2%  perf-profile.children.cycles-pp.__do_sys_brk
>       0.00            +7.6        7.59 ±  3%  perf-profile.children.cycles-pp.vma_expand
>       0.00            +8.7        8.66 ±  4%  perf-profile.children.cycles-pp.vma_merge_new_range
>       1.61 ± 10%      -0.9        0.69 ±  8%  perf-profile.self.cycles-pp.do_brk_flags
>       7.64 ±  2%      -0.4        7.20 ±  4%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
>       0.22 ± 30%      -0.1        0.08 ± 17%  perf-profile.self.cycles-pp.may_expand_vm
>       0.57 ± 15%      -0.1        0.46 ±  6%  perf-profile.self.cycles-pp.percpu_counter_add_batch
>       0.15 ±  7%      -0.1        0.08 ± 20%  perf-profile.self.cycles-pp.brk_test
>       0.20 ±  5%      -0.0        0.18 ±  4%  perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
>       0.07 ± 18%      +0.0        0.10 ± 18%  perf-profile.self.cycles-pp.mas_prev_setup
>       0.00            +0.1        0.09 ± 12%  perf-profile.self.cycles-pp.mas_next_range
>       0.36 ±  8%      +0.1        0.45 ±  6%  perf-profile.self.cycles-pp.perf_event_mmap
>       0.15 ± 13%      +0.1        0.25 ± 14%  perf-profile.self.cycles-pp.mas_wr_store_entry
>       0.17 ± 11%      +0.2        0.37 ± 11%  perf-profile.self.cycles-pp.mas_next_slot
>       0.34 ± 17%      +0.3        0.64 ±  6%  perf-profile.self.cycles-pp.mas_prev_slot
>       0.00            +0.3        0.33 ±  5%  perf-profile.self.cycles-pp.vma_merge_new_range
>       0.00            +0.8        0.81 ±  9%  perf-profile.self.cycles-pp.vma_expand
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>

Overall, previously we special-cased brk() to avoid a regression, but the
special-casing is horribly duplicative and bug-prone, so while we could revert
to doing that again, I'd really, really like to avoid it if we possibly can :)
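For what it's worth, below is a rough sketch of the kind of quick local
reproducer I have in mind - to be clear, this is NOT the actual aim9 brk_test
workload, and the pass count, iteration count and step size are entirely
arbitrary; it just hammers brk() and reports mean/stddev of calls per second
over a few passes, which should be enough to eyeball variance across uarchs:

/* brk_bench.c - build with something like: gcc -O2 brk_bench.c -lm -o brk_bench */
#define _DEFAULT_SOURCE
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PASSES	10		/* independent passes, to estimate stddev */
#define ITERS	200000		/* grow/shrink cycles per pass */
#define STEP	(64 * 1024)	/* how far to move the program break */

static double one_pass(void)
{
	void *base = sbrk(0);	/* current program break */
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		/* grow then shrink the heap via brk() */
		if (brk((char *)base + STEP) || brk(base)) {
			perror("brk");
			exit(EXIT_FAILURE);
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;

	return (2.0 * ITERS) / secs;	/* two brk() calls per iteration */
}

int main(void)
{
	double ops[PASSES], mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < PASSES; i++) {
		ops[i] = one_pass();
		mean += ops[i];
	}
	mean /= PASSES;

	for (i = 0; i < PASSES; i++)
		var += (ops[i] - mean) * (ops[i] - mean);

	printf("brk ops/sec: mean %.0f, stddev %.0f over %d passes\n",
	       mean, sqrt(var / PASSES), PASSES);

	return 0;
}

Obviously the numbers this spits out aren't directly comparable to aim9's
ops_per_sec - it's just a quick way of seeing whether the delta reproduces at
all outside of the lkp harness.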