Hello,

this commit fixes the
"[linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression"
we reported in
https://lore.kernel.org/oe-lkp/202311201406.2022ca3f-oliver.sang@xxxxxxxxx/

In our tests, besides the improvement in the will-it-scale tests, we also
noticed an improvement in the lmbench3 latency tests, so we are reporting it
below FYI.


kernel test robot noticed a -5.0% improvement of lmbench3.Select.100tcp.latency.us on:


commit: 253ca8678d30bcf94410b54476fc1e0f1627a137 ("Improve __fget_files_rcu() code generation (and thus __fget_light())")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

testcase: lmbench3
test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
parameters:

	test_memory_size: 50%
	nr_threads: 50%
	mode: development
	test: SELECT
	cpufreq_governor: performance


In addition to that, the commit also has significant impact on the following tests:

+------------------+----------------------------------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_process_ops 10.3% improvement                                     |
| test machine     | 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory |
| test parameters  | cpufreq_governor=performance                                                                       |
|                  | mode=process                                                                                       |
|                  | nr_task=100%                                                                                       |
|                  | test=poll2                                                                                         |
+------------------+----------------------------------------------------------------------------------------------------+


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231222/202312221056.da0e7f9-oliver.sang@xxxxxxxxx

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_threads/rootfs/tbox_group/test/test_memory_size/testcase:
  gcc-12/performance/x86_64-rhel-8.3/development/50%/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/SELECT/50%/lmbench3

commit:
  7cb537b6f6 ("file: massage cleanup of files that failed to open")
  253ca8678d ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

7cb537b6f6d7d652 253ca8678d30bcf94410b54476f
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
      1.78            -9.8%       1.61        lmbench3.Select.100fd.latency.us
      5.70            -5.0%       5.41        lmbench3.Select.100tcp.latency.us
     12.09 ± 36%      -12.1       0.00        perf-profile.calltrace.cycles-pp.__fget_light.do_select.core_sys_select.kern_select.__x64_sys_select
      0.05 ±299%      +14.9      14.97 ± 51%  perf-profile.calltrace.cycles-pp.__fdget.do_select.core_sys_select.kern_select.__x64_sys_select
     12.09 ± 36%      -12.1       0.00        perf-profile.children.cycles-pp.__fget_light
      0.36 ± 42%      +14.6      14.98 ± 51%  perf-profile.children.cycles-pp.__fdget
     12.05 ± 36%      -12.1       0.00        perf-profile.self.cycles-pp.__fget_light
      0.31 ± 42%      +14.6      14.91 ± 52%  perf-profile.self.cycles-pp.__fdget
      0.19 ±  2%       +0.0       0.20 ±  3%  perf-stat.i.dTLB-store-miss-rate%
   1585715 ±  8%     +93.4%    3067285 ± 30%  perf-stat.i.iTLB-load-misses
      0.17 ±  2%       +0.0       0.19 ±  3%  perf-stat.overall.dTLB-store-miss-rate%
     88.15 ±  5%       +4.9      93.07        perf-stat.overall.iTLB-load-miss-rate%
     48830 ±  8%     -45.0%      26871 ± 25%  perf-stat.overall.instructions-per-iTLB-miss
      1.41            -1.8%       1.38        perf-stat.overall.ipc
   1573086 ±  8%     +93.7%    3047643 ± 30%  perf-stat.ps.iTLB-load-misses


***************************************************************************************************
lkp-cpl-4sp2: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/process/100%/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
  7cb537b6f6 ("file: massage cleanup of files that failed to open")
  253ca8678d ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

7cb537b6f6d7d652 253ca8678d30bcf94410b54476f
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    685.00 ±  5%     +62.3%       1111 ± 13%  perf-c2c.HITM.local
      0.04 ±187%    +482.9%       0.21 ± 50%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
    136406            +2.0%     139095        proc-vmstat.nr_active_anon
    136406            +2.0%     139095        proc-vmstat.nr_zone_active_anon
  98393191           +10.3%  1.085e+08        will-it-scale.224.processes
    439254           +10.3%     484377        will-it-scale.per_process_ops
  98393191           +10.3%  1.085e+08        will-it-scale.workload
      0.00           +28.2%       0.00 ± 17%  perf-stat.i.MPKI
 2.226e+11            -2.2%  2.178e+11        perf-stat.i.branch-instructions
      0.28             +0.0       0.30        perf-stat.i.branch-miss-rate%
 6.155e+08            +7.4%  6.608e+08        perf-stat.i.branch-misses
     12.91             -3.3       9.62 ± 13%  perf-stat.i.cache-miss-rate%
   1955843           +22.9%    2402856 ± 17%  perf-stat.i.cache-misses
  15946481           +59.2%   25391906 ±  9%  perf-stat.i.cache-references
      0.59            +5.0%       0.62        perf-stat.i.cpi
    408471           -17.9%     335390 ± 14%  perf-stat.i.cycles-between-cache-misses
 2.901e+11            -4.0%  2.784e+11        perf-stat.i.dTLB-loads
      0.00 ±  9%       +0.0       0.00 ± 10%  perf-stat.i.dTLB-store-miss-rate%
 1.814e+11           -12.6%  1.585e+11        perf-stat.i.dTLB-stores
  26765498            +9.7%   29360826        perf-stat.i.iTLB-load-misses
  1.23e+12            -4.4%  1.176e+12        perf-stat.i.instructions
     46105           -12.9%      40163        perf-stat.i.instructions-per-iTLB-miss
      1.69            -4.8%       1.61        perf-stat.i.ipc
      1.30            -4.1%       1.24        perf-stat.i.metric.G/sec
     75.67           +56.5%     118.40 ±  9%  perf-stat.i.metric.K/sec
      1802            -6.9%       1679        perf-stat.i.metric.M/sec
     91.19             +1.9      93.14        perf-stat.i.node-load-miss-rate%
    603847           +29.4%     781631 ± 13%  perf-stat.i.node-load-misses
      0.00 ± 44%     +54.2%       0.00 ± 17%  perf-stat.overall.MPKI
      0.23 ± 44%       +0.1       0.30        perf-stat.overall.branch-miss-rate%
      0.49 ± 44%     +26.0%       0.62        perf-stat.overall.cpi
      0.00 ± 46%       +0.0       0.00 ± 10%  perf-stat.overall.dTLB-store-miss-rate%
     73.34 ± 44%      +18.0      91.29        perf-stat.overall.node-load-miss-rate%
 5.111e+08 ± 44%     +28.9%  6.586e+08        perf-stat.ps.branch-misses
   1626781 ± 44%     +47.4%    2397620 ± 17%  perf-stat.ps.cache-misses
  13269755 ± 44%     +91.5%   25415998 ±  9%  perf-stat.ps.cache-references
  22231799 ± 44%     +31.6%   29255242        perf-stat.ps.iTLB-load-misses
    501267 ± 44%     +55.4%     779219 ± 13%  perf-stat.ps.node-load-misses
     16030 ± 45%     +33.6%      21409 ±  6%  perf-stat.ps.node-stores
     47.56            -47.6       0.00        perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     67.41             -2.9      64.56        perf-profile.calltrace.cycles-pp.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
     87.35             -1.2      86.15        perf-profile.calltrace.cycles-pp.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
     87.96             -1.1      86.82        perf-profile.calltrace.cycles-pp.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
     88.69             -1.1      87.62        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
     89.02             -1.1      87.97        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__poll
     91.89             -0.8      91.12        perf-profile.calltrace.cycles-pp.__poll
      0.81             +0.0       0.85        perf-profile.calltrace.cycles-pp.__check_heap_object.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.64             +0.1       0.69 ±  2%  perf-profile.calltrace.cycles-pp.__kmem_cache_free.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.68             +0.1       0.74        perf-profile.calltrace.cycles-pp.kfree.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.26             +0.1       1.32        perf-profile.calltrace.cycles-pp.check_heap_object.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.84             +0.1       0.94 ±  2%  perf-profile.calltrace.cycles-pp.__virt_addr_valid.check_heap_object.__check_object_size.do_sys_poll.__x64_sys_poll
      1.53             +0.1       1.67        perf-profile.calltrace.cycles-pp.__kmem_cache_alloc_node.__kmalloc.do_sys_poll.__x64_sys_poll.do_syscall_64
      1.82             +0.2       1.98        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__poll
      2.60             +0.2       2.76        perf-profile.calltrace.cycles-pp.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.91             +0.2       2.09        perf-profile.calltrace.cycles-pp.__kmalloc.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.44 ±  2%       +0.2       2.62        perf-profile.calltrace.cycles-pp.rep_movs_alternative._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64
      3.86             +0.3       4.20        perf-profile.calltrace.cycles-pp._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
      7.94             +0.8       8.70        perf-profile.calltrace.cycles-pp.testcase
      3.60            +42.4      45.95        perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     45.80            -45.8       0.00        perf-profile.children.cycles-pp.__fget_light
     69.22             -2.7      66.50        perf-profile.children.cycles-pp.do_poll
     87.48             -1.2      86.29        perf-profile.children.cycles-pp.do_sys_poll
     87.99             -1.1      86.85        perf-profile.children.cycles-pp.__x64_sys_poll
     88.74             -1.1      87.67        perf-profile.children.cycles-pp.do_syscall_64
     89.06             -1.0      88.01        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     91.99             -0.8      91.23        perf-profile.children.cycles-pp.__poll
      0.08             +0.0       0.09 ±  4%  perf-profile.children.cycles-pp.is_vmalloc_addr
      0.14 ±  2%       +0.0       0.16 ±  3%  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.24             +0.0       0.26        perf-profile.children.cycles-pp.memcg_slab_post_alloc_hook
      0.16 ±  3%       +0.0       0.17        perf-profile.children.cycles-pp.rcu_all_qs
      0.13 ±  3%       +0.0       0.14 ±  2%  perf-profile.children.cycles-pp.kmalloc_slab
      0.12 ±  3%       +0.0       0.14 ±  3%  perf-profile.children.cycles-pp.syscall_enter_from_user_mode
      0.21 ±  2%       +0.0       0.24        perf-profile.children.cycles-pp.check_stack_object
      0.24 ±  2%       +0.0       0.27        perf-profile.children.cycles-pp.poll@plt
      0.15 ±  2%       +0.0       0.18 ±  2%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      0.24 ±  2%       +0.0       0.26        perf-profile.children.cycles-pp.__cond_resched
      0.36             +0.0       0.40 ±  2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.81             +0.0       0.86        perf-profile.children.cycles-pp.__check_heap_object
      0.48             +0.0       0.53        perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.65             +0.1       0.70        perf-profile.children.cycles-pp.__kmem_cache_free
      0.68             +0.1       0.74        perf-profile.children.cycles-pp.kfree
      0.70             +0.1       0.76        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      1.32             +0.1       1.39        perf-profile.children.cycles-pp.check_heap_object
      1.14             +0.1       1.23        perf-profile.children.cycles-pp.entry_SYSCALL_64
      0.85             +0.1       0.96        perf-profile.children.cycles-pp.__virt_addr_valid
      1.60             +0.1       1.76        perf-profile.children.cycles-pp.__kmem_cache_alloc_node
      2.76             +0.2       2.94        perf-profile.children.cycles-pp.__check_object_size
      1.94             +0.2       2.13        perf-profile.children.cycles-pp.__kmalloc
      2.48 ±  2%       +0.2       2.67        perf-profile.children.cycles-pp.rep_movs_alternative
      4.09             +0.4       4.45        perf-profile.children.cycles-pp._copy_from_user
      8.04             +0.8       8.81        perf-profile.children.cycles-pp.testcase
      3.58            +40.5      44.04        perf-profile.children.cycles-pp.__fdget
     43.81            -43.8       0.00        perf-profile.self.cycles-pp.__fget_light
      0.40             -0.0       0.38        perf-profile.self.cycles-pp.check_heap_object
      0.15             +0.0       0.16        perf-profile.self.cycles-pp.poll_select_set_timeout
      0.06             +0.0       0.07        perf-profile.self.cycles-pp.is_vmalloc_addr
      0.10 ±  4%       +0.0       0.12 ±  4%  perf-profile.self.cycles-pp.exit_to_user_mode_prepare
      0.14 ±  2%       +0.0       0.15 ±  2%  perf-profile.self.cycles-pp.rcu_all_qs
      0.11 ±  4%       +0.0       0.13 ±  2%  perf-profile.self.cycles-pp.kmalloc_slab
      0.11             +0.0       0.12 ±  4%  perf-profile.self.cycles-pp.syscall_enter_from_user_mode
      0.21             +0.0       0.23 ±  2%  perf-profile.self.cycles-pp.memcg_slab_post_alloc_hook
      0.14 ±  3%       +0.0       0.16        perf-profile.self.cycles-pp.poll@plt
      0.18 ±  2%       +0.0       0.20        perf-profile.self.cycles-pp.check_stack_object
      0.15 ±  2%       +0.0       0.17 ±  2%  perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
      0.22 ±  2%       +0.0       0.24 ±  2%  perf-profile.self.cycles-pp.__kmalloc
      0.32 ±  2%       +0.0       0.34 ±  2%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.25             +0.0       0.28        perf-profile.self.cycles-pp.do_syscall_64
      0.43             +0.0       0.47        perf-profile.self.cycles-pp.__check_object_size
      0.45             +0.0       0.48        perf-profile.self.cycles-pp.entry_SYSCALL_64
      0.36             +0.0       0.40 ±  2%  perf-profile.self.cycles-pp.__x64_sys_poll
      0.81             +0.0       0.85        perf-profile.self.cycles-pp.__check_heap_object
      0.48             +0.0       0.52        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.65             +0.1       0.70        perf-profile.self.cycles-pp.__kmem_cache_free
      0.68             +0.1       0.74        perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.66             +0.1       0.72        perf-profile.self.cycles-pp.kfree
      0.81             +0.1       0.91 ±  2%  perf-profile.self.cycles-pp.__virt_addr_valid
      1.05 ±  4%       +0.1       1.16 ±  3%  perf-profile.self.cycles-pp.__poll
      1.13             +0.1       1.24        perf-profile.self.cycles-pp.__kmem_cache_alloc_node
      1.73             +0.2       1.90        perf-profile.self.cycles-pp._copy_from_user
      2.33 ±  2%       +0.2       2.52        perf-profile.self.cycles-pp.rep_movs_alternative
      8.10             +0.7       8.80        perf-profile.self.cycles-pp.do_sys_poll
      7.94             +0.8       8.69        perf-profile.self.cycles-pp.testcase
     23.27             +1.0      24.26        perf-profile.self.cycles-pp.do_poll
      1.79            +40.1      41.93        perf-profile.self.cycles-pp.__fdget


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki