On 6/21/23 14:51, Mathieu Desnoyers wrote:
> On 6/21/23 12:36, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
>> [...]
>>
>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>
>>> Please try with this additional patch applied:
>>> https://lore.kernel.org/lkml/20230515143536.114960-1-mathieu.desnoyers@xxxxxxxxxxxx/
>>
>> Thanks for the suggestion. I tried out with the patch you suggested. I am
>> seeing improvement in hackbench numbers with mm_count padding. But this is
>> not matching with what we achieved through reverting the new mm_cid patch.
>>
>> Below are the results on the 1 Socket 4th Generation EPYC Processor
>> (1 x 96C/192T):
>>
>> Threads:
>>
>> Test:             Base (v6.4-rc1)   Base + new_mmcid_reverted   Base + mm_count_padding
>>  1-groups:         5.23 (0.00 pct)          4.61 (11.85 pct)           5.11 (2.29 pct)
>>  2-groups:         4.99 (0.00 pct)          4.72 (5.41 pct)            5.00 (-0.20 pct)
>>  4-groups:         5.96 (0.00 pct)          4.87 (18.28 pct)           5.86 (1.67 pct)
>>  8-groups:         6.58 (0.00 pct)          5.44 (17.32 pct)           6.20 (5.77 pct)
>> 16-groups:        11.48 (0.00 pct)          8.07 (29.70 pct)          10.68 (6.96 pct)
>>
>> Processes:
>>
>> Test:             Base (v6.4-rc1)   Base + new_mmcid_reverted   Base + mm_count_padding
>>  1-groups:         5.19 (0.00 pct)          4.90 (5.58 pct)            5.19 (0.00 pct)
>>  2-groups:         5.44 (0.00 pct)          5.39 (0.91 pct)            5.39 (0.91 pct)
>>  4-groups:         5.69 (0.00 pct)          5.64 (0.87 pct)            5.64 (0.87 pct)
>>  8-groups:         6.08 (0.00 pct)          6.01 (1.15 pct)            6.04 (0.65 pct)
>> 16-groups:        10.87 (0.00 pct)         10.83 (0.36 pct)           10.93 (-0.55 pct)
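>>
>> For context, the padding patch essentially keeps mm_count on its own cache
>> line, along the lines of the sketch below (a simplified illustration with
>> made-up struct and field names, not the actual patch):
>>
>> #include <linux/cache.h>   /* ____cacheline_aligned_in_smp */
>> #include <linux/types.h>   /* atomic_t */
>>
>> struct mm_struct_like {
>>         /* hot, mostly-read fields (illustrative) */
>>         unsigned long pgd_like;
>>         unsigned long flags_like;
>>
>>         /*
>>          * Written from many CPUs via mmgrab()/mmdrop()-style atomics;
>>          * isolate it so those writes stop invalidating the cache line
>>          * shared with the fields above.
>>          */
>>         atomic_t mm_count ____cacheline_aligned_in_smp;
>>
>>         /* push whatever follows onto the next cache line as well */
>>         unsigned long other_hot_fields ____cacheline_aligned_in_smp;
>> };
>>
>> With mm_count isolated like this, the frequent reference count updates no
>> longer bounce the line holding the neighbouring hot fields.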
>>
>> The ibs profile shows that the function __switch_to_asm() is at the top in
>> the baseline run and is not seen with the mm_count padding patch. I am
>> attaching the full ibs profile data for all the 3 runs:
>>
>> # Base (v6.4-rc1)
>>
>> Threads:
>> Total time: 11.486 [sec]
>>
>>    5.15%  sched-messaging  [kernel.vmlinux]  [k] __switch_to_asm
>>    4.31%  sched-messaging  [kernel.vmlinux]  [k] copyout
>>    4.29%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    4.22%  sched-messaging  [kernel.vmlinux]  [k] copyin
>>    3.92%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>>    2.91%  sched-messaging  [kernel.vmlinux]  [k] __schedule
>>    2.34%  swapper          [kernel.vmlinux]  [k] __switch_to_asm
>>    2.10%  sched-messaging  [kernel.vmlinux]  [k] prepare_to_wait_event
>>    2.10%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>>    2.07%  sched-messaging  [kernel.vmlinux]  [k] finish_task_switch.isra.0
>>    2.00%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
>>    1.82%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
>>    1.73%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
>>    1.52%  sched-messaging  [kernel.vmlinux]  [k] __entry_text_start
>>    1.49%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>    1.45%  sched-messaging  libc.so.6         [.] write
>>    1.44%  swapper          [kernel.vmlinux]  [k] native_sched_clock
>>    1.38%  sched-messaging  [kernel.vmlinux]  [k] psi_group_change
>>    1.38%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
>>    1.37%  sched-messaging  libc.so.6         [.] read
>>    1.06%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
>>    1.01%  swapper          [kernel.vmlinux]  [k] psi_group_change
>>    1.00%  sched-messaging  [kernel.vmlinux]  [k] update_curr
>>
>> # Base + mm_count_padding
>>
>> Threads:
>> Total time: 11.384 [sec]
>>
>>    4.43%  sched-messaging  [kernel.vmlinux]  [k] copyin
>>    4.39%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    4.07%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>>    4.07%  sched-messaging  [kernel.vmlinux]  [k] copyout
>>    2.49%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSCALL_64
>>    2.37%  sched-messaging  [kernel.vmlinux]  [k] update_cfs_group
>>    2.19%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
>>    2.00%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
>>    1.93%  swapper          [kernel.vmlinux]  [k] update_load_avg
>>    1.81%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
>>    1.69%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>>    1.58%  sched-messaging  libc.so.6         [.] write
>>    1.53%  sched-messaging  [kernel.vmlinux]  [k] psi_group_change
>>    1.50%  sched-messaging  libc.so.6         [.] read
>>    1.50%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
>>    1.39%  sched-messaging  [kernel.vmlinux]  [k] update_load_avg
>>    1.39%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>    1.30%  sched-messaging  [kernel.vmlinux]  [k] update_curr
>>    1.28%  swapper          [kernel.vmlinux]  [k] psi_group_change
>>    1.16%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
>>    1.12%  sched-messaging  [kernel.vmlinux]  [k] vfs_write
>>    1.10%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSRETQ_unsafe_stack
>>    1.09%  sched-messaging  [kernel.vmlinux]  [k] __switch_to_asm
>>    1.08%  sched-messaging  [kernel.vmlinux]  [k] do_syscall_64
>>    1.06%  sched-messaging  [kernel.vmlinux]  [k] select_task_rq_fair
>>    1.03%  swapper          [kernel.vmlinux]  [k] update_cfs_group
>>    1.00%  swapper          [kernel.vmlinux]  [k] rb_insert_color
>>
>> # Base + reverted_new_mm_cid
>>
>> Threads:
>> Total time: 7.847 [sec]
>>
>>   12.14%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    8.86%  swapper          [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    6.13%  sched-messaging  [kernel.vmlinux]  [k] copyin
>>    5.54%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>>    3.59%  sched-messaging  [kernel.vmlinux]  [k] copyout
>>    2.61%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>    2.48%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
>>    2.33%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
>>    2.01%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
>>    1.96%  sched-messaging  [kernel.vmlinux]  [k] __entry_text_start
>>    1.91%  sched-messaging  libc.so.6         [.] write
>>    1.77%  sched-messaging  libc.so.6         [.] read
>>    1.64%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
>>    1.58%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
>>    1.52%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>>    1.38%  sched-messaging  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
>>    1.35%  sched-messaging  [kernel.vmlinux]  [k] vfs_write
>>    1.28%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSRETQ_unsafe_stack
>>    1.28%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
>>    1.25%  sched-messaging  [kernel.vmlinux]  [k] do_syscall_64
>>    1.22%  sched-messaging  [kernel.vmlinux]  [k] __fget_light
>>    1.18%  sched-messaging  [kernel.vmlinux]  [k] mutex_lock
>>    1.12%  sched-messaging  [kernel.vmlinux]  [k] file_update_time
>>    1.04%  sched-messaging  [kernel.vmlinux]  [k] _copy_from_iter
>>    1.01%  sched-messaging  [kernel.vmlinux]  [k] current_time
>>
>> So with the reverted new_mm_cid patch, we are seeing a lot of time being
>> spent in native_queued_spin_lock_slowpath and yet, hackbench finishes
>> faster.
>>
>> I will keep digging into this; please let me know if you have any pointers
>> for me.
>
> Do you have CONFIG_SECURITY_APPARMOR=y ?
> Can you try without ?
>
> I notice that apparmor_file_permission appears near the top of your
> profiles, and apparmor uses an internal aa_buffers_lock spinlock, which
> could possibly explain the top hits for native_queued_spin_lock_slowpath.
> My current suspicion is that the raw spinlock that was taken by
> "Base + reverted_new_mm_cid" changed the contention pattern on the
> apparmor lock enough to speed things up by pure accident.
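
For the "without apparmor" run, rebuilding with CONFIG_SECURITY_APPARMOR=n
would be the surest way; booting the same kernel with apparmor=0 on the
command line may also be enough to take it out of the picture.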
If apparmor happens to be the culprit here, we should have a hard look at this commit:
commit df323337e50 ("apparmor: Use a memory pool instead per-CPU caches")

which turned a per-cpu cache into a global memory pool protected by a
spinlock. It may benefit RT, but it does not appear to be so great at
scaling.
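
To illustrate the scaling concern, here is a rough sketch of the two
allocation schemes (simplified, not apparmor's actual code; the struct and
function names below are made up):

#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct scratch_buf {
	struct list_head list;
	char data[1024];
};

/*
 * Global pool: every CPU serializes on pool_lock, so with all 192 hardware
 * threads doing file permission checks the lock and its cache line become
 * a contention point, which is what native_queued_spin_lock_slowpath in
 * the profiles would reflect.
 */
static DEFINE_SPINLOCK(pool_lock);
static LIST_HEAD(pool);

static struct scratch_buf *pool_get(void)
{
	struct scratch_buf *buf = NULL;

	spin_lock(&pool_lock);
	if (!list_empty(&pool)) {
		buf = list_first_entry(&pool, struct scratch_buf, list);
		list_del(&buf->list);
	}
	spin_unlock(&pool_lock);
	return buf;	/* NULL means the caller falls back to allocating */
}

/*
 * Per-CPU cache (the scheme the commit above replaced): the fast path only
 * touches CPU-local data, so there is no shared lock or cache line to
 * contend on. Preemption handling is omitted here.
 */
static DEFINE_PER_CPU(struct scratch_buf *, cpu_buf);

static struct scratch_buf *percpu_get(void)
{
	return this_cpu_xchg(cpu_buf, NULL);	/* NULL means take the slow path */
}

Whether the real aa_buffers_lock behaves this badly here is still a guess,
but it would be consistent with native_queued_spin_lock_slowpath showing up
prominently in all three profiles.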

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com