Hello Mathieu,

On 6/20/2023 4:21 PM, Mathieu Desnoyers wrote:
> On 6/20/23 06:35, Swapnil Sapkal wrote:
>> Hello Peter,
>>
>> On 6/20/2023 2:41 PM, Peter Zijlstra wrote:
>>> On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
>>>> Hello Mathieu,
>>>>
>>>> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>>>>> The following commit has been merged into the sched/core branch of tip:
>>>>>
>>>>> Commit-ID:     223baf9d17f25e2608dbdff7232c095c1e612268
>>>>> Gitweb:        https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
>>>>> Author:        Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>>>>> AuthorDate:    Thu, 20 Apr 2023 10:55:48 -04:00
>>>>> Committer:     Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>>>> CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
>>>>>
>>>>> sched: Fix performance regression introduced by mm_cid
>>>>>
>>>>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>>>>> sysbench regression reported by Aaron Lu.
>>>>>
>>>>> Keep track of the currently allocated mm_cid for each mm/cpu rather than
>>>>> freeing them immediately on context switch. This eliminates most atomic
>>>>> operations when context switching back and forth between threads
>>>>> belonging to different memory spaces in multi-threaded scenarios (many
>>>>> processes, each with many threads). The per-mm/per-cpu mm_cid values are
>>>>> serialized by their respective runqueue locks.
>>>>>
>>>>> Thread migration is handled by introducing invocation to
>>>>> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
>>>>> activate_task() for migrating tasks. If the destination cpu's mm_cid is
>>>>> unset, and if the source runqueue is not actively using its mm_cid, then
>>>>> the source cpu's mm_cid is moved to the destination cpu on migration.
>>>>>
>>>>> Introduce a task-work executed periodically, similarly to NUMA work,
>>>>> which delays reclaim of cid values when they are unused for a period of
>>>>> time.
>>>>>
>>>>> Keep track of the allocation time for each per-cpu cid, and let the task
>>>>> work clear them when they are observed to be older than
>>>>> SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
>>>>> mm_cids which are greater or equal to the Hamming weight of the mm
>>>>> cidmask to keep concurrency ids compact.
>>>>>
>>>>> Because we want to ensure the mm_cid converges towards the smaller
>>>>> values as migrations happen, the prior optimization that was done when
>>>>> context switching between threads belonging to the same mm is removed,
>>>>> because it could delay the lazy release of the destination runqueue
>>>>> mm_cid after it has been replaced by a migration. Removing this prior
>>>>> optimization is not an issue performance-wise because the introduced
>>>>> per-mm/per-cpu mm_cid tracking also covers this more specific case.
>>>>>
>>>>> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
>>>>> Reported-by: Aaron Lu <aaron.lu@xxxxxxxxx>
>>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>>>> Tested-by: Aaron Lu <aaron.lu@xxxxxxxxx>
>>>>> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>>>>
>>>> I run standard benchmarks as a part of kernel performance regression
>>>> testing. When I run these benchmarks against v6.3.0 to v6.4-rc1,
>>>> I have seen performance regression in hackbench running with threads.
>>>> When I did git bisect it pointed to this commit and reverting this
>>>> commit helps regains the performance. This regression is not seen with
>>>> hackbench processes.
>>>
>>> Well, *this* commit was supposed to help fix the horrible contention on
>>> cid_lock that was introduced with af7f588d8f73.
>>
>> I went back and tested the commit that introduced mm_cid and I found that
>> the original implementation actually helped hackbench. Following are
>> numbers from 2 Socket Zen3 Server (2 X 64C/128T):
>>
>> Test:          base (v6.2-rc1)     base + orig_mm_cid
>>  1-groups:     4.29 (0.00 pct)     4.32 (-0.69 pct)
>>  2-groups:     4.96 (0.00 pct)     4.94 (0.40 pct)
>>  4-groups:     5.21 (0.00 pct)     4.10 (21.30 pct)
>>  8-groups:     5.44 (0.00 pct)     4.50 (17.27 pct)
>> 16-groups:     7.09 (0.00 pct)     5.28 (25.52 pct)
>>
>> I see following IBS traces in this case:
>>
>> Base:
>>
>>    6.69%  sched-messaging  [kernel.vmlinux]  [k] copy_user_generic_string
>>    5.38%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    3.73%  swapper          [kernel.vmlinux]  [k] __switch_to_asm
>>    3.23%  sched-messaging  [kernel.vmlinux]  [k] __calc_delta
>>    2.93%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>>    2.63%  sched-messaging  [kernel.vmlinux]  [k] dequeue_task_fair
>>    2.56%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>
>> Base + orig_mm_cid:
>>
>>   13.70%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>   11.87%  swapper          [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    8.99%  sched-messaging  [kernel.vmlinux]  [k] copy_user_generic_string
>>    6.08%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>    4.79%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>>    3.71%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
>>    3.66%  sched-messaging  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
>>    3.11%  sched-messaging  [kernel.vmlinux]  [k] _copy_from_iter
>>
>>>> Following are the results from 1 Socket 4th generation EPYC
>>>> Processor(1 X 96C/192T) configured in NPS1 mode. This regression
>>>> becomes more severe as the number of core count increases.
>>>>
>>>> The numbers on a 1 Socket Bergamo (1 X 128 cores/256 threads) is
>>>> significantly worse.
>>>>
>>>> Threads:
>>>>
>>>> Test:          With-mmcid-patch    Without-mmcid-patch
>>>>  1-groups:     5.23 (0.00 pct)     4.61 (+11.85 pct)
>>>>  2-groups:     4.99 (0.00 pct)     4.72 (+5.41 pct)
>>>>  4-groups:     5.96 (0.00 pct)     4.87 (+18.28 pct)
>>>>  8-groups:     6.58 (0.00 pct)     5.44 (+17.32 pct)
>>>> 16-groups:    11.48 (0.00 pct)     8.07 (+29.70 pct)
>>>
>>> I'm really confused, so you're saying that having a process wide
>>> spinlock is better than what this patch does? Or are you testing against
>>> something without mm-cid entirely?
>>
>> It does look like the lock contention introduced by the original mm_cid
>> patch helped hackbench in this case. In that case, I see hackbench threads
>> run for longer on average (avg_atom) and total idle entries are down
>> significantly. Even on disabling C1 and C2, I see similar behavior. With
>> the new mm_cid patch that gets rid of the lock contention, we see a drop
>> in the hackbench performance.
>>
>> I will go dig into this further meanwhile if you have any pointers please
>> do let me know.
>
> I suspect the baseline don't have spinlock contention because the
> test-case schedules between threads belonging to the same process, for
> which the initial mm_cid patch had an optimization which skips the
> spinlock entirely.
>
> This optimization for inter-thread scheduling had to be removed in the
> following patch to address the performance issue more generally, covering
> the inter-process scheduling.
>
> I suspect the regression is caused by the mm_count cache line bouncing.
>
> Please try with this additional patch applied:
>
> https://lore.kernel.org/lkml/20230515143536.114960-1-mathieu.desnoyers@xxxxxxxxxxxx/
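
For anyone following the quoted commit message, the "keep concurrency ids compact" step can be pictured with a small toy model. This is only a sketch in Python, not the kernel implementation: the function name, the dict-based per-CPU table, and the rule that a cid on a CPU currently running a thread of the mm is left alone are all assumptions made for illustration. The only part taken from the commit message is that cids greater than or equal to the Hamming weight of the cid mask get cleared.

```python
# Toy model only, NOT kernel code. All names here are hypothetical.

def compact_cids(per_cpu_cid, in_use_cpus):
    """Clear per-CPU cids that are >= the Hamming weight of the cid mask.

    per_cpu_cid: dict mapping cpu number -> cid, or None when unset.
    in_use_cpus: set of cpus assumed to be running a thread of this mm;
                 their cids are left untouched (assumption for this sketch).
    """
    # Build a bitmask with one bit set per allocated cid.
    cidmask = 0
    for cid in per_cpu_cid.values():
        if cid is not None:
            cidmask |= 1 << cid

    # Hamming weight = number of bits set = number of allocated cids.
    weight = bin(cidmask).count("1")

    # Any cid >= weight lies outside the compact range [0, weight).
    for cpu, cid in per_cpu_cid.items():
        if cid is not None and cid >= weight and cpu not in in_use_cpus:
            per_cpu_cid[cpu] = None
    return per_cpu_cid


# Three cids are allocated (0, 1, 5), so the weight is 3 and cid 5 is
# outside the compact range; it is cleared because cpu 2 is idle here.
table = compact_cids({0: 0, 1: 1, 2: 5, 3: None}, in_use_cpus={0})
print(table)
```

In this model the cleared slot is free to pick up a smaller cid on its next allocation, which is how the values would converge towards the low end of the range.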

Thanks for the suggestion. I tried the patch you suggested. I see an
improvement in the hackbench numbers with mm_count padding, but it does
not match what we get by reverting the new mm_cid patch.

Below are the results on the 1 Socket 4th Generation EPYC Processor
(1 x 96C/192T):

Threads:

Test:          Base (v6.4-rc1)     Base + new_mmcid_reverted   Base + mm_count_padding
 1-groups:     5.23 (0.00 pct)     4.61 (11.85 pct)            5.11 (2.29 pct)
 2-groups:     4.99 (0.00 pct)     4.72 (5.41 pct)             5.00 (-0.20 pct)
 4-groups:     5.96 (0.00 pct)     4.87 (18.28 pct)            5.86 (1.67 pct)
 8-groups:     6.58 (0.00 pct)     5.44 (17.32 pct)            6.20 (5.77 pct)
16-groups:    11.48 (0.00 pct)     8.07 (29.70 pct)           10.68 (6.96 pct)

Processes:

Test:          Base (v6.4-rc1)     Base + new_mmcid_reverted   Base + mm_count_padding
 1-groups:     5.19 (0.00 pct)     4.90 (5.58 pct)             5.19 (0.00 pct)
 2-groups:     5.44 (0.00 pct)     5.39 (0.91 pct)             5.39 (0.91 pct)
 4-groups:     5.69 (0.00 pct)     5.64 (0.87 pct)             5.64 (0.87 pct)
 8-groups:     6.08 (0.00 pct)     6.01 (1.15 pct)             6.04 (0.65 pct)
16-groups:    10.87 (0.00 pct)    10.83 (0.36 pct)            10.93 (-0.55 pct)

The IBS profile shows that __switch_to_asm() is at the top in the
baseline run and is no longer near the top with the mm_count padding
patch.
I am attaching the IBS profile data for all 3 runs:

# Base (v6.4-rc1)
Threads:
Total time: 11.486 [sec]

   5.15%  sched-messaging  [kernel.vmlinux]  [k] __switch_to_asm
   4.31%  sched-messaging  [kernel.vmlinux]  [k] copyout
   4.29%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   4.22%  sched-messaging  [kernel.vmlinux]  [k] copyin
   3.92%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
   2.91%  sched-messaging  [kernel.vmlinux]  [k] __schedule
   2.34%  swapper          [kernel.vmlinux]  [k] __switch_to_asm
   2.10%  sched-messaging  [kernel.vmlinux]  [k] prepare_to_wait_event
   2.10%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
   2.07%  sched-messaging  [kernel.vmlinux]  [k] finish_task_switch.isra.0
   2.00%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
   1.82%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
   1.73%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
   1.52%  sched-messaging  [kernel.vmlinux]  [k] __entry_text_start
   1.49%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
   1.45%  sched-messaging  libc.so.6         [.] write
   1.44%  swapper          [kernel.vmlinux]  [k] native_sched_clock
   1.38%  sched-messaging  [kernel.vmlinux]  [k] psi_group_change
   1.38%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
   1.37%  sched-messaging  libc.so.6         [.] read
   1.06%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
   1.01%  swapper          [kernel.vmlinux]  [k] psi_group_change
   1.00%  sched-messaging  [kernel.vmlinux]  [k] update_curr

# Base + mm_count_padding
Threads:
Total time: 11.384 [sec]

   4.43%  sched-messaging  [kernel.vmlinux]  [k] copyin
   4.39%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   4.07%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
   4.07%  sched-messaging  [kernel.vmlinux]  [k] copyout
   2.49%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSCALL_64
   2.37%  sched-messaging  [kernel.vmlinux]  [k] update_cfs_group
   2.19%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
   2.00%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
   1.93%  swapper          [kernel.vmlinux]  [k] update_load_avg
   1.81%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
   1.69%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
   1.58%  sched-messaging  libc.so.6         [.] write
   1.53%  sched-messaging  [kernel.vmlinux]  [k] psi_group_change
   1.50%  sched-messaging  libc.so.6         [.] read
   1.50%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
   1.39%  sched-messaging  [kernel.vmlinux]  [k] update_load_avg
   1.39%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
   1.30%  sched-messaging  [kernel.vmlinux]  [k] update_curr
   1.28%  swapper          [kernel.vmlinux]  [k] psi_group_change
   1.16%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
   1.12%  sched-messaging  [kernel.vmlinux]  [k] vfs_write
   1.10%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSRETQ_unsafe_stack
   1.09%  sched-messaging  [kernel.vmlinux]  [k] __switch_to_asm
   1.08%  sched-messaging  [kernel.vmlinux]  [k] do_syscall_64
   1.06%  sched-messaging  [kernel.vmlinux]  [k] select_task_rq_fair
   1.03%  swapper          [kernel.vmlinux]  [k] update_cfs_group
   1.00%  swapper          [kernel.vmlinux]  [k] rb_insert_color

# Base + reverted_new_mm_cid
Threads:
Total time: 7.847 [sec]

  12.14%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   8.86%  swapper          [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   6.13%  sched-messaging  [kernel.vmlinux]  [k] copyin
   5.54%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
   3.59%  sched-messaging  [kernel.vmlinux]  [k] copyout
   2.61%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
   2.48%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
   2.33%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
   2.01%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
   1.96%  sched-messaging  [kernel.vmlinux]  [k] __entry_text_start
   1.91%  sched-messaging  libc.so.6         [.] write
   1.77%  sched-messaging  libc.so.6         [.] read
   1.64%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
   1.58%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
   1.52%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
   1.38%  sched-messaging  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
   1.35%  sched-messaging  [kernel.vmlinux]  [k] vfs_write
   1.28%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSRETQ_unsafe_stack
   1.28%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
   1.25%  sched-messaging  [kernel.vmlinux]  [k] do_syscall_64
   1.22%  sched-messaging  [kernel.vmlinux]  [k] __fget_light
   1.18%  sched-messaging  [kernel.vmlinux]  [k] mutex_lock
   1.12%  sched-messaging  [kernel.vmlinux]  [k] file_update_time
   1.04%  sched-messaging  [kernel.vmlinux]  [k] _copy_from_iter
   1.01%  sched-messaging  [kernel.vmlinux]  [k] current_time

So with the new mm_cid patch reverted, a lot of time is spent in
native_queued_spin_lock_slowpath, and yet hackbench finishes faster.

I will keep digging into this. Please let me know if you have any
pointers for me.
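
As a cross-check on the threads table, the pct column appears to be the relative reduction in hackbench time against the base column, that is (base - value) / base * 100. This formula is an assumption inferred from the numbers rather than something stated by the benchmark tooling; it reproduces the reported values to within about 0.01, presumably because the times shown are rounded. The function name below is hypothetical. A minimal Python sketch:

```python
def pct_improvement(base, value):
    """Relative reduction in run time versus base, in percent.

    Positive means the run finished faster than base (assumed convention).
    """
    return (base - value) / base * 100.0


# Threads results copied from the table: (base, reverted, mm_count padding).
threads = {
    "1-groups": (5.23, 4.61, 5.11),
    "2-groups": (4.99, 4.72, 5.00),
    "4-groups": (5.96, 4.87, 5.86),
    "8-groups": (6.58, 5.44, 6.20),
    "16-groups": (11.48, 8.07, 10.68),
}

for name, (base, reverted, padded) in threads.items():
    print(
        f"{name:>10}: reverted {pct_improvement(base, reverted):6.2f} pct, "
        f"padding {pct_improvement(base, padded):6.2f} pct"
    )
```

Computed this way, the padding patch recovers only a small part of the gap that the revert closes at 16 groups.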

> This patch has recently been merged into the mm tree.
>
> Thanks,
>
> Mathieu

--
Thanks and Regards,
Swapnil