On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote: > Hello Mathieu, > > On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote: > > The following commit has been merged into the sched/core branch of tip: > > > > Commit-ID: 223baf9d17f25e2608dbdff7232c095c1e612268 > > Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268 > > Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> > > AuthorDate: Thu, 20 Apr 2023 10:55:48 -04:00 > > Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > > CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00 > > > > sched: Fix performance regression introduced by mm_cid > > > > Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL > > sysbench regression reported by Aaron Lu. > > > > Keep track of the currently allocated mm_cid for each mm/cpu rather than > > freeing them immediately on context switch. This eliminates most atomic > > operations when context switching back and forth between threads > > belonging to different memory spaces in multi-threaded scenarios (many > > processes, each with many threads). The per-mm/per-cpu mm_cid values are > > serialized by their respective runqueue locks. > > > > Thread migration is handled by introducing invocation to > > sched_mm_cid_migrate_to() (with destination runqueue lock held) in > > activate_task() for migrating tasks. If the destination cpu's mm_cid is > > unset, and if the source runqueue is not actively using its mm_cid, then > > the source cpu's mm_cid is moved to the destination cpu on migration. > > > > Introduce a task-work executed periodically, similarly to NUMA work, > > which delays reclaim of cid values when they are unused for a period of > > time. > > > > Keep track of the allocation time for each per-cpu cid, and let the task > > work clear them when they are observed to be older than > > SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all > > mm_cids which are greater or equal to the Hamming weight of the mm > > cidmask to keep concurrency ids compact. > > > > Because we want to ensure the mm_cid converges towards the smaller > > values as migrations happen, the prior optimization that was done when > > context switching between threads belonging to the same mm is removed, > > because it could delay the lazy release of the destination runqueue > > mm_cid after it has been replaced by a migration. Removing this prior > > optimization is not an issue performance-wise because the introduced > > per-mm/per-cpu mm_cid tracking also covers this more specific case. > > > > Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") > > Reported-by: Aaron Lu <aaron.lu@xxxxxxxxx> > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx> > > Tested-by: Aaron Lu <aaron.lu@xxxxxxxxx> > > Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/ > > I run standard benchmarks as a part of kernel performance regression > testing. When I run these benchmarks against v6.3.0 to v6.4-rc1, > I have seen performance regression in hackbench running with threads. When I did > git bisect it pointed to this commit and reverting this commit helps regains > the performance. This regression is not seen with hackbench processes. Well, *this* commit was supposed to help fix the horrible contention on cid_lock that was introduced with af7f588d8f73. > Following are the results from 1 Socket 4th generation EPYC > Processor(1 X 96C/192T) configured in NPS1 mode. This regression > becomes more severe as the number of core count increases. > > The numbers on a 1 Socket Bergamo (1 X 128 cores/256 threads) is significantly worse. > > Threads: > > Test: With-mmcid-patch Without-mmcid-patch > 1-groups: 5.23 (0.00 pct) 4.61 (+11.85 pct) > 2-groups: 4.99 (0.00 pct) 4.72 (+5.41 pct) > 4-groups: 5.96 (0.00 pct) 4.87 (+18.28 pct) > 8-groups: 6.58 (0.00 pct) 5.44 (+17.32 pct) > 16-groups: 11.48 (0.00 pct) 8.07 (+29.70 pct) I'm really confused, so you're saying that having a process wide spinlock is better than what this patch does? Or are you testing against something without mm-cid entirely?