Junxiao Bi <junxiao.bi@xxxxxxxxxx> writes: > On 6/18/20 5:02 PM, ebiederm@xxxxxxxxxxxx wrote: > >> Matthew Wilcox <willy@xxxxxxxxxxxxx> writes: >> >>> On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote: >>>> When debugging some performance issue, i found that thousands of threads >>>> exit around same time could cause a severe spin lock contention on proc >>>> dentry "/proc/$parent_process_pid/task/", that's because threads needs to >>>> clean up their pid file from that dir when exit. Check the following >>>> standalone test case that simulated the case and perf top result on v5.7 >>>> kernel. Any idea on how to fix this? >>> Thanks, Junxiao. >>> >>> We've looked at a few different ways of fixing this problem. >>> >>> Even though the contention is within the dcache, it seems like a usecase >>> that the dcache shouldn't be optimised for -- generally we do not have >>> hundreds of CPUs removing dentries from a single directory in parallel. >>> >>> We could fix this within procfs. We don't have a great patch yet, but >>> the current approach we're looking at allows only one thread at a time >>> to call dput() on any /proc/*/task directory. >>> >>> We could also look at fixing this within the scheduler. Only allowing >>> one CPU to run the threads of an exiting process would fix this particular >>> problem, but might have other consequences. >>> >>> I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7, >>> so that hope is ruled out. >> Does anyone know if problem new in v5.7? I am wondering if I introduced >> this problem when I refactored the code or if I simply churned the code >> but the issue remains effectively the same. > It's not new issue, we see it in old kernel like v4.14 >> >> Can you try only flushing entries when the last thread of the process is >> reaped? I think in practice we would want to be a little more >> sophisticated but it is a good test case to see if it solves the issue. > > Thank you. i will try and let you know. Assuming this works we can remove the hacking part of always doing this by only doing this if SIGNAL_GROUP_EXIT when we know this thundering herd will appear. We still need to do something with the list however. If your testing works out performance wise I will figure out what we need to make the hack generale and safe. Eric > Thanks, > > Junxiao. > >> >> diff --git a/kernel/exit.c b/kernel/exit.c >> index cebae77a9664..d56e4eb60bdd 100644 >> --- a/kernel/exit.c >> +++ b/kernel/exit.c >> @@ -152,7 +152,7 @@ void put_task_struct_rcu_user(struct task_struct *task) >> void release_task(struct task_struct *p) >> { >> struct task_struct *leader; >> - struct pid *thread_pid; >> + struct pid *thread_pid = NULL; >> int zap_leader; >> repeat: >> /* don't need to get the RCU readlock here - the process is dead and >> @@ -165,7 +165,8 @@ void release_task(struct task_struct *p) >> write_lock_irq(&tasklist_lock); >> ptrace_release_task(p); >> - thread_pid = get_pid(p->thread_pid); >> + if (p == p->group_leader) >> + thread_pid = get_pid(p->thread_pid); >> __exit_signal(p); >> /* >> @@ -188,8 +189,10 @@ void release_task(struct task_struct *p) >> } >> write_unlock_irq(&tasklist_lock); >> - proc_flush_pid(thread_pid); >> - put_pid(thread_pid); >> + if (thread_pid) { >> + proc_flush_pid(thread_pid); >> + put_pid(thread_pid); >> + } >> release_thread(p); >> put_task_struct_rcu_user(p); >>