On Mon, Jun 22, 2020 at 10:20:40AM -0500, Eric W. Biederman wrote:
> Junxiao Bi <junxiao.bi@xxxxxxxxxx> writes:
> > On 6/20/20 9:27 AM, Matthew Wilcox wrote:
> >> On Fri, Jun 19, 2020 at 05:42:45PM -0500, Eric W. Biederman wrote:
> >>> Junxiao Bi <junxiao.bi@xxxxxxxxxx> writes:
> >>>> Still high lock contention.  Collected the following hot path.
> >>> A different location this time.
> >>>
> >>> I know of at least exit_signal and exit_notify that take thread-wide
> >>> locks, and it looks like exit_mm is another.  Those don't use the same
> >>> locks as flushing proc.
> >>>
> >>> So I think you are simply seeing the result of a thundering herd of
> >>> threads shutting down at once.  Given that thread shutdown is
> >>> fundamentally a slow path, there is only so much that can be done.
> >>>
> >>> If you are up for a project to work through this thundering herd, I
> >>> expect I can help some.  It will be a long process of cleaning up
> >>> the entire thread exit process with an eye to performance.
> >> Wengang had some tests which produced wall-clock values for this
> >> problem, which I agree is more informative.
> >>
> >> I'm not entirely sure what the customer workload is that requires a
> >> highly threaded workload to also shut down quickly.  To my mind, an
> >> overall workload is normally composed of highly-threaded tasks that
> >> run for a long time and only shut down rarely (thus performance of
> >> shutdown is not important) and single-threaded tasks that run for a
> >> short time.
> >
> > The real workload is a Java application working in server-agent mode;
> > the issue happened on the agent side.  All the agent does is wait for
> > work dispatched from the server and execute it.  To execute one piece
> > of work, the agent starts lots of short-lived threads, so a lot of
> > threads can exit at the same time when there is a lot of work to
> > execute.  The contention on the exit path caused high %sys time,
> > which impacted other workloads.
>
> If I understand correctly, the Java VM is not exiting.  Just some of
> its threads.
>
> That is a very different problem to deal with.  There are many
> optimizations that are possible when _all_ of the threads are exiting
> that are not possible when only _many_ threads are exiting.

Ah!  Now I get it.  This explains why the dput() lock contention was so
important.  A new thread starting would block on that lock as it tried
to create its new directory in /proc/$pid/task/.

Terminating thousands of threads but not the entire process isn't going
to hit many of the locks (eg exit_signal() and exit_mm() aren't going
to be called).

So we need a more sophisticated micro-benchmark that is continually
starting threads and asking dozens-to-thousands of them to stop at the
same time (see the sketch at the end of this mail).  Otherwise we'll
try to fix lots of scalability problems that our customer doesn't care
about.

> Do you know if it is simply the cpu time or if it is the lock contention
> that is the problem?  If it is simply the cpu time we should consider if
> some of the locks that can be highly contended should become mutexes.
> Or perhaps something like Matthew's cpu pinning idea.

If we're not trying to optimise for the entire process going down, then
we definitely don't want my CPU pinning idea.
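
A minimal sketch of the kind of micro-benchmark I mean (purely
illustrative; the default wave size and count are made-up parameters,
not measurements from Junxiao's workload).  The main thread launches
waves of threads that park on a barrier and then all exit together,
while the process itself stays alive, so we exercise the
many-threads-exit path rather than whole-process exit:

/* exit_storm.c: gcc -O2 -pthread exit_storm.c -o exit_storm */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
	(void)arg;
	/* Wait until every thread in this wave has been created... */
	pthread_barrier_wait(&barrier);
	/* ...then all exit at once: many threads exiting while the
	 * process survives, like the Java agent's short-lived threads. */
	return NULL;
}

int main(int argc, char **argv)
{
	int threads = argc > 1 ? atoi(argv[1]) : 1000;	/* per wave */
	int waves = argc > 2 ? atoi(argv[2]) : 100;
	pthread_t *tids = calloc(threads, sizeof(*tids));
	struct timespec t0, t1;
	int i, w;

	if (!tids)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (w = 0; w < waves; w++) {
		pthread_barrier_init(&barrier, NULL, threads);
		/* Creating later threads overlaps with earlier ones
		 * already waiting, and joining overlaps with the herd
		 * of exits, so starts and stops are interleaved. */
		for (i = 0; i < threads; i++)
			if (pthread_create(&tids[i], NULL, worker, NULL))
				abort();
		for (i = 0; i < threads; i++)
			pthread_join(tids[i], NULL);
		pthread_barrier_destroy(&barrier);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%d waves x %d threads: %.2fs wall\n", waves, threads,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	free(tids);
	return 0;
}

Running it under perf (or just watching %sys in top) while varying the
wave size should show whether the exit-path contention scales with the
number of simultaneously exiting threads, without ever taking the whole
process down.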