On Tue, 2023-03-07 at 13:30 -0800, Andrew Morton wrote:
> On Tue, 7 Mar 2023 19:46:37 +0100 Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
> 
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> > 
> >   83.17% 0.25% nginx [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
> >   [...]
> >   |--66.96%--__fput
> >      |--60.04%--eventpoll_release_file
> >         |--58.41%--__mutex_lock.isra.6
> >            |--56.56%--osq_lock
> > 
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> > 
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> > 
> > To reduce the contention this patch introduces explicit reference
> > counting for the eventpoll struct. Each registered event acquires a
> > reference, and references are released at ep_remove() time.
> > 
> > Additionally, this introduces a new 'dying' flag to prevent races
> > between the EP file close() and the monitored file close().
> > ep_eventpoll_release() marks, under f_lock spinlock, each epitem as
> > before
> 
> "as dying"?
> 
> > removing it, while EP file close() does not touch dying epitems.
> 
> The need for this dying flag is somewhat unclear to me.  I mean, if we
> have refcounting done correctly, why the need for this flag?  Some
> additional description of the dynamics would be helpful.
> 
> Methinks this flag is here to cope with the delayed freeing via
> hlist_del_rcu(), but that's a guess?

First things first, thanks for the feedback!

Both ep_clear_and_put() and eventpoll_release_file() can release the
eventpoll struct. The latter must acquire the file->f_lock spinlock to
reach the eventpoll struct pointer. Callers of __ep_remove() need to
acquire ep->mtx first, so eventpoll_release_file() must release the
spinlock after fetching the pointer and before acquiring the mutex.
Meanwhile, without the 'dying' flag, ep_clear_and_put() could kick in,
possibly on a different CPU, drop all the ep references, and free the
struct.

An alternative to the 'dying' flag would be removing the following loop
from ep_clear_and_put():

	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
		epi = rb_entry(rbp, struct epitem, rbn);
		ep_remove_safe(ep, epi);
		cond_resched();
	}

so that ep_clear_and_put() would not release all the ep references
anymore. That option has the downside of keeping the ep struct alive
for an unbounded time after ep_clear_and_put(). A previous revision of
this patch implemented a similar behavior, but Eric Biggers noted it
could hurt some users:

https://lore.kernel.org/linux-fsdevel/Y3%2F4FW4mqY3fWRfU@sol.localdomain/

Please let me know if the above is clear enough.

> > The eventpoll struct is released by whoever - among EP file close()
> > and the monitored file close() - drops its last reference.
> > 
> > With all the above in place, we can drop the epmutex usage at
> > disposal time.
> > 
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear
> > from the topmost offenders in the perf report, and the measured
> > connections/rate grows by ~60%.
> > 
> > To make the change more readable this additionally renames ep_free()
> > to ep_clear_and_put(), and moves the actual memory cleanup in a
> > separate ep_free() helper.
> > 
> > ...
> > 
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > 
> > ...
> > 
> > +	free_uid(ep->user);
> > +	wakeup_source_unregister(ep->ws);
> > +	kfree(ep);
> > +}
> > +
> >  /*
> >   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
> >   * all the associated resources. Must be called with "mtx" held.
> > + * If the dying flag is set, do the removal only if force is true.
> 
> This comment describes "what" the code does, which is obvious from the
> code anyway.
> It's better if comments describe "why" the code does what
> it does.

What about appending the following?

"""
This prevents ep_clear_and_put() from dropping all the ep references
while running concurrently with eventpoll_release_file().
"""

(I'll keep the 'what' part to hopefully make the 'why' more clear)

> > + * Returns true if the eventpoll can be disposed.
> >   */
> > -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> > +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
> >  {
> >  	struct file *file = epi->ffd.file;
> >  	struct epitems_head *to_free;
> > 
> > ...
> > 
> >  	/*
> > -	 * We don't want to get "file->f_lock" because it is not
> > -	 * necessary. It is not necessary because we're in the "struct file"
> > -	 * cleanup path, and this means that no one is using this file anymore.
> > -	 * So, for example, epoll_ctl() cannot hit here since if we reach this
> > -	 * point, the file counter already went to zero and fget() would fail.
> > -	 * The only hit might come from ep_free() but by holding the mutex
> > -	 * will correctly serialize the operation. We do need to acquire
> > -	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> > -	 * from anywhere but ep_free().
> > -	 *
> > -	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
> > +	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
> 
> s/cleat/clear/
> 
> > +	 * touching the epitems list before eventpoll_release_file() can access
> > +	 * the ep->mtx.
> >  	 */
> > -	mutex_lock(&epmutex);
> > -	if (unlikely(!file->f_ep)) {
> > -		mutex_unlock(&epmutex);
> > -		return;
> > -	}
> > -	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> > +again:
> > +	spin_lock(&file->f_lock);
> > +	if (file->f_ep && file->f_ep->first) {
> > +		/* detach from ep tree */
> 
> Comment appears to be misplaced - the following code doesn't detach
> anything?

Indeed. This is a left-over from a previous revision. It can be dropped.
I have a process question: I understand this is queued for the
mm-nonmm-unstable branch. Should I send a v5 with the above comment
changes, an incremental patch, or something completely different?

Thanks!

Paolo