On Tue, 2023-03-07 at 13:30 -0800, Andrew Morton wrote:
> On Tue, 7 Mar 2023 19:46:37 +0100 Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
> 
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> > 
> >   83.17% 0.25% nginx [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
> >   [...]
> >   |--66.96%--__fput
> >      |--60.04%--eventpoll_release_file
> >         |--58.41%--__mutex_lock.isra.6
> >            |--56.56%--osq_lock
> > 
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> > 
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> > 
> > To reduce the contention this patch introduces explicit reference
> > counting for the eventpoll struct. Each registered event acquires a
> > reference, and references are released at ep_remove() time.
> > 
> > Additionally, this introduces a new 'dying' flag to prevent races
> > between the EP file close() and the monitored file close().
> > ep_eventpoll_release() marks, under f_lock spinlock, each epitem as
> > before
> 
> "as dying"?
> 
> > removing it, while EP file close() does not touch dying epitems.
> 
> The need for this dying flag is somewhat unclear to me.  I mean, if we
> have refcounting done correctly, why the need for this flag?  Some
> additional description of the dynamics would be helpful.
> 
> Methinks this flag is here to cope with the delayed freeing via
> hlist_del_rcu(), but that's a guess?

First things first, thanks for the feedback!

Both ep_clear_and_put() and eventpoll_release_file() can release the
eventpoll struct. The latter must acquire the file->f_lock spinlock to
reach the eventpoll struct pointer. Callers of __ep_remove() need to
acquire ep->mtx first, so eventpoll_release_file() must release the
spinlock after fetching the pointer and before acquiring the mutex.
Meanwhile, without the 'dying' flag, ep_clear_and_put() could kick in,
possibly on a different CPU, drop all the ep references, and free the
struct.

An alternative to the 'dying' flag would be removing the following loop
from ep_clear_and_put():

	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
		epi = rb_entry(rbp, struct epitem, rbn);
		ep_remove_safe(ep, epi);
		cond_resched();
	}

so that ep_clear_and_put() would not release all the ep references
anymore. That option has the downside of keeping the ep struct alive
for an unbounded time after ep_clear_and_put(). A previous revision of
this patch implemented a similar behavior, but Eric Biggers noted it
could hurt some users:

https://lore.kernel.org/linux-fsdevel/Y3%2F4FW4mqY3fWRfU@sol.localdomain/

Please let me know if the above is clear enough.

> > The eventpoll struct is released by whoever - among EP file close()
> > and the monitored file close() - drops its last reference.
> > 
> > With all the above in place, we can drop the epmutex usage at
> > disposal time.
> > 
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear
> > from the topmost offenders in the perf report, and the measured
> > connections/rate grows by ~60%.
> > 
> > To make the change more readable this additionally renames ep_free()
> > to ep_clear_and_put(), and moves the actual memory cleanup in a
> > separate ep_free() helper.
> > 
> > ...
> > 
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > 
> > ...
> > 
> > +	free_uid(ep->user);
> > +	wakeup_source_unregister(ep->ws);
> > +	kfree(ep);
> > +}
> > +
> >  /*
> >   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
> >   * all the associated resources. Must be called with "mtx" held.
> > + * If the dying flag is set, do the removal only if force is true.
> 
> This comment describes "what" the code does, which is obvious from the
> code anyway.
> It's better if comments describe "why" the code does what
> it does.

What about appending the following?

"""
This prevents ep_clear_and_put() from dropping all the ep references
while running concurrently with eventpoll_release_file().
"""

(I'll keep the 'what' part to hopefully make the 'why' more clear)

> > + * Returns true if the eventpoll can be disposed.
> >   */
> > -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> > +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
> >  {
> >  	struct file *file = epi->ffd.file;
> >  	struct epitems_head *to_free;
> > 
> > ...
> > 
> >  	/*
> > -	 * We don't want to get "file->f_lock" because it is not
> > -	 * necessary. It is not necessary because we're in the "struct file"
> > -	 * cleanup path, and this means that no one is using this file anymore.
> > -	 * So, for example, epoll_ctl() cannot hit here since if we reach this
> > -	 * point, the file counter already went to zero and fget() would fail.
> > -	 * The only hit might come from ep_free() but by holding the mutex
> > -	 * will correctly serialize the operation. We do need to acquire
> > -	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> > -	 * from anywhere but ep_free().
> > -	 *
> > -	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
> > +	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
> 
> s/cleat/clear/
> 
> > +	 * touching the epitems list before eventpoll_release_file() can access
> > +	 * the ep->mtx.
> >  	 */
> > -	mutex_lock(&epmutex);
> > -	if (unlikely(!file->f_ep)) {
> > -		mutex_unlock(&epmutex);
> > -		return;
> > -	}
> > -	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> > +again:
> > +	spin_lock(&file->f_lock);
> > +	if (file->f_ep && file->f_ep->first) {
> > +		/* detach from ep tree */
> 
> Comment appears to be misplaced - the following code doesn't detach
> anything?

Indeed. This is a left-over from a previous revision. It can be dropped.
I have a process question: I understand this is queued for the
mm-nonmm-unstable branch. Should I send a v5 with the above comment
changes, an incremental patch, or something completely different?

Thanks!

Paolo