On Wed, 15 Jun 2022 14:24:23 -0700 Benjamin Segall <bsegall@xxxxxxxxxx> wrote:

> If a process is killed or otherwise exits while having active network
> connections and many threads waiting on epoll_wait, the threads will all
> be woken immediately, but not removed from ep->wq. Then when network
> traffic scans ep->wq in wake_up, every wakeup attempt will fail, and
> will not remove the entries from the list.
>
> This means that the cost of the wakeup attempt is far higher than usual,
> does not decrease, and this also competes with the dying threads trying
> to actually make progress and remove themselves from the wq.
>
> Handle this by removing visited epoll wq entries unconditionally, rather
> than only when the wakeup succeeds - the structure of ep_poll means that
> the only potential loss is the timed_out->eavail heuristic, which now
> can race and result in a redundant ep_send_events attempt. (But only
> when incoming data and a timeout actually race, not on every timeout)
>

Thanks. I added people from 412895f03cbf96 ("epoll: atomically remove
wait entry on wake up") to cc. Hopefully someone there can help review
and maybe test this.

>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e2daa940ebce..8b56b94e2f56 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1745,10 +1745,25 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
>  	ktime_get_ts64(&now);
>  	*to = timespec64_add_safe(now, *to);
>  	return to;
>  }
>
> +/*
> + * autoremove_wake_function, but remove even on failure to wake up, because we
> + * know that default_wake_function/ttwu will only fail if the thread is already
> + * woken, and in that case the ep_poll loop will remove the entry anyways, not
> + * try to reuse it.
> + */
> +static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
> +				       unsigned int mode, int sync, void *key)
> +{
> +	int ret = default_wake_function(wq_entry, mode, sync, key);
> +
> +	list_del_init(&wq_entry->entry);
> +	return ret;
> +}
> +
>  /**
>   * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
>   *           event buffer.
>   *
>   * @ep: Pointer to the eventpoll context.
> @@ -1826,12 +1841,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  		 * chance to harvest new event. Otherwise wakeup can be
>  		 * lost. This is also good performance-wise, because on
>  		 * normal wakeup path no need to call __remove_wait_queue()
>  		 * explicitly, thus ep->lock is not taken, which halts the
>  		 * event delivery.
> +		 *
> +		 * In fact, we now use an even more aggressive function that
> +		 * unconditionally removes, because we don't reuse the wait
> +		 * entry between loop iterations. This lets us also avoid the
> +		 * performance issue if a process is killed, causing all of its
> +		 * threads to wake up without being removed normally.
>  		 */
>  		init_wait(&wait);
> +		wait.func = ep_autoremove_wake_function;
>
>  		write_lock_irq(&ep->lock);
>  		/*
>  		 * Barrierless variant, waitqueue_active() is called under
>  		 * the same lock on wakeup ep_poll_callback() side, so it
> --
> 2.36.1.476.g0c4daa206d-goog
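
For anyone reviewing who wants the comparison handy: the stock helper only
unlinks the wait entry when the wakeup actually succeeds, which is why the
already-woken (dying) waiters keep getting rescanned on every wake_up. This
is a paraphrase of autoremove_wake_function() from kernel/sched/wait.c as I
remember it, so double-check against your tree:

	int autoremove_wake_function(struct wait_queue_entry *wq_entry,
				     unsigned mode, int sync, void *key)
	{
		int ret = default_wake_function(wq_entry, mode, sync, key);

		/* Only unlink if the task was actually woken by this call. */
		if (ret)
			list_del_init(&wq_entry->entry);

		return ret;
	}

The patch's ep_autoremove_wake_function() is the same thing minus the
"if (ret)", so a wakeup that finds the task already running (e.g. woken by
the fatal signal) still unlinks the entry the first time ep_poll_callback()
visits it, rather than leaving it on ep->wq for every later network wakeup
to walk past.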