On 10/7/19 6:54 AM, Roman Penyaev wrote:
> On 2019-10-03 18:13, Jason Baron wrote:
>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@xxxxxx> wrote:
>>>>
>>>>> From: Heiher <r@xxxxxx>
>>>>>
>>>>> Take the case where we have:
>>>>>
>>>>>         t0
>>>>>          | (ew)
>>>>>         e0
>>>>>          | (et)
>>>>>         e1
>>>>>          | (lt)
>>>>>         s0
>>>>>
>>>>> t0: thread 0
>>>>> e0: epoll fd 0
>>>>> e1: epoll fd 1
>>>>> s0: socket fd 0
>>>>> ew: epoll_wait
>>>>> et: edge-trigger
>>>>> lt: level-trigger
>>>>>
>>>>> We only need to wake up nested epoll fds if something has been
>>>>> queued to the overflow list, since ep_poll() traverses the rdllist
>>>>> during the recursive poll and thus events on the overflow list may
>>>>> not be visible yet.
>>>>>
>>>>> Test code:
>>>>
>>>> Looks sane to me.  Do you have any performance testing results which
>>>> show a benefit?
>>>>
>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>> Roman, would you please have time to review this?
>>>
>>> So here is my observation: the current patch does not fix the described
>>> problem (double wakeup) for the case when the new event comes exactly
>>> to the ->ovflist.  According to the patch this is the desired intention:
>>>
>>>     /*
>>>      * We only need to wakeup nested epoll fds if something has been
>>>      * queued to the overflow list, since the ep_poll() traverses the
>>>      * rdllist during recursive poll and thus events on the overflow
>>>      * list may not be visible yet.
>>>      */
>>>     if (nepi != NULL)
>>>         pwake++;
>>>
>>>     ....
>>>
>>>     if (pwake == 2)
>>>         ep_poll_safewake(&ep->poll_wait);
>>>
>>> but this actually means that we repeat the same behavior (double wakeup),
>>> only now for the case when the event comes to the ->ovflist.
>>>
>>> How to reproduce?  It can be done (ok, not so easily, but it is possible
>>> to try): to the given userspace test we need to add one more socket and
>>> immediately fire the event:
>>>
>>>     e.events = EPOLLIN;
>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>         goto out;
>>>
>>>     /*
>>>      * Signal any fd to let epoll_wait() call ep_scan_ready_list()
>>>      * in order to "catch" it there and add the new event to ->ovflist.
>>>      */
>>>     if (write(s2fd[1], "w", 1) != 1)
>>>         goto out;
>>>
>>> That is done in order to let the following epoll_wait() call invoke
>>> ep_scan_ready_list(), where we can "catch" and insert the new event
>>> exactly into the ->ovflist.  In order to insert the event into exactly
>>> the right list I introduce an artificial delay.
>>>
>>> The modified test and kernel patch are below.  Here is the output of the
>>> testing tool with some debug lines from the kernel:
>>>
>>>     # ~/devel/test/edge-bug
>>>     [ 59.263178] ### sleep 2
>>>     >> write to sock
>>>     [ 61.318243] ### done sleep
>>>     [ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
>>>     [ 61.321204] ### sleep 2
>>>     [ 63.398325] ### done sleep
>>>     error: What?! Again?!
>>>
>>> The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>> wanted to achieve, so eventually ep_poll_safewake() is called again,
>>> which leads to the double wakeup.
>>>
>>> In my opinion the current patch as it is should be dropped; it does not
>>> fix the described problem but just hides it.
>>>
>>> --
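
For reference while reading the discussion, here is a minimal userspace
sketch of the nested setup in the diagram above (t0 calls epoll_wait() on
e0, e0 watches e1 edge-triggered, e1 watches s0 level-triggered).  This is
only an assumed illustration, not the test program referenced in the
thread; the names e0, e1 and s0 simply follow the diagram.

    /* Illustrative sketch only -- not the thread's actual test code. */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int s0[2], e0, e1;
        struct epoll_event e = { 0 };
        struct epoll_event out;
        int n;

        /* s0: a socket pair standing in for "socket fd 0" */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, s0) < 0)
            return 1;

        e0 = epoll_create1(0);  /* outer epoll fd */
        e1 = epoll_create1(0);  /* inner (nested) epoll fd */

        /* e1 watches s0[0] level-triggered (lt) */
        e.events = EPOLLIN;
        e.data.fd = s0[0];
        epoll_ctl(e1, EPOLL_CTL_ADD, s0[0], &e);

        /* e0 watches e1 edge-triggered (et) */
        e.events = EPOLLIN | EPOLLET;
        e.data.fd = e1;
        epoll_ctl(e0, EPOLL_CTL_ADD, e1, &e);

        /* Make s0 readable; the inner epoll fd becomes ready and one
         * edge-triggered wakeup should propagate to e0. */
        write(s0[1], "w", 1);

        /* Since e1 was added to e0 with EPOLLET and s0 is never drained,
         * the second wait should block until the timeout; the bug being
         * discussed is a spurious second wakeup here. */
        n = epoll_wait(e0, &out, 1, 2000);
        printf("first wait:  %d event(s)\n", n);
        n = epoll_wait(e0, &out, 1, 2000);
        printf("second wait: %d event(s) (0 expected)\n", n);

        close(e0); close(e1); close(s0[0]); close(s0[1]);
        return 0;
    }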
> Hi Jason,
>
>> Yes, there are 2 wakeups in the test case you describe, but if the
>> second event (write to s1fd) gets queued after the first call to
>> epoll_wait(), we are going to get 2 wakeups anyway.
>
> Yes, exactly.  For this reason I print out the number of events observed
> on the first wait; there should be 1 (rdllist) and 1 (ovflist), otherwise
> this is another case, when the second event comes exactly after the first
> wait, which is a legitimate wakeup.
>
>> So yes, there may be a slightly bigger window with this patch for
>> 2 wakeups, but it's small and I tried to be conservative with the
>> patch - I'd rather get an occasional 2nd wakeup than miss one.  Trying
>> to debug missing wakeups isn't always fun...
>>
>> That said, the reason for propagating events that end up on the overflow
>> list was to prevent the race of the wakee not seeing events because they
>> were still on the overflow list.  In the testcase, imagine there was a
>> thread doing epoll_wait() on efd[0], and then a write happens on s1fd.
>> I thought it was possible then that a 2nd thread doing epoll_wait() on
>> efd[1] wakes up, checks efd[0] and sees no events because they are
>> still potentially on the overflow list.  However, I think that case is
>> not possible, because the thread doing epoll_wait() on efd[0] is going
>> to hold the ep->mtx, and thus when the thread wakes up on efd[1], it's
>> going to be ordered, because it's also grabbing the ep->mtx associated
>> with efd[0].
>>
>> So I think it's safe to do the following if we want to go further than
>> the proposed patch, which is what you suggested earlier in the thread
>> (minus keeping the wakeup on ep->wq).
>
> Then I do not understand why we need to keep the ep->wq wakeup.
> @wq and @poll_wait are almost the same with only one difference:
> @wq is used when you sleep on it inside epoll_wait(), and the other
> is used when you nest an epoll fd inside another epoll fd.  Either you
> wake both up or you don't do it at all.
>
> ep_poll_callback() does the wakeup explicitly, ep_insert() and ep_modify()
> do the wakeup explicitly, so what are the cases when we need to do wakeups
> from ep_scan_ready_list()?

Hi Roman,

So the reason I was saying not to drop the ep->wq wakeup was that I was
thinking about a usecase where you have multiple threads, say thread A and
thread B, doing epoll_wait() on the same epfd.  The threads both call
epoll_wait() and are added as exclusive waiters to ep->wq.  Now a bunch of
events happen and thread A is woken up.  However, thread A may only process
a subset of the events due to its 'maxevents' parameter.  In that case, I
was thinking that the wakeup on ep->wq might be helpful, because in the
absence of subsequent events, thread B can then start processing the rest,
instead of waiting for the next event to be queued.

However, I was thinking about the state of things before:

86c0517 fs/epoll: deal with wait_queue only once

Before that patch, thread A would have been removed from ep->wq before the
wakeup call, thus waking up thread B.  However, now that thread A stays on
the queue during the call to ep_send_events(), I believe the wakeup would
only target thread A, which doesn't help since it's already checking for
events.  So given the state of things, I think you are right that it's not
needed.  However, I wonder if not removing from ep->wq affects the
multi-threaded case I described.  It's been around since 5.0, so probably
not, but it would be a more subtle performance difference.
Thanks,

-Jason

> I would still remove the whole branch:
>
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -671,7 +671,6 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>                               void *priv, int depth, bool ep_locked)
>  {
>         __poll_t res;
> -       int pwake = 0;
>         struct epitem *epi, *nepi;
>         LIST_HEAD(txlist);
>
> @@ -738,26 +737,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>          */
>         list_splice(&txlist, &ep->rdllist);
>         __pm_relax(ep->ws);
> -
> -       if (!list_empty(&ep->rdllist)) {
> -               /*
> -                * Wake up (if active) both the eventpoll wait list and
> -                * the ->poll() wait list (delayed after we release the lock).
> -                */
> -               if (waitqueue_active(&ep->wq))
> -                       wake_up(&ep->wq);
> -               if (waitqueue_active(&ep->poll_wait))
> -                       pwake++;
> -       }
>         write_unlock_irq(&ep->lock);
>
>         if (!ep_locked)
>                 mutex_unlock(&ep->mtx);
>
> -       /* We have to call this outside the lock */
> -       if (pwake)
> -               ep_poll_safewake(&ep->poll_wait);
> -
>         return res;
>  }
>
> --
> Roman
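
As an aside, the multi-threaded case Jason describes above (thread A and
thread B both blocked in epoll_wait() on the same epfd, each limited by a
small maxevents) can be sketched roughly as follows.  This is an assumed
illustration of the scenario only, not code from the thread; MAXEV, the
thread names, and the elided fd setup are made up.

    /* Sketch of the scenario only: two threads share one epfd, each asking
     * for at most MAXEV events per epoll_wait() call.  If a burst of more
     * than MAXEV events is queued at once, thread A returns with MAXEV of
     * them; whether thread B is woken for the remainder without waiting
     * for a new event is exactly the ep->wq question discussed above. */
    #include <sys/epoll.h>
    #include <pthread.h>
    #include <stdio.h>

    #define MAXEV 4         /* deliberately small 'maxevents' */

    static int epfd;

    static void *waiter(void *name)
    {
        struct epoll_event evs[MAXEV];

        for (;;) {
            int n = epoll_wait(epfd, evs, MAXEV, -1);
            printf("%s: got %d event(s)\n", (char *)name, n);
            /* ... consume the n events here ... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        epfd = epoll_create1(0);
        /* ... EPOLL_CTL_ADD a number of fds here, then generate a burst
         * of events larger than MAXEV ... */

        pthread_create(&a, NULL, waiter, "thread A");
        pthread_create(&b, NULL, waiter, "thread B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }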