On 10/7/19 2:30 PM, Roman Penyaev wrote:
> On 2019-10-07 18:42, Jason Baron wrote:
>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
>>> On 2019-10-03 18:13, Jason Baron wrote:
>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@xxxxxx> wrote:
>>>>>>
>>>>>>> From: Heiher <r@xxxxxx>
>>>>>>>
>>>>>>> Take the case where we have:
>>>>>>>
>>>>>>>        t0
>>>>>>>         | (ew)
>>>>>>>        e0
>>>>>>>         | (et)
>>>>>>>        e1
>>>>>>>         | (lt)
>>>>>>>        s0
>>>>>>>
>>>>>>> t0: thread 0
>>>>>>> e0: epoll fd 0
>>>>>>> e1: epoll fd 1
>>>>>>> s0: socket fd 0
>>>>>>> ew: epoll_wait
>>>>>>> et: edge-trigger
>>>>>>> lt: level-trigger
>>>>>>>
>>>>>>> We only need to wake up nested epoll fds if something has been
>>>>>>> queued to the overflow list, since ep_poll() traverses the rdllist
>>>>>>> during recursive poll and thus events on the overflow list may not
>>>>>>> be visible yet.
>>>>>>>
>>>>>>> Test code:
>>>>>>
>>>>>> Looks sane to me.  Do you have any performance testing results which
>>>>>> show a benefit?
>>>>>>
>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>>>> Roman, would you please have time to review this?
>>>>>
>>>>> So here is my observation: the current patch does not fix the
>>>>> described problem (double wakeup) for the case when the new event
>>>>> comes exactly to the ->ovflist.  According to the patch this is the
>>>>> desired intention:
>>>>>
>>>>>    /*
>>>>>     * We only need to wakeup nested epoll fds if something has been queued
>>>>>     * to the overflow list, since the ep_poll() traverses the rdllist
>>>>>     * during recursive poll and thus events on the overflow list may not be
>>>>>     * visible yet.
>>>>>     */
>>>>>    if (nepi != NULL)
>>>>>        pwake++;
>>>>>
>>>>>    ....
>>>>>
>>>>>    if (pwake == 2)
>>>>>        ep_poll_safewake(&ep->poll_wait);
>>>>>
>>>>> but this actually means that we repeat the same behavior (double
>>>>> wakeup), only now for the case when the event comes to the ->ovflist.
>>>>>
>>>>> How to reproduce?  It can be done (ok, not so easily, but it is
>>>>> possible to try): to the given userspace test we need to add one more
>>>>> socket and immediately fire the event:
>>>>>
>>>>>     e.events = EPOLLIN;
>>>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>>>         goto out;
>>>>>
>>>>>     /*
>>>>>      * Signal any fd to let epoll_wait() call ep_scan_ready_list()
>>>>>      * in order to "catch" it there and add the new event to ->ovflist.
>>>>>      */
>>>>>     if (write(s2fd[1], "w", 1) != 1)
>>>>>         goto out;
>>>>>
>>>>> That is done in order to let the following epoll_wait() call invoke
>>>>> ep_scan_ready_list(), where we can "catch" the new event and insert it
>>>>> exactly into the ->ovflist.  In order to insert the event into the
>>>>> correct list I introduce an artificial delay.
>>>>>
>>>>> The modified test and kernel patch are below.  Here is the output of
>>>>> the testing tool with some debug lines from the kernel:
>>>>>
>>>>>    # ~/devel/test/edge-bug
>>>>>    [   59.263178] ### sleep 2
>>>>>    >> write to sock
>>>>>    [   61.318243] ### done sleep
>>>>>    [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
>>>>>    [   61.321204] ### sleep 2
>>>>>    [   63.398325] ### done sleep
>>>>>    error: What?! Again?!
>>>>>
>>>>> The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>>>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>>>> wanted to achieve, so eventually ep_poll_safewake() is called again,
>>>>> which leads to a double wakeup.
>>>>>
>>>>> In my opinion the current patch as it is should be dropped: it does
>>>>> not fix the described problem but just hides it.
>>>>>
>>>>> --
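
As an aside, for anyone following along without the original test program
at hand, a rough, untested userspace sketch of the t0/e0/e1/s0 topology
from the changelog could look like the following (the fd names and the
lack of error handling are mine; this is not the test code referenced in
this thread):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    int main(void)
    {
        int efd[2], sfd[2];
        struct epoll_event e = { 0 };

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
            return 1;

        efd[0] = epoll_create1(0);            /* e0: what t0 waits on  */
        efd[1] = epoll_create1(0);            /* e1: nested epoll fd   */

        e.events = EPOLLIN;                   /* e1 -> s0, level-trigger */
        e.data.fd = sfd[0];
        epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e);

        e.events = EPOLLIN | EPOLLET;         /* e0 -> e1, edge-trigger  */
        e.data.fd = efd[1];
        epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e);

        write(sfd[1], "w", 1);                /* make s0 readable        */

        /* One event was generated, so ideally only this wait fires ...  */
        printf("first wait:  %d\n", epoll_wait(efd[0], &e, 1, 2000));
        /* ... and this one times out; the discussion above is about
         * whether a second, spurious wakeup can show up here instead.   */
        printf("second wait: %d\n", epoll_wait(efd[0], &e, 1, 2000));

        return 0;
    }
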
>>>
>>> Hi Jason,
>>>
>>>> Yes, there are 2 wakeups in the test case you describe, but if the
>>>> second event (write to s1fd) gets queued after the first call to
>>>> epoll_wait(), we are going to get 2 wakeups anyway.
>>>
>>> Yes, exactly; for this reason I print out the number of events observed
>>> on the first wait: there should be 1 (rdllist) and 1 (ovflist),
>>> otherwise this is another case, when the second event comes exactly
>>> after the first wait, which is a legitimate wakeup.
>>>
>>>> So yes, there may be a slightly bigger window with this patch for 2
>>>> wakeups, but it's small and I tried to be conservative with the
>>>> patch - I'd rather get an occasional 2nd wakeup than miss one.
>>>> Trying to debug missing wakeups isn't always fun...
>>>>
>>>> That said, the reason for propagating events that end up on the
>>>> overflow list was to prevent the race of the wakee not seeing events
>>>> because they were still on the overflow list.  In the testcase,
>>>> imagine there was a thread doing epoll_wait() on efd[0], and then a
>>>> write happens on s1fd.  I thought it was possible then that a 2nd
>>>> thread doing epoll_wait() on efd[1] wakes up, checks efd[0] and sees
>>>> no events because they are still potentially on the overflow list.
>>>> However, I think that case is not possible, because the thread doing
>>>> epoll_wait() on efd[0] is going to hold the ep->mtx, and thus when the
>>>> thread wakes up on efd[1], it's going to have to be ordered because it
>>>> also grabs the ep->mtx associated with efd[0].
>>>>
>>>> So I think it's safe to do the following if we want to go further than
>>>> the proposed patch, which is what you suggested earlier in the thread
>>>> (minus keeping the wakeup on ep->wq).
>>>
>>> Then I do not understand why we need to keep the ep->wq wakeup.
>>> @wq and @poll_wait are almost the same, with only one difference:
>>> @wq is used when you sleep on it inside epoll_wait(), and the other is
>>> used when you nest an epoll fd inside an epoll fd.  Either you wake
>>> both up or you don't do it at all.
>>>
>>> ep_poll_callback() does the wakeup explicitly, ep_insert() and
>>> ep_modify() do the wakeup explicitly, so what are the cases when we
>>> need to do wakeups from ep_scan_ready_list()?
>>
>> Hi Roman,
>>
>> So the reason I was saying not to drop the ep->wq wakeup was that I was
>> thinking about a usecase where you have multiple threads, say thread A
>> and thread B, which are doing epoll_wait() on the same epfd.  Now, the
>> threads both call epoll_wait() and are added as exclusive to ep->wq.
>> Now a bunch of events happen and thread A is woken up.  However, thread
>> A may only process a subset of the events due to its 'maxevents'
>> parameter.  In that case, I was thinking that the wakeup on ep->wq
>> might be helpful, because in the absence of subsequent events, thread B
>> can now start processing the rest, instead of waiting for the next
>> event to be queued.
>>
>> However, I was thinking about the state of things before:
>> 86c0517 fs/epoll: deal with wait_queue only once
>>
>> Before that patch, thread A would have been removed from ep->wq before
>> the wakeup call, thus waking up thread B.  However, now that thread A
>> stays on the queue during the call to ep_send_events(), I believe the
>> wakeup would only target thread A, which doesn't help since it's
>> already checking for events.  So given the state of things, I think you
>> are right in that it's not needed.  However, I wonder if not removing
>> from ep->wq affects the multi-threaded case I described.  It's been
>> around since 5.0, so probably not, but it would be a more subtle
>> performance difference.
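
To make that scenario concrete, a rough, untested userspace sketch of the
two-waiter case described above might look like this (all names here are
made up for illustration; it is not a test case from this thread):

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/epoll.h>

    #define MAXEVENTS 4   /* deliberately smaller than the burst of ready fds */

    static int epfd;

    static void *waiter(void *name)
    {
        struct epoll_event evs[MAXEVENTS];

        for (;;) {
            /* If a burst of fds becomes ready, one thread is woken and
             * drains at most MAXEVENTS of them per call.  The question
             * above is whether the other waiter should also be woken to
             * help with the remainder, or be left waiting for the next
             * event to be queued. */
            int n = epoll_wait(epfd, evs, MAXEVENTS, -1);
            printf("thread %s got %d events\n", (char *)name, n);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        epfd = epoll_create1(0);
        /* ... EPOLL_CTL_ADD a batch of sockets here ... */

        pthread_create(&a, NULL, waiter, "A");
        pthread_create(&b, NULL, waiter, "B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }
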
>
> Now I understand what you mean.  You want to prevent "idling" of events
> while thread A is busy with its small portion of events (the portion is
> equal to maxevents).  On the next iteration thread A will pick up the
> rest, no doubt, but it would be nice to give thread B a chance to deal
> with the rest immediately.  Ok, makes sense.

Exactly, I don't believe it's racy as is - but it seems like it would be
good to wake up other threads that may be waiting.  That said, this logic
was removed as I pointed out.  So I'm not sure we need to tie this change
to the current one - but it may be a nice addition.

> But what if we make this wakeup explicit when we have more events to
> process?  (nothing is tested, just a guess)
>
> @@ -255,6 +255,7 @@ struct ep_pqueue {
>  struct ep_send_events_data {
>      int maxevents;
>      struct epoll_event __user *events;
> +    bool have_more;
>      int res;
>  };
> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
>  }
>
>  static int ep_send_events(struct eventpoll *ep,
> -              struct epoll_event __user *events, int maxevents)
> +              struct epoll_event __user *events, int maxevents,
> +              bool *have_more)
>  {
> -    struct ep_send_events_data esed;
> -
> -    esed.maxevents = maxevents;
> -    esed.events = events;
> +    struct ep_send_events_data esed = {
> +        .maxevents = maxevents,
> +        .events = events,
> +    };
>
>      ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> +    *have_more = esed.have_more;
> +
>      return esed.res;
>  }
>
> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  {
>      int res = 0, eavail, timed_out = 0;
>      u64 slack = 0;
> -    bool waiter = false;
> +    bool waiter = false, have_more;
>      wait_queue_entry_t wait;
>      ktime_t expires, *to = NULL;
>
> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>       * more luck.
>       */
>      if (!res && eavail &&
> -        !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> +        !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> +        !timed_out)
>          goto fetch_events;
>
>      if (waiter) {
> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>          __remove_wait_queue(&ep->wq, &wait);
>          spin_unlock_irq(&ep->wq.lock);
>      }
> +    /*
> +     * We were not able to process all the events, so immediately
> +     * wakeup other waiter.
> +     */
> +    if (res > 0 && have_more && waitqueue_active(&ep->wq))
> +        wake_up(&ep->wq);
>
>      return res;
>  }

Ok, yeah, I like making it explicit.  Looks like you are missing the
changes to ep_scan_ready_list(), but I think the general approach makes
sense.  Although I didn't really have a test case that motivated this -
it was just me noting this change in behavior while reviewing the current
change.
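
In case it helps, a rough, untested sketch of that missing piece - having
ep_send_events_proc() set the new flag when it stops early because of
maxevents - could look roughly like this (heavily abridged, names follow
the diff above; this is not the actual fs/eventpoll.c code, just where
the flag could be set):

    static __poll_t ep_send_events_proc(struct eventpoll *ep,
                                        struct list_head *head, void *priv)
    {
        struct ep_send_events_data *esed = priv;
        struct epitem *epi, *tmp;

        esed->res = 0;
        esed->have_more = false;

        list_for_each_entry_safe(epi, tmp, head, rdllink) {
            if (esed->res >= esed->maxevents) {
                /* Out of room in the user buffer, but not out of ready
                 * items: let ep_poll() know so it can wake another waiter. */
                esed->have_more = true;
                break;
            }

            /* ... poll the item, copy the event to *esed->events and bump
             * esed->res, as the real function does ... */
        }

        return 0;
    }
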
> PS. So what do we decide about the original patch?  Remove the whole
> branch?

FWIW, I'm ok with removing the whole branch as you proposed.  And I think
the above change can go in separately (if we decide we want it).  I don't
think they need to be tied together.

I also want to make sure this change gets a full linux-next cycle, so I
think it should target 5.5 at this point.

Thanks,

-Jason