Hi,

On Tue, Oct 8, 2019 at 3:10 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>
> On 2019-10-07 20:43, Jason Baron wrote:
> > On 10/7/19 2:30 PM, Roman Penyaev wrote:
> >> On 2019-10-07 18:42, Jason Baron wrote:
> >>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
> >>>> On 2019-10-03 18:13, Jason Baron wrote:
> >>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
> >>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
> >>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@xxxxxx> wrote:
> >>>>>>>
> >>>>>>>> From: Heiher <r@xxxxxx>
> >>>>>>>>
> >>>>>>>> Take the case where we have:
> >>>>>>>>
> >>>>>>>>        t0
> >>>>>>>>         | (ew)
> >>>>>>>>        e0
> >>>>>>>>         | (et)
> >>>>>>>>        e1
> >>>>>>>>         | (lt)
> >>>>>>>>        s0
> >>>>>>>>
> >>>>>>>> t0: thread 0
> >>>>>>>> e0: epoll fd 0
> >>>>>>>> e1: epoll fd 1
> >>>>>>>> s0: socket fd 0
> >>>>>>>> ew: epoll_wait
> >>>>>>>> et: edge-trigger
> >>>>>>>> lt: level-trigger
> >>>>>>>>
> >>>>>>>> We only need to wake up nested epoll fds if something has been
> >>>>>>>> queued to the overflow list, since ep_poll() traverses the
> >>>>>>>> rdllist during recursive poll and thus events on the overflow
> >>>>>>>> list may not be visible yet.
> >>>>>>>>
> >>>>>>>> Test code:
> >>>>>>>
> >>>>>>> Looks sane to me.  Do you have any performance testing results
> >>>>>>> which show a benefit?
> >>>>>>>
> >>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
> >>>>>>> Roman, would you please have time to review this?
> >>>>>>
> >>>>>> So here is my observation: the current patch does not fix the
> >>>>>> described problem (double wakeup) for the case when a new event
> >>>>>> comes exactly to the ->ovflist.  According to the patch this is
> >>>>>> the desired intention:
> >>>>>>
> >>>>>>     /*
> >>>>>>      * We only need to wakeup nested epoll fds if something has been queued
> >>>>>>      * to the overflow list, since the ep_poll() traverses the rdllist
> >>>>>>      * during recursive poll and thus events on the overflow list may not be
> >>>>>>      * visible yet.
> >>>>>>      */
> >>>>>>     if (nepi != NULL)
> >>>>>>         pwake++;
> >>>>>>
> >>>>>>     ....
> >>>>>>
> >>>>>>     if (pwake == 2)
> >>>>>>         ep_poll_safewake(&ep->poll_wait);
> >>>>>>
> >>>>>> but this actually means that we repeat the same behavior (double
> >>>>>> wakeup), only now for the case when the event comes to the
> >>>>>> ->ovflist.
> >>>>>>
> >>>>>> How to reproduce?  It can be done (ok, not so easily, but it is
> >>>>>> possible to try): to the given userspace test we need to add one
> >>>>>> more socket and immediately fire the event:
> >>>>>>
> >>>>>>     e.events = EPOLLIN;
> >>>>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
> >>>>>>         goto out;
> >>>>>>
> >>>>>>     /*
> >>>>>>      * Signal any fd to let epoll_wait() call ep_scan_ready_list()
> >>>>>>      * in order to "catch" it there and add the new event to ->ovflist.
> >>>>>>      */
> >>>>>>     if (write(s2fd[1], "w", 1) != 1)
> >>>>>>         goto out;
> >>>>>>
> >>>>>> That is done in order to let the following epoll_wait() call
> >>>>>> invoke ep_scan_ready_list(), where we can "catch" the new event
> >>>>>> and insert it exactly into the ->ovflist.  In order to insert the
> >>>>>> event into the correct list I introduce an artificial delay.
> >>>>>>
> >>>>>> The modified test and kernel patch are below.
> >>>>>> Here is the output of the testing tool with some debug lines
> >>>>>> from the kernel:
> >>>>>>
> >>>>>>    # ~/devel/test/edge-bug
> >>>>>>    [   59.263178] ### sleep 2
> >>>>>>    >> write to sock
> >>>>>>    [   61.318243] ### done sleep
> >>>>>>    [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
> >>>>>>    [   61.321204] ### sleep 2
> >>>>>>    [   63.398325] ### done sleep
> >>>>>>    error: What?! Again?!
> >>>>>>
> >>>>>> The first epoll_wait() call (ep_scan_ready_list()) observes 2
> >>>>>> events (see the "!!!!!!!!!!! ep_poll_safewake" output line),
> >>>>>> exactly what we wanted to achieve, so eventually
> >>>>>> ep_poll_safewake() is called again, which leads to a double
> >>>>>> wakeup.
> >>>>>>
> >>>>>> In my opinion the current patch as it is should be dropped: it
> >>>>>> does not fix the described problem but just hides it.
> >>>>>>
> >>>>>> --
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>>>
> >>>>> Yes, there are 2 wakeups in the test case you describe, but if the
> >>>>> second event (write to s1fd) gets queued after the first call to
> >>>>> epoll_wait(), we are going to get 2 wakeups anyway.
> >>>>
> >>>> Yes, exactly.  For this reason I print out the number of events
> >>>> observed on the first wait: there should be 1 (rdllist) and
> >>>> 1 (ovflist), otherwise this is another case, when the second event
> >>>> comes exactly after the first wait, which is a legitimate wakeup.
> >>>>
> >>>>> So yes, there may be a slightly bigger window with this patch for
> >>>>> 2 wakeups, but it's small, and I tried to be conservative with the
> >>>>> patch - I'd rather get an occasional 2nd wakeup than miss one.
> >>>>> Trying to debug missing wakeups isn't always fun...
> >>>>>
> >>>>> That said, the reason for propagating events that end up on the
> >>>>> overflow list was to prevent the race of the wakee not seeing
> >>>>> events because they were still on the overflow list.  In the
> >>>>> testcase, imagine there was a thread doing epoll_wait() on efd[0],
> >>>>> and then a write happens on s1fd.  I thought it was possible then
> >>>>> that a 2nd thread doing epoll_wait() on efd[1] wakes up, checks
> >>>>> efd[0] and sees no events, because they are still potentially on
> >>>>> the overflow list.  However, I think that case is not possible,
> >>>>> because the thread doing epoll_wait() on efd[0] is going to hold
> >>>>> the ep->mtx, and thus when the thread wakes up on efd[1], it's
> >>>>> going to be ordered, because it's also grabbing the ep->mtx
> >>>>> associated with efd[0].
> >>>>>
> >>>>> So I think it's safe to do the following if we want to go further
> >>>>> than the proposed patch, which is what you suggested earlier in
> >>>>> the thread (minus keeping the wakeup on ep->wq).
> >>>>
> >>>> Then I do not understand why we need to keep the ep->wq wakeup.
> >>>> @wq and @poll_wait are almost the same, with only one difference:
> >>>> @wq is used when you sleep on it inside epoll_wait(), and the other
> >>>> is used when you nest one epoll fd inside another.  Either you wake
> >>>> both up or you don't do it at all.
> >>>>
> >>>> ep_poll_callback() does the wakeup explicitly, ep_insert() and
> >>>> ep_modify() do the wakeup explicitly, so what are the cases when we
> >>>> need to do wakeups from ep_scan_ready_list()?
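(The test program itself has been trimmed from the quotes above, so as a
point of reference here is a rough, untested sketch of the nested-epoll
topology being discussed; the fd names, timeouts and messages are
illustrative rather than taken from the actual test.  A single write
should produce exactly one edge-triggered wakeup on the outer epoll fd;
a second wakeup with no new data is the spurious case.)

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
	int efd[2], sfd[2];
	struct epoll_event e;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
		return 1;

	efd[0] = epoll_create1(0);	/* outer epoll fd, waited on by the thread */
	efd[1] = epoll_create1(0);	/* inner epoll fd, nested into efd[0] */
	if (efd[0] < 0 || efd[1] < 0)
		return 1;

	/* Level-triggered interest in the read side of the socketpair. */
	e.events = EPOLLIN;
	e.data.fd = sfd[0];
	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
		return 1;

	/* Edge-triggered interest in the inner epoll fd. */
	e.events = EPOLLIN | EPOLLET;
	e.data.fd = efd[1];
	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
		return 1;

	/* One event: a single byte written to the socketpair. */
	if (write(sfd[1], "w", 1) != 1)
		return 1;

	/* The first wait must see the edge-triggered event exactly once... */
	if (epoll_wait(efd[0], &e, 1, 2000) != 1)
		return 1;

	/* ...so a second wait, with no new data, should time out. */
	if (epoll_wait(efd[0], &e, 1, 2000) != 0) {
		fprintf(stderr, "unexpected second wakeup\n");
		return 1;
	}

	return 0;
}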
> >>>
> >>> Hi Roman,
> >>>
> >>> So the reason I was saying not to drop the ep->wq wakeup was that I
> >>> was thinking about a usecase where you have multiple threads, say
> >>> thread A and thread B, which are doing epoll_wait() on the same
> >>> epfd.  Now, the threads both call epoll_wait() and are added as
> >>> exclusive to ep->wq.  Now a bunch of events happen and thread A is
> >>> woken up.  However, thread A may only process a subset of the events
> >>> due to its 'maxevents' parameter.  In that case, I was thinking that
> >>> the wakeup on ep->wq might be helpful, because in the absence of
> >>> subsequent events, thread B can now start processing the rest,
> >>> instead of waiting for the next event to be queued.
> >>>
> >>> However, I was thinking about the state of things before:
> >>> 86c0517 fs/epoll: deal with wait_queue only once
> >>>
> >>> Before that patch, thread A would have been removed from ep->wq
> >>> before the wakeup call, thus waking up thread B.  However, now that
> >>> thread A stays on the queue during the call to ep_send_events(), I
> >>> believe the wakeup would only target thread A, which doesn't help
> >>> since it's already checking for events.  So given the state of
> >>> things, I think you are right in that it's not needed.  However, I
> >>> wonder if not removing from the ep->wq affects the multi-threaded
> >>> case I described.  It's been around since 5.0, so probably not, but
> >>> it would be a more subtle performance difference.
> >>
> >> Now I understand what you mean.  You want to prevent "idling" of
> >> events while thread A is busy with a small portion of them (the
> >> portion is equal to maxevents).  On the next iteration thread A will
> >> pick up the rest, no doubt, but it would be nice to give thread B a
> >> chance to deal with the rest immediately.  Ok, makes sense.
> >
> > Exactly.  I don't believe it's racy as is - but it seems like it
> > would be good to wake up other threads that may be waiting.  That
> > said, this logic was removed as I pointed out.  So I'm not sure we
> > need to tie this change to the current one - but it may be a nice
> > addition.
> >
> >>
> >> But what if we make this wakeup explicit when we have more events to
> >> process?
> >> (Nothing is tested, just a guess.)
> >>
> >> @@ -255,6 +255,7 @@ struct ep_pqueue {
> >>  struct ep_send_events_data {
> >>      int maxevents;
> >>      struct epoll_event __user *events;
> >> +    bool have_more;
> >>      int res;
> >>  };
> >> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
> >>  }
> >>
> >>  static int ep_send_events(struct eventpoll *ep,
> >> -                          struct epoll_event __user *events, int maxevents)
> >> +                          struct epoll_event __user *events, int maxevents,
> >> +                          bool *have_more)
> >>  {
> >> -    struct ep_send_events_data esed;
> >> -
> >> -    esed.maxevents = maxevents;
> >> -    esed.events = events;
> >> +    struct ep_send_events_data esed = {
> >> +        .maxevents = maxevents,
> >> +        .events = events,
> >> +    };
> >>
> >>      ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> >> +    *have_more = esed.have_more;
> >> +
> >>      return esed.res;
> >>  }
> >>
> >> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>  {
> >>      int res = 0, eavail, timed_out = 0;
> >>      u64 slack = 0;
> >> -    bool waiter = false;
> >> +    bool waiter = false, have_more;
> >>      wait_queue_entry_t wait;
> >>      ktime_t expires, *to = NULL;
> >>
> >> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>       * more luck.
> >>       */
> >>      if (!res && eavail &&
> >> -        !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> >> +        !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> >> +        !timed_out)
> >>          goto fetch_events;
> >>
> >>      if (waiter) {
> >> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>          __remove_wait_queue(&ep->wq, &wait);
> >>          spin_unlock_irq(&ep->wq.lock);
> >>      }
> >> +    /*
> >> +     * We were not able to process all the events, so immediately
> >> +     * wake up another waiter.
> >> +     */
> >> +    if (res > 0 && have_more && waitqueue_active(&ep->wq))
> >> +        wake_up(&ep->wq);
> >>
> >>      return res;
> >>  }
> >>
> >
> > Ok, yeah, I like making it explicit.  It looks like you are missing
> > the changes to ep_scan_ready_list(), but I think the general approach
> > makes sense.
>
> Yeah, missed the hunk:
>
> @@ -1719,8 +1704,10 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
>      lockdep_assert_held(&ep->mtx);
>
>      list_for_each_entry_safe(epi, tmp, head, rdllink) {
> -        if (esed->res >= esed->maxevents)
> +        if (esed->res >= esed->maxevents) {
> +            esed->have_more = true;
>              break;
> +        }
>
> > Although I really didn't have a test case that motivated this - it
> > was just me noting this change in behavior while reviewing the
> > current change.
> >
> >> PS. So what do we decide about the original patch?  Remove the whole
> >> branch?
> >>
> >
> > Fwiw, I'm ok with removing the whole branch as you proposed.
>
> Then probably Heiher could resend once more.  Heiher, can you, please?

Sorry for the delay, and thank you for your help!

That's ok, I will re-send the patch and add unit tests to kselftests in
later patches.

> > And I think the above change can go in separately (if we decide we
> > want it).  I don't think they need to be tied together.  I also want
> > to make sure this change gets a full linux-next cycle, so I think it
> > should target 5.5 at this point.
>
> Sure, this explicit ->wq wakeup is a separate patch, which should be
> covered by some benchmarks.  I can try to cook something up in order
> to get numbers.
>
> --
> Roman
>
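As a possible starting point for such numbers, here is a rough, untested
userspace sketch of the case described above: two threads waiting on one
epoll fd whose maxevents is smaller than the number of ready fds.  All
names and constants here are illustrative and not taken from an existing
benchmark.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

#define NR_EVENTFDS	64
#define MAXEVENTS	8	/* deliberately much smaller than NR_EVENTFDS */

static int epfd;
static int efds[NR_EVENTFDS];

static void *waiter(void *arg)
{
	struct epoll_event events[MAXEVENTS];
	int n;

	/* Both threads block here as waiters on ep->wq. */
	n = epoll_wait(epfd, events, MAXEVENTS, 2000);
	printf("thread %ld got %d events\n", (long)arg, n);
	return NULL;
}

int main(void)
{
	struct epoll_event e;
	pthread_t a, b;
	int i;

	epfd = epoll_create1(0);
	if (epfd < 0)
		return 1;

	for (i = 0; i < NR_EVENTFDS; i++) {
		efds[i] = eventfd(0, EFD_NONBLOCK);
		e.events = EPOLLIN;
		e.data.fd = efds[i];
		if (efds[i] < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, efds[i], &e) < 0)
			return 1;
	}

	pthread_create(&a, NULL, waiter, (void *)0L);
	pthread_create(&b, NULL, waiter, (void *)1L);

	sleep(1);	/* crude: give both threads time to block in epoll_wait() */

	/* Make many fds ready at once; a single epoll_wait() cannot drain them. */
	for (i = 0; i < NR_EVENTFDS; i++)
		eventfd_write(efds[i], 1);

	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

Timing how long the second thread takes to return, with and without the
explicit ->wq wakeup, should show whether it picks up the remaining
events promptly or only on a later event or timeout.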
--
Best regards!
Hev
https://hev.cc