On 9/5/19 1:27 PM, Roman Penyaev wrote:
> On 2019-09-05 11:56, Heiher wrote:
>> Hi,
>>
>> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@xxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> I created an epoll wakeup test project and listed some possible cases.
>>> Are there any other corner cases that need to be added?
>>>
>>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>>>
>>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@xxxxxx> wrote:
>>> >
>>> > Hi,
>>> >
>>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@xxxxxxxxxx> wrote:
>>> > >
>>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote:
>>> > > > On 2019-09-03 23:08, Jason Baron wrote:
>>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>> > > >>> Hi,
>>> > > >>>
>>> > > >>> This is indeed a bug. (A quick side note: could you please remove
>>> > > >>> efd[1] from your test, because it is not related to the
>>> > > >>> reproduction of the current bug.)
>>> > > >>>
>>> > > >>> Your patch lacks a good description of what exactly you've fixed.
>>> > > >>> Let me reason out loud, and please correct me if I'm wrong; my
>>> > > >>> understanding of epoll internals has become a bit rusty. When
>>> > > >>> epoll fds are nested, an attempt to harvest events (an
>>> > > >>> ep_scan_ready_list() call) produces a second (repeated) event
>>> > > >>> from an internal fd up to an external fd:
>>> > > >>>
>>> > > >>> epoll_wait(efd[0], ...):
>>> > > >>>   ep_send_events():
>>> > > >>>     ep_scan_ready_list(depth=0):
>>> > > >>>       ep_send_events_proc():
>>> > > >>>         ep_item_poll():
>>> > > >>>           ep_scan_ready_list(depth=1):
>>> > > >>>             ep_poll_safewake():
>>> > > >>>               ep_poll_callback()
>>> > > >>>                 list_add_tail(&epi, &epi->rdllist);
>>> > > >>>                 ^^^^^^
>>> > > >>>                 repeated event
>>> > > >>>
>>> > > >>> In your patch you forbid the wakeup for the cases where
>>> > > >>> depth != 0, i.e. for all nested cases. That seems clear. But what
>>> > > >>> if we go further and remove the whole chunk, which seems
>>> > > >>> excessive:
>>> > > >>>
>>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>> > > >>>
>>> > > >>> -
>>> > > >>> -        if (!list_empty(&ep->rdllist)) {
>>> > > >>> -                /*
>>> > > >>> -                 * Wake up (if active) both the eventpoll wait list and
>>> > > >>> -                 * the ->poll() wait list (delayed after we release the lock).
>>> > > >>> -                 */
>>> > > >>> -                if (waitqueue_active(&ep->wq))
>>> > > >>> -                        wake_up(&ep->wq);
>>> > > >>> -                if (waitqueue_active(&ep->poll_wait))
>>> > > >>> -                        pwake++;
>>> > > >>> -        }
>>> > > >>>          write_unlock_irq(&ep->lock);
>>> > > >>>
>>> > > >>>          if (!ep_locked)
>>> > > >>>                  mutex_unlock(&ep->mtx);
>>> > > >>>
>>> > > >>> -        /* We have to call this outside the lock */
>>> > > >>> -        if (pwake)
>>> > > >>> -                ep_poll_safewake(&ep->poll_wait);
>>> > > >>>
>>> > > >>> I reason like this: by the time we've reached the point of
>>> > > >>> scanning events for readiness, all wakeups from
>>> > > >>> ep_poll_callback() have already fired and new events have already
>>> > > >>> been accounted in the ready list (ep_poll_callback() calls the
>>> > > >>> same ep_poll_safewake()). Here, frankly, I'm not 100% sure and am
>>> > > >>> probably missing some corner cases.
>>> > > >>>
>>> > > >>> Thoughts?
>>> > > >>
>>> > > >> So the 'wake_up(&ep->wq);' part, I think, is about waking up other
>>> > > >> threads that may be waiting in epoll_wait().
>>> > > >> For example, there may
>>> > > >> be multiple threads doing epoll_wait() on the same epoll fd, and
>>> > > >> the logic above seems to say thread 1 may have processed, say, N
>>> > > >> events and is now going to go off to work on those, so let's wake
>>> > > >> up thread 2 now to handle the next chunk.
>>> > > >
>>> > > > Not quite. The thread which calls ep_scan_ready_list() processes
>>> > > > all the events, and while processing them, removes them one by one
>>> > > > from the ready list. But if the event mask is !0 and the event
>>> > > > belongs to a level-triggered descriptor (say, the default mode), it
>>> > > > tails the event back onto the list (because we are in level mode,
>>> > > > so the event should stay there). So at the end of this traversal
>>> > > > loop the ready list is likely not empty, and if so, wakeup is
>>> > > > called again for nested epoll fds. But those nested epoll fds
>>> > > > should already have gotten all the notifications from the main
>>> > > > event callback ep_poll_callback(), regardless of which thread
>>> > > > traverses the events.
>>> > > >
>>> > > > I suppose this logic has existed for decades, since Davide (the
>>> > > > author) was reshuffling the code here and there.
>>> > > >
>>> > > > But I do not feel confident enough to state that this extra wakeup
>>> > > > is bogus; I just have a gut feeling that it looks excessive.
>>> > >
>>> > > Note that I was talking about the wakeup done on ep->wq, not
>>> > > ep->poll_wait. The path that I'm concerned about is: let's say there
>>> > > are N events queued on the ready list. A thread that was woken up in
>>> > > epoll_wait() may decide to process only, say, N/2 of them. Then it
>>> > > will call wakeup on ep->wq, and this will wake up another thread to
>>> > > process the remaining N/2. Without the wakeup, the original thread
>>> > > isn't going to process those events until it finishes with the
>>> > > original N/2 and gets back to epoll_wait(). So I'm not sure how
>>> > > important that path is, but I wanted to at least note that the
>>> > > change here would impact that behavior.
>>> > >
>>> > > Thanks,
>>> > >
>>> > > -Jason
>>> > >
>>> > > >
>>> > > >> So I think removing all that even for the depth 0 case is going
>>> > > >> to change some behavior here. So perhaps it should be removed for
>>> > > >> all depths except 0? And if so, it may be better to make 2 patches
>>> > > >> here to separate these changes.
>>> > > >>
>>> > > >> For the nested wakeups, I agree that the extra wakeups seem
>>> > > >> unnecessary, and it may make sense to remove them for all depths.
>>> > > >> I don't think the nested epoll semantics are particularly well
>>> > > >> spelled out, and afaict, nested epoll has behaved this way for
>>> > > >> quite some time. And the current behavior is not bad in the way
>>> > > >> that a missing wakeup or false negative would be.
>>> > > >
>>> > > > That's 100% true! For edge mode the extra wakeup is not a bug: not
>>> > > > optimal for userspace, yes, but it can't lead to any lost wakeups.
>>> > > >
>>> > > > --
>>> > > > Roman
>>> > > >
>>> >
>>> > I tried to remove the whole chunk of code that Roman mentioned, and
>>> > there seem to be no obvious problems with the two test programs below:
>>
>> I recall this message: test cases 9/25/26 of epoll-wakeup (on github)
>> fail while the whole chunk is removed.
>>
>> With the original patch applied, all tests pass.
>
> These are failing on my bare 5.2.0-rc2:
>
> TEST bin/epoll31 FAIL
> TEST bin/epoll46 FAIL
> TEST bin/epoll50 FAIL
> TEST bin/epoll32 FAIL
> TEST bin/epoll19 FAIL
> TEST bin/epoll27 FAIL
> TEST bin/epoll42 FAIL
> TEST bin/epoll34 FAIL
> TEST bin/epoll48 FAIL
> TEST bin/epoll40 FAIL
> TEST bin/epoll20 FAIL
> TEST bin/epoll28 FAIL
> TEST bin/epoll38 FAIL
> TEST bin/epoll52 FAIL
> TEST bin/epoll24 FAIL
> TEST bin/epoll23 FAIL
>
> These are failing if your patch is applied
> (is my 5.2.0-rc2 old? broken?):
>
> TEST bin/epoll46 FAIL
> TEST bin/epoll42 FAIL
> TEST bin/epoll34 FAIL
> TEST bin/epoll48 FAIL
> TEST bin/epoll40 FAIL
> TEST bin/epoll44 FAIL
> TEST bin/epoll38 FAIL
>
> These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
> but wake_up(&ep->wq) is still invoked:
>
> TEST bin/epoll46 FAIL
> TEST bin/epoll42 FAIL
> TEST bin/epoll34 FAIL
> TEST bin/epoll40 FAIL
> TEST bin/epoll44 FAIL
> TEST bin/epoll38 FAIL
>
> So at least 48 has been "fixed".
>
> These are failing if the whole chunk is removed; as you said, 9, 25 and
> 26 are among those that do not pass:
>
> TEST bin/epoll26 FAIL
> TEST bin/epoll42 FAIL
> TEST bin/epoll34 FAIL
> TEST bin/epoll9 FAIL
> TEST bin/epoll48 FAIL
> TEST bin/epoll40 FAIL
> TEST bin/epoll25 FAIL
> TEST bin/epoll44 FAIL
> TEST bin/epoll38 FAIL
>
> This could be a good test suite; perhaps it can be added to kselftests?
>
> --
> Roman

Indeed, I just tried the same test suite and I am seeing similar failures
- it looks like it's a bit timing dependent.

It looks like all the failures are caused by a similar issue. For
example, take epoll34:

    t0      t1
(ew) |       | (ew)
    e0       |
(lt)  \      |
       \     |
        \   /
         e1
         |  (et)
         s0

The test is trying to assert that an epoll_wait() on e1 and an
epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
epoll_wait() on e1 is done in a thread, and it can happen before or after
the epoll_wait() called against e0. If the epoll_wait() on e1 happens
first, then because s0 is attached to e1 as 'et', it consumes the event,
so there is no longer an event reported at e0. I think those are
reasonable semantics here. However, if the wait on e0 happens before the
wait on e1, the test will pass, as both waits will see the event. Thus, a
patch like this will 'fix' this testcase:

--- a/src/epoll34.c
+++ b/src/epoll34.c
@@ -59,15 +59,15 @@ int main(int argc, char *argv[])
         if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
                 goto out;
 
-        if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
-                goto out;
-
         if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
                 goto out;
 
         if (epoll_wait(efd[0], &e, 1, 500) == 1)
                 count++;
 
+        if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
+                goto out;
+
         if (pthread_join(tw, NULL) < 0)
                 goto out;

I found all the other failures to be of similar origin. I suspect Heiher
didn't see failures due to the thread timings here. I also found that all
the testcases pass if we leave the wake_up(&ep->wq) call for the depth 0
case (and remove the pwake part).

Thanks,

-Jason
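
P.S. For readers without the test suite handy, here is a minimal,
single-threaded sketch of the epoll34 topology above. This is not the
actual src/epoll34.c (the use of an eventfd as s0 and the error handling
are my own); it only demonstrates the ordering dependency directly:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
        struct epoll_event e;
        int efd[2], s0;
        uint64_t v = 1;

        /* s0 stands in for the event source; an eventfd is convenient. */
        s0 = eventfd(0, 0);
        efd[0] = epoll_create1(0);      /* e0, outer epoll fd */
        efd[1] = epoll_create1(0);      /* e1, inner epoll fd */
        if (s0 < 0 || efd[0] < 0 || efd[1] < 0)
                return 1;

        /* e1 watches s0 edge-triggered. */
        e.events = EPOLLIN | EPOLLET;
        e.data.fd = s0;
        if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s0, &e) < 0)
                return 1;

        /* e0 watches e1 level-triggered (the default). */
        e.events = EPOLLIN;
        e.data.fd = efd[1];
        if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
                return 1;

        /* Make s0 readable; readiness propagates s0 -> e1 -> e0. */
        if (write(s0, &v, sizeof(v)) < 0)
                return 1;

        /*
         * Waiting on e0 first: both waits report one event, because
         * polling e1 through the level-triggered link does not consume
         * the inner edge-triggered event.
         */
        printf("wait(e0): %d\n", epoll_wait(efd[0], &e, 1, 500));
        printf("wait(e1): %d\n", epoll_wait(efd[1], &e, 1, 500));
        return 0;
}

In this order both waits should report 1. Swap the two epoll_wait()
calls and the wait on e0 should time out and report 0, because the wait
on e1 consumes the edge-triggered event first - which is exactly the
race epoll34 loses when the waiter on e1 runs in a separate thread.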