Hi, On Fri, Sep 6, 2019 at 1:48 AM Jason Baron <jbaron@xxxxxxxxxx> wrote: > > > > On 9/5/19 1:27 PM, Roman Penyaev wrote: > > On 2019-09-05 11:56, Heiher wrote: > >> Hi, > >> > >> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@xxxxxx> wrote: > >>> > >>> Hi, > >>> > >>> I created an epoll wakeup test project, listed some possible cases, > >>> and any other corner cases needs to be added? > >>> > >>> https://github.com/heiher/epoll-wakeup/blob/master/README.md > >>> > >>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@xxxxxx> wrote: > >>> > > >>> > Hi, > >>> > > >>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@xxxxxxxxxx> wrote: > >>> > > > >>> > > > >>> > > > >>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote: > >>> > > > On 2019-09-03 23:08, Jason Baron wrote: > >>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote: > >>> > > >>> Hi, > >>> > > >>> > >>> > > >>> This is indeed a bug. (quick side note: could you please > >>> remove efd[1] > >>> > > >>> from your test, because it is not related to the reproduction > >>> of a > >>> > > >>> current bug). > >>> > > >>> > >>> > > >>> Your patch lacks a good description, what exactly you've > >>> fixed. Let > >>> > > >>> me speak out loud and please correct me if I'm wrong, my > >>> understanding > >>> > > >>> of epoll internals has become a bit rusty: when epoll fds are > >>> nested > >>> > > >>> an attempt to harvest events (ep_scan_ready_list() call) > >>> produces a > >>> > > >>> second (repeated) event from an internal fd up to an external > >>> fd: > >>> > > >>> > >>> > > >>> epoll_wait(efd[0], ...): > >>> > > >>> ep_send_events(): > >>> > > >>> ep_scan_ready_list(depth=0): > >>> > > >>> ep_send_events_proc(): > >>> > > >>> ep_item_poll(): > >>> > > >>> ep_scan_ready_list(depth=1): > >>> > > >>> ep_poll_safewake(): > >>> > > >>> ep_poll_callback() > >>> > > >>> list_add_tail(&epi, &epi->rdllist); > >>> > > >>> ^^^^^^ > >>> > > >>> repeated event > >>> > > >>> > >>> > > >>> > >>> > > >>> In your patch you forbid wakeup for the cases, where depth != > >>> 0, i.e. > >>> > > >>> for all nested cases. That seems clear. But what if we can > >>> go further > >>> > > >>> and remove the whole chunk, which seems excessive: > >>> > > >>> > >>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct > >>> > > >>> eventpoll *ep, > >>> > > >>> > >>> > > >>> - > >>> > > >>> - if (!list_empty(&ep->rdllist)) { > >>> > > >>> - /* > >>> > > >>> - * Wake up (if active) both the eventpoll > >>> wait list and > >>> > > >>> - * the ->poll() wait list (delayed after we > >>> release the > >>> > > >>> lock). > >>> > > >>> - */ > >>> > > >>> - if (waitqueue_active(&ep->wq)) > >>> > > >>> - wake_up(&ep->wq); > >>> > > >>> - if (waitqueue_active(&ep->poll_wait)) > >>> > > >>> - pwake++; > >>> > > >>> - } > >>> > > >>> write_unlock_irq(&ep->lock); > >>> > > >>> > >>> > > >>> if (!ep_locked) > >>> > > >>> mutex_unlock(&ep->mtx); > >>> > > >>> > >>> > > >>> - /* We have to call this outside the lock */ > >>> > > >>> - if (pwake) > >>> > > >>> - ep_poll_safewake(&ep->poll_wait); > >>> > > >>> > >>> > > >>> > >>> > > >>> I reason like that: by the time we've reached the point of > >>> scanning events > >>> > > >>> for readiness all wakeups from ep_poll_callback have been > >>> already fired and > >>> > > >>> new events have been already accounted in ready list > >>> (ep_poll_callback() > >>> > > >>> calls > >>> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% > >>> sure and probably > >>> > > >>> missing some corner cases. > >>> > > >>> > >>> > > >>> Thoughts? > >>> > > >> > >>> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up > >>> other > >>> > > >> threads that may be in waiting in epoll_wait(). For example, > >>> there may > >>> > > >> be multiple threads doing epoll_wait() on the same epoll fd, > >>> and the > >>> > > >> logic above seems to say thread 1 may have processed say N > >>> events and > >>> > > >> now its going to to go off to work those, so let's wake up > >>> thread 2 now > >>> > > >> to handle the next chunk. > >>> > > > > >>> > > > Not quite. Thread which calls ep_scan_ready_list() processes > >>> all the > >>> > > > events, and while processing those, removes them one by one > >>> from the > >>> > > > ready list. But if event mask is !0 and event belongs to > >>> > > > Level Triggered Mode descriptor (let's say default mode) it > >>> tails event > >>> > > > again back to the list (because we are in level mode, so event > >>> should > >>> > > > be there). So at the end of this traversing loop ready list is > >>> likely > >>> > > > not empty, and if so, wake up again is called for nested epoll > >>> fds. > >>> > > > But, those nested epoll fds should get already all the > >>> notifications > >>> > > > from the main event callback ep_poll_callback(), regardless any > >>> thread > >>> > > > which traverses events. > >>> > > > > >>> > > > I suppose this logic exists for decades, when Davide (the > >>> author) was > >>> > > > reshuffling the code here and there. > >>> > > > > >>> > > > But I do not feel confidence to state that this extra wakeup is > >>> bogus, > >>> > > > I just have a gut feeling that it looks excessive. > >>> > > > >>> > > Note that I was talking about the wakeup done on ep->wq not > >>> ep->poll_wait. > >>> > > The path that I'm concerned about is let's say that there are N > >>> events > >>> > > queued on the ready list. A thread that was woken up in > >>> epoll_wait may > >>> > > decide to only process say N/2 of then. Then it will call wakeup > >>> on ep->wq > >>> > > and this will wakeup another thread to process the remaining N/2. > >>> Without > >>> > > the wakeup, the original thread isn't going to process the events > >>> until > >>> > > it finishes with the original N/2 and gets back to epoll_wait(). > >>> So I'm not > >>> > > sure how important that path is but I wanted to at least note the > >>> change > >>> > > here would impact that behavior. > >>> > > > >>> > > Thanks, > >>> > > > >>> > > -Jason > >>> > > > >>> > > > >>> > > > > >>> > > >> So I think removing all that even for the > >>> > > >> depth 0 case is going to change some behavior here. So > >>> perhaps, it > >>> > > >> should be removed for all depths except for 0? And if so, it > >>> may be > >>> > > >> better to make 2 patches here to separate these changes. > >>> > > >> > >>> > > >> For the nested wakeups, I agree that the extra wakeups seem > >>> unnecessary > >>> > > >> and it may make sense to remove them for all depths. I don't > >>> think the > >>> > > >> nested epoll semantics are particularly well spelled out, and > >>> afaict, > >>> > > >> nested epoll() has behaved this way for quite some time. And > >>> the current > >>> > > >> behavior is not bad in the way that a missing wakeup or false > >>> negative > >>> > > >> would be. > >>> > > > > >>> > > > That's 100% true! For edge mode extra wake up is not a bug, not > >>> optimal > >>> > > > for userspace - yes, but that can't lead to any lost wakeups. > >>> > > > > >>> > > > -- > >>> > > > Roman > >>> > > > > >>> > > >>> > I tried to remove the whole chunk of code that Roman said, and it > >>> > seems that there > >>> > are no obvious problems with the two test programs below: > >> > >> I recall this message, the test case 9/25/26 of epoll-wakeup (on > >> github) are failed while > >> the whole chunk are removed. > >> > >> Apply the original patch, all tests passed. > > > > > > These are failing on my bare 5.2.0-rc2 > > > > TEST bin/epoll31 FAIL > > TEST bin/epoll46 FAIL > > TEST bin/epoll50 FAIL > > TEST bin/epoll32 FAIL > > TEST bin/epoll19 FAIL > > TEST bin/epoll27 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll48 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll20 FAIL > > TEST bin/epoll28 FAIL > > TEST bin/epoll38 FAIL > > TEST bin/epoll52 FAIL > > TEST bin/epoll24 FAIL > > TEST bin/epoll23 FAIL > > > > > > These are failing if your patch is applied: > > (my 5.2.0-rc2 is old? broken?) > > > > TEST bin/epoll46 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll48 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll44 FAIL > > TEST bin/epoll38 FAIL > > > > These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called, > > but wakeup(&ep->wq); is still invoked: > > > > TEST bin/epoll46 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll44 FAIL > > TEST bin/epoll38 FAIL > > > > So at least 48 has been "fixed". > > > > These are failing if the whole chunk is removed, like your > > said 9,25,26 are among which do not pass: > > > > TEST bin/epoll26 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll9 FAIL > > TEST bin/epoll48 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll25 FAIL > > TEST bin/epoll44 FAIL > > TEST bin/epoll38 FAIL > > > > This can be a good test suite, probably can be added to kselftests? Thank you, I have updated epoll-tests to fix these issues. I think this is good news if we can added to kselftests. ;) > > > > -- > > Roman > > > > > Indeed, I just tried the same test suite and I am seeing similar > failures - it looks like its a bit timing dependent. It looks like all > the failures are caused by a similar issue. For example, take epoll34: > > t0 t1 > (ew) | | (ew) > e0 | > (lt) \ / > | > e1 > | (et) > s0 > > > The test is trying to assert that an epoll_wait() on e1 and and > epoll_wait() on e0 both return 1 event for EPOLLIN. However, the > epoll_wait on e1 is done in a thread and it can happen before or after > the epoll_wait() is called against e0. If the epoll_wait() on e1 happens > first then because its attached as 'et', it consumes the event. So that > there is no longer an event reported at e0. I think that is reasonable > semantics here. However if the wait on e0 happens after the wait on e1 > then the test will pass as both waits will see the event. Thus, a patch > like this will 'fix' this testcase: > > --- a/src/epoll34.c > +++ b/src/epoll34.c > @@ -59,15 +59,15 @@ int main(int argc, char *argv[]) > if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) > goto out; > > - if (pthread_create(&tw, NULL, thread_handler, NULL) < 0) > - goto out; > - > if (pthread_create(&te, NULL, emit_handler, NULL) < 0) > goto out; > > if (epoll_wait(efd[0], &e, 1, 500) == 1) > count++; > > + if (pthread_create(&tw, NULL, thread_handler, NULL) < 0) > + goto out; > + > if (pthread_join(tw, NULL) < 0) > goto out; > > > I found all the other failures to be of similar origin. I suspect Heiher > didn't see failures due to the thread timings here. Thank you. I also found a multi-threaded concurrent accumulation problem, and that has been changed to atomic operations. I think we should allow two different behaviors to be passed because they are all correctly. thread 2: if (epoll_wait(efd[1], &e, 1, 500) == 1) __sync_fetch_and_or(&count, 1); thread1: if (epoll_wait(efd[0], &e, 1, 500) == 1) __sync_fetch_and_or(&count, 2); check: if ((count != 1) && (count != 3)) goto out; > > I also found that all the testcases pass if we leave the wakeup(&ep->wq) > call for the depth 0 case (and remove the pwake part). So, We need to keep the wakeup(&ep-wq) for all depth, and only wakeup(&ep->poll_wait) for depth 0 and/or ep->rdlist from empty to be not empty? > > Thanks, > > -Jason > > > > -- Best regards! Hev https://hev.cc