On 9/4/19 5:57 AM, Roman Penyaev wrote:
> On 2019-09-03 23:08, Jason Baron wrote:
>> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>> Hi,
>>>
>>> This is indeed a bug. (A quick side note: could you please remove efd[1]
>>> from your test, because it is not related to reproducing the current
>>> bug.)
>>>
>>> Your patch lacks a good description of what exactly you've fixed. Let
>>> me think out loud, and please correct me if I'm wrong; my understanding
>>> of epoll internals has become a bit rusty. When epoll fds are nested,
>>> an attempt to harvest events (the ep_scan_ready_list() call) produces a
>>> second (repeated) event from an internal fd up to an external fd:
>>>
>>>   epoll_wait(efd[0], ...):
>>>     ep_send_events():
>>>       ep_scan_ready_list(depth=0):
>>>         ep_send_events_proc():
>>>           ep_item_poll():
>>>             ep_scan_ready_list(depth=1):
>>>               ep_poll_safewake():
>>>                 ep_poll_callback()
>>>                   list_add_tail(&epi, &epi->rdllist);
>>>                   ^^^^^^ repeated event
>>>
>>> In your patch you forbid the wakeup for the cases where depth != 0, i.e.
>>> for all nested cases. That seems clear. But what if we go further and
>>> remove the whole chunk, which seems excessive:
>>>
>>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>>
>>> -
>>> -    if (!list_empty(&ep->rdllist)) {
>>> -        /*
>>> -         * Wake up (if active) both the eventpoll wait list and
>>> -         * the ->poll() wait list (delayed after we release the lock).
>>> -         */
>>> -        if (waitqueue_active(&ep->wq))
>>> -            wake_up(&ep->wq);
>>> -        if (waitqueue_active(&ep->poll_wait))
>>> -            pwake++;
>>> -    }
>>>      write_unlock_irq(&ep->lock);
>>>
>>>      if (!ep_locked)
>>>          mutex_unlock(&ep->mtx);
>>>
>>> -    /* We have to call this outside the lock */
>>> -    if (pwake)
>>> -        ep_poll_safewake(&ep->poll_wait);
>>>
>>> I reason like this: by the time we've reached the point of scanning
>>> events for readiness, all wakeups from ep_poll_callback() have already
>>> fired, and new events have already been accounted for in the ready list
>>> (ep_poll_callback() calls the same ep_poll_safewake()). Here, frankly,
>>> I'm not 100% sure and am probably missing some corner cases.
>>>
>>> Thoughts?
>>
>> So the 'wake_up(&ep->wq);' part, I think, is about waking up other
>> threads that may be waiting in epoll_wait(). For example, there may
>> be multiple threads doing epoll_wait() on the same epoll fd, and the
>> logic above seems to say thread 1 may have processed, say, N events and
>> is now going to go off and work on those, so let's wake up thread 2 now
>> to handle the next chunk.
>
> Not quite. The thread which calls ep_scan_ready_list() processes all the
> events, and while processing them, removes them one by one from the
> ready list. But if the event mask is non-zero and the event belongs to a
> level-triggered descriptor (the default mode), it tails the event back
> onto the list (because we are in level mode, so the event should stay
> there). So at the end of this traversal loop the ready list is likely
> not empty, and if so, wake up is called again for nested epoll fds.
> But those nested epoll fds should already have received all the
> notifications from the main event callback, ep_poll_callback(),
> regardless of which thread traverses the events.
>
> I suppose this logic has been around for decades, from when Davide (the
> author) was reshuffling the code here and there.
>
> But I do not feel confident enough to state that this extra wakeup is
> bogus; I just have a gut feeling that it looks excessive.

Note that I was talking about the wakeup done on ep->wq, not ep->poll_wait.
The path that I'm concerned about is this: let's say there are N events
queued on the ready list. A thread that was woken up in epoll_wait() may
decide to process only, say, N/2 of them. Then it will call wakeup on
ep->wq, and this will wake another thread to process the remaining N/2.
Without the wakeup, the original thread isn't going to process those events
until it finishes with the original N/2 and gets back to epoll_wait(). So
I'm not sure how important that path is, but I wanted to at least note that
the change here would impact that behavior.
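To make that concrete, below is a rough userspace sketch (illustrative
only, not taken from the original report) of the kind of setup where that
partial consumption matters: several fds become ready on one epoll
instance while two threads sit in epoll_wait() with maxevents smaller than
the number of ready events, so each waiter is expected to take only part
of the ready list.

/*
 * Illustrative sketch: NFDS eventfds on one epoll instance, two waiter
 * threads, each asking for at most 2 events per epoll_wait() call.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NFDS 4

static int epfd;
static int efds[NFDS];

static void *waiter(void *arg)
{
        struct epoll_event evs[2];      /* deliberately smaller than NFDS */
        int n = epoll_wait(epfd, evs, 2, 5000);

        printf("waiter %ld: epoll_wait() returned %d\n",
               (long)(intptr_t)arg, n);
        return NULL;
}

int main(void)
{
        pthread_t t[2];
        uint64_t one = 1;
        int i;

        epfd = epoll_create1(0);
        for (i = 0; i < NFDS; i++) {
                struct epoll_event ev = { .events = EPOLLIN };

                efds[i] = eventfd(0, 0);        /* not readable yet */
                ev.data.fd = efds[i];
                epoll_ctl(epfd, EPOLL_CTL_ADD, efds[i], &ev);
        }

        /* Both threads block in epoll_wait() before anything is ready. */
        for (i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, waiter, (void *)(intptr_t)i);
        sleep(1);

        /* Make all NFDS fds ready; each waiter can take at most 2. */
        for (i = 0; i < NFDS; i++)
                write(efds[i], &one, sizeof(one));

        for (i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
        return 0;
}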
Thanks,

-Jason

>
>> So I think removing all of that, even for the depth 0 case, is going to
>> change some behavior here. So perhaps it should be removed for all
>> depths except 0? And if so, it may be better to make two patches here to
>> separate these changes.
>>
>> For the nested wakeups, I agree that the extra wakeups seem unnecessary,
>> and it may make sense to remove them for all depths. I don't think the
>> nested epoll semantics are particularly well spelled out, and afaict,
>> nested epoll has behaved this way for quite some time. And the current
>> behavior is not bad in the way that a missing wakeup or false negative
>> would be.
>
> That's 100% true! For edge mode an extra wakeup is not a bug; it is not
> optimal for userspace, yes, but it can't lead to any lost wakeups.
>
> --
> Roman
>
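For reference, here is a minimal sketch of the nested-epoll arrangement the
call trace near the top walks through (hypothetical fd names, assuming
efd[0] watches efd[1] which watches an eventfd; this is not the reporter's
original test case):

#include <stdio.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
        int efd[2], evfd;
        uint64_t val, one = 1;
        struct epoll_event ev = { .events = EPOLLIN };
        struct epoll_event out;

        efd[0] = epoll_create1(0);              /* outer epoll fd */
        efd[1] = epoll_create1(0);              /* inner epoll fd */
        evfd = eventfd(0, 0);

        ev.data.fd = evfd;
        epoll_ctl(efd[1], EPOLL_CTL_ADD, evfd, &ev);    /* inner watches eventfd */
        ev.data.fd = efd[1];
        epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &ev);  /* outer watches inner */

        write(evfd, &one, sizeof(one));         /* make the innermost fd ready */

        /*
         * Harvesting events on the outer fd polls the inner epoll fd,
         * i.e. ep_scan_ready_list(depth=0) -> ep_item_poll(efd[1]) ->
         * ep_scan_ready_list(depth=1), which is the nested path the
         * discussion above is about.
         */
        while (epoll_wait(efd[0], &out, 1, 0) > 0) {
                printf("efd[0] reports efd[1] ready\n");
                /* Drain the inner instance so level-triggered mode settles. */
                if (epoll_wait(efd[1], &out, 1, 0) > 0)
                        read(evfd, &val, sizeof(val));
        }
        return 0;
}

With everything in level-triggered mode, harvesting on efd[0] goes through
ep_item_poll() on efd[1] and hence the nested ep_scan_ready_list(depth=1)
path discussed above.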