On Mon, 3 Feb 2020 16:15:36 +0100, Max Neunhoeffer wrote:
> Dear Jakub and all,
>
> I have done a git bisect and found that this commit introduced the epoll
> bug:
>
> https://github.com/torvalds/linux/commit/a218cc4914209ac14476cb32769b31a556355b22
>
> I Cc the author of the commit.

Awesome, thanks a lot for doing that! Hopefully Roman can take a look
soon. Breaking boost::asio seems like a pretty serious regression.

> This makes sense, since the commit introduces a new rwlock to reduce
> contention in ep_poll_callback. I do not fully understand the details
> but this sounds all very close to this bug.
>
> I have also verified that the bug is still present in the latest master
> branch in Linus' repository.
>
> Furthermore, Chris Kohlhoff has provided yet another reproducing program
> which is no longer using edge-triggered but standard level-triggered
> events and epoll_wait. This makes the bug all the more urgent, since
> potentially more programs could run into this problem and could end up
> with sleeping barbers.
>
> I have added all the details to the bugzilla bugreport:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=205933
>
> Hopefully, we can resolve this now equipped with this amount of information.
>
> Best regards,
> Max.
>
> On 20/02/01 12:16, Jakub Kicinski wrote:
> > On Fri, 31 Jan 2020 14:57:30 +0100, Max Neunhoeffer wrote:
> > > Dear All,
> > >
> > > I believe I have found a bug in Linux 5.3 and 5.4 in epoll_wait/epoll_ctl
> > > when an eventfd together with edge-triggered or the EPOLLONESHOT policy
> > > is used. If an epoll_ctl call to rearm the eventfd happens approximately
> > > at the same time as the epoll_wait goes to sleep, the event can be lost,
> > > even though proper protection through a mutex is employed.
> > >
> > > The details together with two programs showing the problem can be found
> > > here:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=205933
> > >
> > > Older kernels seem not to have this problem, although I did not test all
> > > versions. I know that 4.15 and 5.0 do not show the problem.
> > >
> > > Note that this method of using epoll_wait/eventfd is used by
> > > boost::asio to wake up event loops in case a new completion handler
> > > is posted to an io_service, so this is probably relevant for many
> > > applications.
> > >
> > > Any help with this would be appreciated.
> >
> > Could be networking related but let's CC FS folks just in case.
> >
> > Would you be able to perform bisection to narrow down the search
> > for a buggy change?
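
For anyone following the thread without the bugzilla attachments, here
is a minimal single-threaded sketch of the usage pattern described in
the original report: an eventfd registered with EPOLLONESHOT, posted to,
waited on with epoll_wait, and rearmed with epoll_ctl(EPOLL_CTL_MOD).
The wakeup_* helper names are hypothetical; they are not taken from
boost::asio or from the reproducers in the bug report. The sketch only
shows the API sequence. It does not reproduce the lost wakeup, which
needs a rearm racing with another thread going to sleep in epoll_wait.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* NOTE: all wakeup_* names are hypothetical, chosen for illustration. */

/* Create an epoll instance plus an eventfd registered as one-shot.
 * Returns 0 on success, -1 on failure. */
int wakeup_create(int *epfd, int *evfd)
{
    struct epoll_event ev = {0};

    *epfd = epoll_create1(EPOLL_CLOEXEC);
    if (*epfd < 0)
        return -1;

    *evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (*evfd < 0) {
        close(*epfd);
        return -1;
    }

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = *evfd;
    if (epoll_ctl(*epfd, EPOLL_CTL_ADD, *evfd, &ev) < 0) {
        close(*evfd);
        close(*epfd);
        return -1;
    }
    return 0;
}

/* Signal the event loop by adding 1 to the eventfd counter. */
int wakeup_post(int evfd)
{
    uint64_t one = 1;

    return write(evfd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Re-enable the one-shot registration. This is the epoll_ctl call that,
 * per the report, can race with a concurrent epoll_wait going to sleep. */
int wakeup_rearm(int epfd, int evfd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = evfd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, evfd, &ev);
}

/* Wait for at most one event; returns the number of ready descriptors
 * (0 on timeout, -1 on error). */
int wakeup_wait(int epfd, int timeout_ms)
{
    struct epoll_event ev;

    return epoll_wait(epfd, &ev, 1, timeout_ms);
}
```

Without a concurrent waiter, posting once and then calling
wakeup_wait() should report one ready descriptor. A second wait should
report none until wakeup_rearm() is called, because EPOLLONESHOT
disables the descriptor after an event has been delivered. Since the
sketch never reads the eventfd counter, the rearm finds the descriptor
still readable and the next wait reports it again.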