Re: epoll_wait misses edge-triggered eventfd events: bug in Linux 5.3 and 5.4

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Max and all,

I can reproduce the issue.  My epoll optimization which you referenced
did not consider the case of wakeups on epoll_ctl() path, only the fd
update path.

I will send the fix upstream today/tomorrow (already tested on the
epollbug.c), the exemplary patch at the bottom of the current
email.

Also I would like to  submit the epollbug.c as a test case for
the epoll test suite. Does the author of epollbug have any
objections?

Thanks.

--
Roman

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index c4159bcc05d9..a90f8b8a5def 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -745,7 +745,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, * the ->poll() wait list (delayed after we release the lock).
                 */
                if (waitqueue_active(&ep->wq))
-                       wake_up(&ep->wq);
+                       wake_up_locked(&ep->wq);
                if (waitqueue_active(&ep->poll_wait))
                        pwake++;
        }
@@ -1200,7 +1200,7 @@ static inline bool chain_epi_lockless(struct epitem *epi) * Another thing worth to mention is that ep_poll_callback() can be called * concurrently for the same @epi from different CPUs if poll table was inited * with several wait queues entries. Plural wakeup from different CPUs of a - * single wait queue is serialized by wq.lock, but the case when multiple wait + * single wait queue is serialized by ep->lock, but the case when multiple wait * queues are used should be detected accordingly. This is detected using
  * cmpxchg() operation.
  */
@@ -1275,6 +1275,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
                                break;
                        }
                }
+               /*
+ * Since here we have the read lock (ep->lock) taken, plural + * wakeup from different CPUs can occur, thus we call wake_up() + * variant which implies its own lock on wqueue. All other paths + * take write lock, thus modifications on ep->wq are serialized
+                * by rw lock.
+                */
                wake_up(&ep->wq);
        }
        if (waitqueue_active(&ep->poll_wait))
@@ -1578,7 +1585,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,

                /* Notify waiting tasks that events are available */
                if (waitqueue_active(&ep->wq))
-                       wake_up(&ep->wq);
+                       wake_up_locked(&ep->wq);
                if (waitqueue_active(&ep->poll_wait))
                        pwake++;
        }
@@ -1684,7 +1691,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,

/* Notify waiting tasks that events are available */
                        if (waitqueue_active(&ep->wq))
-                               wake_up(&ep->wq);
+                               wake_up_locked(&ep->wq);
                        if (waitqueue_active(&ep->poll_wait))
                                pwake++;
                }
@@ -1881,9 +1888,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                waiter = true;
                init_waitqueue_entry(&wait, current);

-               spin_lock_irq(&ep->wq.lock);
+               write_lock_irq(&ep->lock);
                __add_wait_queue_exclusive(&ep->wq, &wait);
-               spin_unlock_irq(&ep->wq.lock);
+               write_unlock_irq(&ep->lock);
        }

        for (;;) {
@@ -1931,9 +1938,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                goto fetch_events;

        if (waiter) {
-               spin_lock_irq(&ep->wq.lock);
+               write_lock_irq(&ep->lock);
                __remove_wait_queue(&ep->wq, &wait);
-               spin_unlock_irq(&ep->wq.lock);
+               write_unlock_irq(&ep->lock);
        }

        return res;




On 2020-02-03 16:15, Max Neunhoeffer wrote:
Dear Jakub and all,

I have done a git bisect and found that this commit introduced the epoll
bug:

https://github.com/torvalds/linux/commit/a218cc4914209ac14476cb32769b31a556355b22

I Cc the author of the commit.

This makes sense, since the commit introduces a new rwlock to reduce
contention in ep_poll_callback. I do not fully understand the details
but this sounds all very close to this bug.

I have also verified that the bug is still present in the latest master
branch in Linus' repository.

Furthermore, Chris Kohlhoff has provided yet another reproducing program
which is no longer using edge-triggered but standard level-triggered
events and epoll_wait. This makes the bug all the more urgent, since
potentially more programs could run into this problem and could end up
with sleeping barbers.

I have added all the details to the bugzilla bugreport:

  https://bugzilla.kernel.org/show_bug.cgi?id=205933

Hopefully, we can resolve this now equipped with this amount of information.

Best regards,
  Max.

On 20/02/01 12:16, Jakub Kicinski wrote:
On Fri, 31 Jan 2020 14:57:30 +0100, Max Neunhoeffer wrote:
> Dear All,
>
> I believe I have found a bug in Linux 5.3 and 5.4 in epoll_wait/epoll_ctl
> when an eventfd together with edge-triggered or the EPOLLONESHOT policy
> is used. If an epoll_ctl call to rearm the eventfd happens approximately
> at the same time as the epoll_wait goes to sleep, the event can be lost,
> even though proper protection through a mutex is employed.
>
> The details together with two programs showing the problem can be found
> here:
>
>   https://bugzilla.kernel.org/show_bug.cgi?id=205933
>
> Older kernels seem not to have this problem, although I did not test all
> versions. I know that 4.15 and 5.0 do not show the problem.
>
> Note that this method of using epoll_wait/eventfd is used by
> boost::asio to wake up event loops in case a new completion handler
> is posted to an io_service, so this is probably relevant for many
> applications.
>
> Any help with this would be appreciated.

Could be networking related but let's CC FS folks just in case.

Would you be able to perform bisection to narrow down the search
for a buggy change?




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux