On Tue, 2013-01-01 at 23:56 +0000, Eric Wong wrote: > Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > Please document the barrier that this mb() pairs with, and then give > > an explanation for the fix in the commit message, and I'll happily > > take it. Even if it's just duplicating the comments above the > > wq_has_sleeper() function, except modified for the ep_modify() case. > > Hopefully my explanation is correct and makes sense below, > I think both effects of the barrier are needed > > > Of course, it would be good to get verification from Jason and Andreas > > that the alternate patch also works for them. > > Jason just confirmed it. > > ------------------------------- 8< ---------------------------- > From 02f43757d04bb6f2786e79eecf1cfa82e6574379 Mon Sep 17 00:00:00 2001 > From: Eric Wong <normalperson@xxxxxxxx> > Date: Tue, 1 Jan 2013 21:20:27 +0000 > Subject: [PATCH] epoll: prevent missed events on EPOLL_CTL_MOD > > EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to > ensure events are not missed. Since the modifications to the interest > mask are not protected by the same lock as ep_poll_callback, we need to > ensure the change is visible to other CPUs calling ep_poll_callback. > > We also need to ensure f_op->poll() has an up-to-date view of past > events which occured before we modified the interest mask. So this > barrier also pairs with the barrier in wq_has_sleeper(). > > This should guarantee either ep_poll_callback or f_op->poll() (or both) > will notice the readiness of a recently-ready/modified item. > > This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in: > http://thread.gmane.org/gmane.linux.kernel/1408782/ > > Signed-off-by: Eric Wong <normalperson@xxxxxxxx> > Cc: Hans Verkuil <hans.verkuil@xxxxxxxxx> > Cc: Jiri Olsa <jolsa@xxxxxxxxxx> > Cc: Jonathan Corbet <corbet@xxxxxxx> > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > Cc: Davide Libenzi <davidel@xxxxxxxxxxxxxxx> > Cc: Hans de Goede <hdegoede@xxxxxxxxxx> > Cc: Mauro Carvalho Chehab <mchehab@xxxxxxxxxxxxx> > Cc: David Miller <davem@xxxxxxxxxxxxx> > Cc: Eric Dumazet <eric.dumazet@xxxxxxxxx> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > Cc: Andreas Voellmy <andreas.voellmy@xxxxxxxx> > Tested-by: "Junchang(Jason) Wang" <junchang.wang@xxxxxxxx> > Cc: netdev@xxxxxxxxxxxxxxx > Cc: linux-fsdevel@xxxxxxxxxxxxxxx > --- > fs/eventpoll.c | 22 +++++++++++++++++++++- > 1 file changed, 21 insertions(+), 1 deletion(-) > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > index cd96649..39573ee 100644 > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -1285,7 +1285,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even > * otherwise we might miss an event that happens between the > * f_op->poll() call and the new event set registering. > */ > - epi->event.events = event->events; > + epi->event.events = event->events; /* need barrier below */ > pt._key = event->events; > epi->event.data = event->data; /* protected by mtx */ > if (epi->event.events & EPOLLWAKEUP) { > @@ -1296,6 +1296,26 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even > } > > /* > + * The following barrier has two effects: > + * > + * 1) Flush epi changes above to other CPUs. This ensures > + * we do not miss events from ep_poll_callback if an > + * event occurs immediately after we call f_op->poll(). > + * We need this because we did not take ep->lock while > + * changing epi above (but ep_poll_callback does take > + * ep->lock). > + * > + * 2) We also need to ensure we do not miss _past_ events > + * when calling f_op->poll(). This barrier also > + * pairs with the barrier in wq_has_sleeper (see > + * comments for wq_has_sleeper). > + * > + * This barrier will now guarantee ep_poll_callback or f_op->poll > + * (or both) will notice the readiness of an item. > + */ > + smp_mb(); > + > + /* > * Get current event bits. We can safely use the file* here because > * its usage count has been increased by the caller of this function. > */ > -- > Eric Wong First, thanks for working on this issue. It seems the real problem is the epi->event.events = event->events; which is done without taking ep->lock While a smp_mb() could reduce the race window, I believe there is still a race, and the following patch would close it. diff --git a/fs/eventpoll.c b/fs/eventpoll.c index be56b21..25e5c53 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1313,7 +1313,10 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even * otherwise we might miss an event that happens between the * f_op->poll() call and the new event set registering. */ + spin_lock_irq(&ep->lock); epi->event.events = event->events; + spin_unlock_irq(&ep->lock); + pt._key = event->events; epi->event.data = event->data; /* protected by mtx */ if (epi->event.events & EPOLLWAKEUP) { -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html