On 4/28/20 2:10 PM, Roman Penyaev wrote:
> On 2020-04-27 22:38, Jason Baron wrote:
>> On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
>>> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <jbaron@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>>>> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
>>>>> With that wakeup removed, if we add to ovflist here, we may never wake
>>>>> up. Rather than adding back the ep_scan_ready_list wakeup - which was
>>>>> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>>>>
>>>> I'm just curious which 'wakeup' we are talking about here? There is:
>>>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>>>> ep_poll_safewake(ep, epi). It seems to me that it's only about the
>>>> latter one being missing, not both? Is your workload using nested epoll?
>>>>
>>>> If so, it might make sense to just do the latter, since the point of
>>>> the original patch was to minimize unnecessary wakeups.
>>>
>>> The missing wake-ups were when we added to ovflist instead of rdllist.
>>> Both are "the ready list" together - so I'd think we'd want the same
>>> wakeups regardless of which specific list we added to.
>>> ep_poll_callback isn't nested specific?
>>>
>>
>> So I was thinking that ep_poll() would see these events on the
>> ovflist without an explicit wakeup, b/c the overflow list being active
>> implies that the ep_poll() path would add them to the rdllist in
>> ep_scan_ready_list(). Thus, it will see the events either in the
>> current ep_poll() context or via a subsequent syscall to epoll_wait().
>>
>> However, there are other paths that can call ep_scan_ready_list(), thus
>> I agree with you that both wakeups here are necessary.
>>
>> I do think there are still (smaller) potential races in
>> ep_scan_ready_list(), where we have:
>>
>>         write_lock_irq(&ep->lock);
>>         list_splice_init(&ep->rdllist, &txlist);
>>         WRITE_ONCE(ep->ovflist, NULL);
>>         write_unlock_irq(&ep->lock);
>>
>> And in the ep_poll path we have:
>>
>> static inline int ep_events_available(struct eventpoll *ep)
>> {
>>         return !list_empty_careful(&ep->rdllist) ||
>>                 READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
>> }
>>
>>
>> Seems to me that the first bit needs to be the following, since
>> ep_events_available() is now checked in a lockless way:
>>
>>
>>         write_lock_irq(&ep->lock);
>>         WRITE_ONCE(ep->ovflist, NULL);
>>         smp_wmb();
>>         list_splice_init(&ep->rdllist, &txlist);
>>         write_unlock_irq(&ep->lock);
>
>
> Hi Jason,
>
> For the first chunk you refer to, the order seems irrelevant.
> Either you see something not empty, and you go take the lock
> and then check the lists under the lock, or you see that all
> the lists are empty.
>

Hi Roman,

It does matter. Let's say we have: epfd1->epfd2->socket. And thread a
is doing an epoll_wait() on epfd1, and thread b is doing epoll_wait()
on epfd2. Then:

1) data comes in on socket

   ep_poll_callback() wakes up threads a and b

2) thread a runs

   ep_poll()
    ep_scan_ready_list()
     ep_send_events_proc()
      ep_item_poll()
       ep_scan_ready_list()
        list_splice_init(&ep->rdllist, &txlist);

3) now thread b is running

   ep_poll()
    ep_events_available()
      returns false
    schedule_hrtimeout_range()

Thus, thread b has missed a wakeup and will never get it.
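To illustrate the pairing that the proposed smp_wmb() implies, here is
a rough sketch of how the lockless reader side might be restructured.
This is illustrative only, not a tested patch; it simply assumes the
usual smp_wmb()/smp_rmb() pairing, and the reshuffled ep_events_available()
below is my own sketch, not existing kernel code:

/*
 * Sketch only: pairs with the proposed write side
 *
 *	write_lock_irq(&ep->lock);
 *	WRITE_ONCE(ep->ovflist, NULL);
 *	smp_wmb();
 *	list_splice_init(&ep->rdllist, &txlist);
 *	write_unlock_irq(&ep->lock);
 */
static inline int ep_events_available(struct eventpoll *ep)
{
	if (!list_empty_careful(&ep->rdllist))
		return 1;
	/*
	 * Pairs with the smp_wmb() above: a reader that observes the
	 * rdllist emptied by the splice must not still observe the
	 * stale EP_UNACTIVE_PTR value of ovflist.
	 */
	smp_rmb();
	return READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}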
Similarly, for the second chunk I referenced.

>>
>> And also this bit:
>>
>>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>
>>         /*
>>          * Quickly re-inject items left on "txlist".
>>          */
>>         list_splice(&txlist, &ep->rdllist);
>>
>> should, I think, be reversed as well, to:
>>
>>         list_splice(&txlist, &ep->rdllist);
>>         smp_wmb();
>>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>
> But this one is much more interesting. I understand what you
> are trying to achieve: we can't leave both lists empty even for a
> short period of time; if there is something left, the caller of
> ep_events_available() should be able to see that one of the lists
> is not empty, otherwise it can be too late.
>
> But the problem you've spotted is much worse. Some events can
> remain in the txlist (this can happen if the caller of epoll_wait
> wants to harvest only 1 event, but there are more in the ->rdllist).
> And we again get a lost wakeup.
>
> The problem is reproduced by the test below. The idea is simple:
> we have 10 threads and 10 event fds. Each thread can harvest
> only 1 event. 1 producer fires all 10 events at once and waits
> until all 10 events have been observed by the 10 threads.
>
> The fix is basically a revert of 339ddb53d373 with 1 major
> exception: we do wakeups from ep_scan_ready_list(), but only if
> the txlist is not empty and ep_scan_ready_list() is called from
> the routine which sends events, not the one which reads them
> (thus we protect ourselves from repeated wakeups).
>
> I will send the code a bit later.

This was discussed as part of the original discussion around
339ddb53d373:

https://lkml.org/lkml/2019/10/7/905

The context there was more a performance difference than a semantic
difference, but as I pointed out, I believe this behavior pre-dates
the above commit and goes back to:

86c0517 fs/epoll: deal with wait_queue only once

There, since the thread is left on the waitqueue over the call to
ep_scan_ready_list(), the wakeup that was removed in 339ddb53d373
would no longer wake up other potential waiters.

So since this behavior change, I think, goes back to 5.0 and there
really haven't been any reports, I don't think there are too many
apps relying on the semantics that your test case demonstrates. It
would be interesting to confirm that your test does indeed
succeed/fail before/after that patch.

Also, as part of that original discussion, you had a patch that I
think addresses this. I would be ok with that, in addition to a patch
to address the ordering issue I pointed out. I can post a patch for
the former, if you think this plan makes sense?
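As an aside, here is a minimal userspace sketch of the
epfd1->epfd2->socket nesting from the scenario above, purely for
illustration: it is not part of any kernel tree or selftest, and
socketpair() stands in for the real socket.

/*
 * Illustrative only: build the epfd1 -> epfd2 -> socket nesting from
 * the scenario above.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
	int sv[2], epfd1, epfd2, n;
	struct epoll_event ev = { .events = EPOLLIN }, out;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return 1;

	/* epfd2 watches the socket... */
	epfd2 = epoll_create1(0);
	ev.data.fd = sv[0];
	epoll_ctl(epfd2, EPOLL_CTL_ADD, sv[0], &ev);

	/* ...and epfd1 watches epfd2: epfd1 -> epfd2 -> socket */
	epfd1 = epoll_create1(0);
	ev.data.fd = epfd2;
	epoll_ctl(epfd1, EPOLL_CTL_ADD, epfd2, &ev);

	/*
	 * In the scenario, thread a would block in epoll_wait(epfd1, ...)
	 * and thread b in epoll_wait(epfd2, ...); this write is the
	 * "data comes in on socket" step that fires ep_poll_callback()
	 * for both.
	 */
	if (write(sv[1], "x", 1) != 1)
		return 1;

	n = epoll_wait(epfd1, &out, 1, 0);
	printf("epfd1 reports %d ready event(s)\n", n);

	close(epfd1);
	close(epfd2);
	close(sv[0]);
	close(sv[1]);
	return 0;
}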
Thanks,

-Jason

>
> --
> Roman
>
> ---- test -------
>
> enum {
> 	EPOLL60_EVENTS_NR = 10,
> };
>
> struct epoll60_ctx {
> 	volatile int stopped;
> 	int ready;
> 	int waiters;
> 	int epfd;
> 	int evfd[EPOLL60_EVENTS_NR];
> };
>
> static inline int count_waiters(struct epoll60_ctx *ctx)
> {
> 	return __atomic_load_n(&ctx->waiters, __ATOMIC_ACQUIRE);
> }
>
> static void *epoll60_wait_thread(void *ctx_)
> {
> 	struct epoll60_ctx *ctx = ctx_;
> 	struct epoll_event e;
> 	uint64_t v;
> 	int ret;
>
> 	while (!ctx->stopped) {
> 		/* Mark we are ready */
> 		__atomic_fetch_add(&ctx->ready, 1, __ATOMIC_ACQUIRE);
>
> 		/* Start when all are ready */
> 		while (__atomic_load_n(&ctx->ready, __ATOMIC_ACQUIRE) &&
> 		       !ctx->stopped)
> 			;
>
> 		/* Account this waiter */
> 		__atomic_fetch_add(&ctx->waiters, 1, __ATOMIC_ACQUIRE);
> again_wait:
> 		ret = epoll_wait(ctx->epfd, &e, 1, 1000);
> 		if (ret != 1) {
> 			/* Should be stopped, otherwise we lost a wakeup */
> 			assert(ctx->stopped);
> 			return NULL;
> 		}
>
> 		ret = read(e.data.fd, &v, sizeof(v));
> 		if (ret != sizeof(v)) {
> 			/* Event was stolen by another thread */
> 			goto again_wait;
> 		}
> 		__atomic_fetch_sub(&ctx->waiters, 1, __ATOMIC_RELEASE);
> 	}
>
> 	return NULL;
> }
>
> static inline unsigned long long msecs(void)
> {
> 	struct timespec ts;
> 	unsigned long long msecs;
>
> 	clock_gettime(CLOCK_REALTIME, &ts);
> 	msecs = ts.tv_sec * 1000ull;
> 	msecs += ts.tv_nsec / 1000000ull;
>
> 	return msecs;
> }
>
> TEST(epoll60)
> {
> 	struct epoll60_ctx ctx = { 0 };
> 	pthread_t waiters[ARRAY_SIZE(ctx.evfd)];
> 	struct epoll_event e;
> 	int i, n, ret;
>
> 	signal(SIGUSR1, signal_handler);
>
> 	ctx.epfd = epoll_create1(0);
> 	ASSERT_GE(ctx.epfd, 0);
>
> 	/* Create event fds */
> 	for (i = 0; i < ARRAY_SIZE(ctx.evfd); i++) {
> 		ctx.evfd[i] = eventfd(0, EFD_NONBLOCK);
> 		ASSERT_GE(ctx.evfd[i], 0);
>
> 		e.events = EPOLLIN | EPOLLERR;
> 		e.data.fd = ctx.evfd[i];
> 		ASSERT_EQ(epoll_ctl(ctx.epfd, EPOLL_CTL_ADD, ctx.evfd[i], &e), 0);
> 	}
>
> 	/* Create waiter threads */
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++)
> 		ASSERT_EQ(pthread_create(&waiters[i], NULL,
> 					 epoll60_wait_thread, &ctx), 0);
>
> 	for (i = 0; i < 300; i++) {
> 		uint64_t v = 1, ms;
>
> 		/* Wait for all to be ready */
> 		while (__atomic_load_n(&ctx.ready, __ATOMIC_ACQUIRE) !=
> 		       ARRAY_SIZE(ctx.evfd))
> 			;
>
> 		/* Steady, go */
> 		__atomic_fetch_sub(&ctx.ready, ARRAY_SIZE(ctx.evfd),
> 				   __ATOMIC_ACQUIRE);
>
> 		/* Wait until all have gone to the kernel */
> 		while (count_waiters(&ctx) != ARRAY_SIZE(ctx.evfd))
> 			;
>
> 		/* 1ms should be enough to schedule out */
> 		usleep(1000);
>
> 		/* Quickly signal all handles at once */
> 		for (n = 0; n < ARRAY_SIZE(ctx.evfd); n++) {
> 			ret = write(ctx.evfd[n], &v, sizeof(v));
> 			ASSERT_EQ(ret, sizeof(v));
> 		}
>
> 		/* Busy loop for up to 3s and wait for all waiters to wake up */
> 		ms = msecs();
> 		while (count_waiters(&ctx) && msecs() < ms + 3000)
> 			;
>
> 		ASSERT_EQ(count_waiters(&ctx), 0);
> 	}
> 	ctx.stopped = 1;
> 	/* Stop waiters */
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++) {
> 		pthread_kill(waiters[i], SIGUSR1);
> 		pthread_join(waiters[i], NULL);
> 	}
>
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++)
> 		close(ctx.evfd[i]);
> 	close(ctx.epfd);
> }
>
>