On 4/28/20 2:10 PM, Roman Penyaev wrote:
> On 2020-04-27 22:38, Jason Baron wrote:
>> On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
>>> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <jbaron@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>>>> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
>>>>> With that wakeup removed, if we add to ovflist here, we may never wake
>>>>> up. Rather than adding back the ep_scan_ready_list wakeup - which was
>>>>> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>>>>
>>>> I'm just curious which 'wakeup' we are talking about here? There is:
>>>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>>>> ep_poll_safewake(ep, epi). It seems to me that it's only about the
>>>> latter one being missing, not both? Is your workload using nested epoll?
>>>>
>>>> If so, it might make sense to just do the latter, since the point of
>>>> the original patch was to minimize unnecessary wakeups.
>>>
>>> The missing wake-ups were when we added to ovflist instead of rdllist.
>>> Both are "the ready list" together - so I'd think we'd want the same
>>> wakeups regardless of which specific list we added to.
>>> ep_poll_callback isn't nested specific?
>>>
>>
>> So I was thinking that ep_poll() would see these events on the
>> ovflist without an explicit wakeup, b/c the overflow list being active
>> implies that the ep_poll() path would add them to the rdllist in
>> ep_scan_ready_list(). Thus, it will see the events either in the
>> current ep_poll() context or via a subsequent syscall to epoll_wait().
>>
>> However, there are other paths that can call ep_scan_ready_list(), thus
>> I agree with you that both wakeups here are necessary.
>>
>> I do think there are still (smaller) potential races in
>> ep_scan_ready_list(), where we have:
>>
>>         write_lock_irq(&ep->lock);
>>         list_splice_init(&ep->rdllist, &txlist);
>>         WRITE_ONCE(ep->ovflist, NULL);
>>         write_unlock_irq(&ep->lock);
>>
>> And in the ep_poll path we have:
>>
>> static inline int ep_events_available(struct eventpoll *ep)
>> {
>>         return !list_empty_careful(&ep->rdllist) ||
>>                 READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
>> }
>>
>>
>> Seems to me that the first bit needs to be the following, since
>> ep_events_available() is now checked in a lockless way:
>>
>>
>>         write_lock_irq(&ep->lock);
>>         WRITE_ONCE(ep->ovflist, NULL);
>>         smp_wmb();
>>         list_splice_init(&ep->rdllist, &txlist);
>>         write_unlock_irq(&ep->lock);
>
>
> Hi Jason,
>
> For the first chunk you refer to, the order seems irrelevant.
> Either you see something not empty, and you go take the lock
> and then check the lists under the lock, or you see that all
> the lists are empty.
>

Hi Roman,

It does matter. Let's say we have: epfd1->epfd2->socket. And thread a
is doing an epoll_wait() on epfd1, and thread b is doing epoll_wait()
on epfd2. Then:

1) data comes in on socket

   ep_poll_callback() wakes up threads a and b

2) thread a runs

   ep_poll()
    ep_scan_ready_list()
     ep_send_events_proc()
      ep_item_poll()
       ep_scan_ready_list()
        list_splice_init(&ep->rdllist, &txlist);

3) now thread b is running

   ep_poll()
    ep_events_available()
      returns false
    schedule_hrtimeout_range()

Thus, thread b has missed a wakeup and will never get it.
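To illustrate the pairing that the proposed smp_wmb() implies, here is
a rough sketch of how the lockless reader side might be restructured.
This is illustrative only, not a tested patch; it simply assumes the
usual smp_wmb()/smp_rmb() pairing, and the reshuffled ep_events_available()
below is my own sketch, not existing kernel code:

/*
 * Sketch only: pairs with the proposed write side
 *
 *	write_lock_irq(&ep->lock);
 *	WRITE_ONCE(ep->ovflist, NULL);
 *	smp_wmb();
 *	list_splice_init(&ep->rdllist, &txlist);
 *	write_unlock_irq(&ep->lock);
 */
static inline int ep_events_available(struct eventpoll *ep)
{
	if (!list_empty_careful(&ep->rdllist))
		return 1;
	/*
	 * Pairs with the smp_wmb() above: a reader that observes the
	 * rdllist emptied by the splice must not still observe the
	 * stale EP_UNACTIVE_PTR value of ovflist.
	 */
	smp_rmb();
	return READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}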
Similarly, for the second chunk I referenced.

>>
>> And also this bit:
>>
>>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>
>>         /*
>>          * Quickly re-inject items left on "txlist".
>>          */
>>         list_splice(&txlist, &ep->rdllist);
>>
>> should, I think, be reversed as well, to:
>>
>>         list_splice(&txlist, &ep->rdllist);
>>         smp_wmb();
>>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>
> But this one is much more interesting. I understand what you
> are trying to achieve: we can't leave both lists empty even for a
> short period of time; if there is something left, the caller of
> ep_events_available() should be able to see that one of the lists
> is not empty, otherwise it can be too late.
>
> But the problem you've spotted is much worse. Some events can
> remain in the txlist (this can happen if the caller of epoll_wait
> wants to harvest only 1 event, but there are more in the ->rdllist).
> And we again get a lost wakeup.
>
> The problem is reproduced by the test below. The idea is simple:
> we have 10 threads and 10 event fds. Each thread can harvest
> only 1 event. 1 producer fires all 10 events at once and waits
> until all 10 events have been observed by the 10 threads.
>
> The fix is basically a revert of 339ddb53d373 with 1 major
> exception: we do wakeups from ep_scan_ready_list(), but only if
> the txlist is not empty and ep_scan_ready_list() is called from
> the routine which sends events, not the one which reads them
> (thus we protect ourselves from repeated wakeups).
>
> I will send the code a bit later.

This was discussed as part of the original discussion around
339ddb53d373:

https://lkml.org/lkml/2019/10/7/905

The context there was more a performance difference than a semantic
difference, but as I pointed out, I believe this behavior pre-dates
the above commit and goes back to:

86c0517 fs/epoll: deal with wait_queue only once

There, since the thread is left on the waitqueue over the call to
ep_scan_ready_list(), the wakeup that was removed in 339ddb53d373
would no longer wake up other potential waiters.

So since this behavior change, I think, goes back to 5.0 and there
really haven't been any reports, I don't think there are too many
apps relying on the semantics that your test case demonstrates. It
would be interesting to confirm that your test does indeed
succeed/fail before/after that patch.

Also, as part of that original discussion, you had a patch that I
think addresses this. I would be ok with that, in addition to a patch
to address the ordering issue I pointed out. I can post a patch for
the former, if you think this plan makes sense?
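As an aside, here is a minimal userspace sketch of the
epfd1->epfd2->socket nesting from the scenario above, purely for
illustration: it is not part of any kernel tree or selftest, and
socketpair() stands in for the real socket.

/*
 * Illustrative only: build the epfd1 -> epfd2 -> socket nesting from
 * the scenario above.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
	int sv[2], epfd1, epfd2, n;
	struct epoll_event ev = { .events = EPOLLIN }, out;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return 1;

	/* epfd2 watches the socket... */
	epfd2 = epoll_create1(0);
	ev.data.fd = sv[0];
	epoll_ctl(epfd2, EPOLL_CTL_ADD, sv[0], &ev);

	/* ...and epfd1 watches epfd2: epfd1 -> epfd2 -> socket */
	epfd1 = epoll_create1(0);
	ev.data.fd = epfd2;
	epoll_ctl(epfd1, EPOLL_CTL_ADD, epfd2, &ev);

	/*
	 * In the scenario, thread a would block in epoll_wait(epfd1, ...)
	 * and thread b in epoll_wait(epfd2, ...); this write is the
	 * "data comes in on socket" step that fires ep_poll_callback()
	 * for both.
	 */
	if (write(sv[1], "x", 1) != 1)
		return 1;

	n = epoll_wait(epfd1, &out, 1, 0);
	printf("epfd1 reports %d ready event(s)\n", n);

	close(epfd1);
	close(epfd2);
	close(sv[0]);
	close(sv[1]);
	return 0;
}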
Thanks,

-Jason

>
> --
> Roman
>
> ---- test -------
>
> enum {
> 	EPOLL60_EVENTS_NR = 10,
> };
>
> struct epoll60_ctx {
> 	volatile int stopped;
> 	int ready;
> 	int waiters;
> 	int epfd;
> 	int evfd[EPOLL60_EVENTS_NR];
> };
>
> static inline int count_waiters(struct epoll60_ctx *ctx)
> {
> 	return __atomic_load_n(&ctx->waiters, __ATOMIC_ACQUIRE);
> }
>
> static void *epoll60_wait_thread(void *ctx_)
> {
> 	struct epoll60_ctx *ctx = ctx_;
> 	struct epoll_event e;
> 	uint64_t v;
> 	int ret;
>
> 	while (!ctx->stopped) {
> 		/* Mark we are ready */
> 		__atomic_fetch_add(&ctx->ready, 1, __ATOMIC_ACQUIRE);
>
> 		/* Start when all are ready */
> 		while (__atomic_load_n(&ctx->ready, __ATOMIC_ACQUIRE) &&
> 		       !ctx->stopped)
> 			;
>
> 		/* Account this waiter */
> 		__atomic_fetch_add(&ctx->waiters, 1, __ATOMIC_ACQUIRE);
> again_wait:
> 		ret = epoll_wait(ctx->epfd, &e, 1, 1000);
> 		if (ret != 1) {
> 			/* Should be stopped, otherwise we lost a wakeup */
> 			assert(ctx->stopped);
> 			return NULL;
> 		}
>
> 		ret = read(e.data.fd, &v, sizeof(v));
> 		if (ret != sizeof(v)) {
> 			/* Event was stolen by another thread */
> 			goto again_wait;
> 		}
> 		__atomic_fetch_sub(&ctx->waiters, 1, __ATOMIC_RELEASE);
> 	}
>
> 	return NULL;
> }
>
> static inline unsigned long long msecs(void)
> {
> 	struct timespec ts;
> 	unsigned long long msecs;
>
> 	clock_gettime(CLOCK_REALTIME, &ts);
> 	msecs = ts.tv_sec * 1000ull;
> 	msecs += ts.tv_nsec / 1000000ull;
>
> 	return msecs;
> }
>
> TEST(epoll60)
> {
> 	struct epoll60_ctx ctx = { 0 };
> 	pthread_t waiters[ARRAY_SIZE(ctx.evfd)];
> 	struct epoll_event e;
> 	int i, n, ret;
>
> 	signal(SIGUSR1, signal_handler);
>
> 	ctx.epfd = epoll_create1(0);
> 	ASSERT_GE(ctx.epfd, 0);
>
> 	/* Create event fds */
> 	for (i = 0; i < ARRAY_SIZE(ctx.evfd); i++) {
> 		ctx.evfd[i] = eventfd(0, EFD_NONBLOCK);
> 		ASSERT_GE(ctx.evfd[i], 0);
>
> 		e.events = EPOLLIN | EPOLLERR;
> 		e.data.fd = ctx.evfd[i];
> 		ASSERT_EQ(epoll_ctl(ctx.epfd, EPOLL_CTL_ADD, ctx.evfd[i], &e), 0);
> 	}
>
> 	/* Create waiter threads */
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++)
> 		ASSERT_EQ(pthread_create(&waiters[i], NULL,
> 					 epoll60_wait_thread, &ctx), 0);
>
> 	for (i = 0; i < 300; i++) {
> 		uint64_t v = 1, ms;
>
> 		/* Wait for all to be ready */
> 		while (__atomic_load_n(&ctx.ready, __ATOMIC_ACQUIRE) !=
> 		       ARRAY_SIZE(ctx.evfd))
> 			;
>
> 		/* Steady, go */
> 		__atomic_fetch_sub(&ctx.ready, ARRAY_SIZE(ctx.evfd),
> 				   __ATOMIC_ACQUIRE);
>
> 		/* Wait until all have gone to the kernel */
> 		while (count_waiters(&ctx) != ARRAY_SIZE(ctx.evfd))
> 			;
>
> 		/* 1ms should be enough to schedule out */
> 		usleep(1000);
>
> 		/* Quickly signal all handles at once */
> 		for (n = 0; n < ARRAY_SIZE(ctx.evfd); n++) {
> 			ret = write(ctx.evfd[n], &v, sizeof(v));
> 			ASSERT_EQ(ret, sizeof(v));
> 		}
>
> 		/* Busy loop for up to 3s and wait for all waiters to wake up */
> 		ms = msecs();
> 		while (count_waiters(&ctx) && msecs() < ms + 3000)
> 			;
>
> 		ASSERT_EQ(count_waiters(&ctx), 0);
> 	}
> 	ctx.stopped = 1;
> 	/* Stop waiters */
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++) {
> 		pthread_kill(waiters[i], SIGUSR1);
> 		pthread_join(waiters[i], NULL);
> 	}
>
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++)
> 		close(ctx.evfd[i]);
> 	close(ctx.epfd);
> }
>
>