On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <jbaron@xxxxxxxxxx> wrote:
>>
>>
>>
>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
>>> With that wakeup removed, if we add to ovflist here, we may never wake
>>> up. Rather than adding back the ep_scan_ready_list wakeup - which was
>>> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>>
>> I'm just curious which 'wakeup' we are talking about here? There is:
>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>> ep_poll_safewake(ep, epi). It seems to me that it's only the latter
>> one that's missing, not both? Is your workload using nested epoll?
>>
>> If so, it might make sense to just do the latter, since the point of
>> the original patch was to minimize unnecessary wakeups.
>
> The missing wake-ups were when we added to ovflist instead of rdllist.
> Both are "the ready list" together - so I'd think we'd want the same
> wakeups regardless of which specific list we added to.
> ep_poll_callback isn't nested specific?
>

So I was thinking that ep_poll() would see these events on the ovflist
without an explicit wakeup, b/c the overflow list being active implies
that the ep_poll() path would add them to the rdllist in
ep_scan_ready_list(). Thus, it will see the events either in the current
ep_poll() context or via a subsequent syscall to epoll_wait().

However, there are other paths that can call ep_scan_ready_list(), so I
agree with you that both wakeups here are necessary.
I do think there are still (smaller) potential races in
ep_scan_ready_list(), where we have:

	write_lock_irq(&ep->lock);
	list_splice_init(&ep->rdllist, &txlist);
	WRITE_ONCE(ep->ovflist, NULL);
	write_unlock_irq(&ep->lock);

And in the ep_poll path we have:

static inline int ep_events_available(struct eventpoll *ep)
{
	return !list_empty_careful(&ep->rdllist) ||
		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}

Seems to me that the first bit needs to be the following, since
ep_events_available() is now checked in a lockless way:

	write_lock_irq(&ep->lock);
	WRITE_ONCE(ep->ovflist, NULL);
	smp_wmb();
	list_splice_init(&ep->rdllist, &txlist);
	write_unlock_irq(&ep->lock);

And also this bit:

	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

	/*
	 * Quickly re-inject items left on "txlist".
	 */
	list_splice(&txlist, &ep->rdllist);

should, I think, be reversed as well to:

	list_splice(&txlist, &ep->rdllist);
	smp_wmb();
	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

I can send those as a separate patch followup.

Thanks,

-Jason

>>> We noticed that one of our workloads was missing wakeups starting with
>>> 339ddb53d373 and upon manual inspection, this wakeup seemed missing to
>>> me. With this patch added, we no longer see missing wakeups. I haven't
>>> yet tried to make a small reproducer, but the existing kselftests in
>>> filesystem/epoll passed for me with this patch.
>>>
>>> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
>>>
>>> Signed-off-by: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
>>> Reviewed-by: Roman Penyaev <rpenyaev@xxxxxxx>
>>> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
>>> Cc: Heiher <r@xxxxxx>
>>> Cc: Jason Baron <jbaron@xxxxxxxxxx>
>>> Cc: <stable@xxxxxxxxxxxxxxx>
>>> ---
>>> v2: use if/elif instead of goto + cleanup suggested by Roman
>>>  fs/eventpoll.c | 18 +++++++++---------
>>>  1 file changed, 9 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index 8c596641a72b..d6ba0e52439b 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -1171,6 +1171,10 @@ static inline bool chain_epi_lockless(struct epitem *epi)
>>>  {
>>>  	struct eventpoll *ep = epi->ep;
>>>
>>> +	/* Fast preliminary check */
>>> +	if (epi->next != EP_UNACTIVE_PTR)
>>> +		return false;
>>> +
>>>  	/* Check that the same epi has not been just chained from another CPU */
>>>  	if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
>>>  		return false;
>>> @@ -1237,16 +1241,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>>  	 * chained in ep->ovflist and requeued later on.
>>>  	 */
>>>  	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
>>> -		if (epi->next == EP_UNACTIVE_PTR &&
>>> -		    chain_epi_lockless(epi))
>>> +		if (chain_epi_lockless(epi))
>>> +			ep_pm_stay_awake_rcu(epi);
>>> +	} else if (!ep_is_linked(epi)) {
>>> +		/* In the usual case, add event to ready list. */
>>> +		if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
>>>  			ep_pm_stay_awake_rcu(epi);
>>> -		goto out_unlock;
>>> -	}
>>> -
>>> -	/* If this file is already in the ready list we exit soon */
>>> -	if (!ep_is_linked(epi) &&
>>> -	    list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) {
>>> -		ep_pm_stay_awake_rcu(epi);
>>> -	}
>>>
>>>  	/*