On 4/30/20 9:03 AM, Roman Penyaev wrote:
> This patch does two things:
>
> 1. fixes the lost wakeup introduced by:
>    339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
>
> 2. improves performance of event delivery.
>
> The description of the problem is the following: if N (>1) threads
> are waiting on ep->wq for new events and M (>1) events come, it is
> quite likely that >1 wakeups hit the same wait queue entry, because
> there is quite a big window between the __add_wait_queue_exclusive()
> and the following __remove_wait_queue() calls in ep_poll(). This can
> lead to lost wakeups, because the thread that was woken up may not
> handle all the events in ->rdllist (the problem is described in more
> detail here: https://lkml.org/lkml/2019/10/7/905).
>
> The idea of the current patch is to use init_wait() instead of
> init_waitqueue_entry(). Internally init_wait() sets
> autoremove_wake_function as the wake callback, which removes the wait
> entry atomically (under the wq lock) from the list, so the next
> incoming wakeup hits the next wait entry in the wait queue, thus
> preventing lost wakeups.
>
> The problem is reliably reproduced by the epoll60 test case [1].
>
> Wait entry removal on wakeup also has performance benefits, because
> there is no need to take ep->lock and remove the wait entry from the
> queue after a successful wakeup. Here is the timing output of the
> epoll60 test case:
>
> With explicit wakeup from ep_scan_ready_list() (the state of the
> code prior to 339ddb53d373):
>
>   real    0m6.970s
>   user    0m49.786s
>   sys     0m0.113s
>
> After this patch:
>
>   real    0m5.220s
>   user    0m36.879s
>   sys     0m0.019s
>
> The other test case is stress-epoll [2], where one thread consumes
> all the events and other threads produce many events:
>
> With explicit wakeup from ep_scan_ready_list() (the state of the
> code prior to 339ddb53d373):
>
>   threads  events/ms  run-time ms
>         8       5427         1474
>        16       6163         2596
>        32       6824         4689
>        64       7060         9064
>       128       6991        18309
>
> After this patch:
>
>   threads  events/ms  run-time ms
>         8       5598         1429
>        16       7073         2262
>        32       7502         4265
>        64       7640         8376
>       128       7634        16767
>
> ("events/ms" represents event bandwidth, so higher is better;
> "run-time ms" represents the overall time spent running the
> benchmark, so lower is better)
>
> [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
> [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
>
> Signed-off-by: Roman Penyaev <rpenyaev@xxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Heiher <r@xxxxxx>
> Cc: Jason Baron <jbaron@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> ---
>  fs/eventpoll.c | 43 ++++++++++++++++++++++++-------------------
>  1 file changed, 24 insertions(+), 19 deletions(-)
>

Looks good to me and nice speedups.

Reviewed-by: Jason Baron <jbaron@xxxxxxxxxx>

Thanks,

-Jason
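
[Editorial note: for readers not familiar with the waitqueue API, below is a
minimal, hypothetical sketch of the pattern the patch moves to; it is NOT the
actual fs/eventpoll.c code, and the helper name and signature are invented for
illustration. init_wait() installs autoremove_wake_function as the wake
callback, so a successful wake_up() detaches the entry under the waitqueue
lock and the next wakeup proceeds to the next exclusive waiter.]

  /*
   * Hypothetical helper, NOT the actual ep_poll(): illustrates the
   * init_wait() + autoremove_wake_function pattern described above.
   */
  #include <linux/wait.h>
  #include <linux/sched/signal.h>

  static int wait_for_ready(struct wait_queue_head *wq, bool (*ready)(void))
  {
          struct wait_queue_entry wait;
          int ret = 0;

          init_wait(&wait);       /* wake callback = autoremove_wake_function */
          for (;;) {
                  /*
                   * Exclusive entry: wake_up() wakes one waiter, and the
                   * callback removes its entry under the wq lock, so the
                   * next wakeup hits the next entry instead of the same one.
                   */
                  prepare_to_wait_exclusive(wq, &wait, TASK_INTERRUPTIBLE);
                  if (ready())
                          break;
                  if (signal_pending(current)) {
                          ret = -EINTR;
                          break;
                  }
                  schedule();
          }
          finish_wait(wq, &wait); /* no-op if the wakeup already removed us */
          return ret;
  }

The key point for the lost-wakeup case is that removal happens inside the
wakeup itself rather than later in the woken thread, so concurrent wakeups
cannot all land on the same still-queued entry.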