From: Andrea Arcangeli <aarcange@xxxxxxxxxx>

When a new message is generated for a userfaultfd, instead of waking up
all the readers, we can wake up only one exclusive reader to process the
event.  Waking up more than one reader for a single message is a waste of
resources: the remaining readers will find nothing to read and re-queue.

This should make userfaultfd read() wakeups O(1) rather than proportional
to the number of waiting readers.

Note that queuing on head is intended (rather than tail) to make sure the
readers are woken up in LIFO fashion; fairness doesn't matter much here,
but caching does.

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
[peterx: modified subjects / commit message]
Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
 fs/userfaultfd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 56eaae9dac1a..f7fda7d0c994 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1061,7 +1061,11 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 
 	/* always take the fd_wqh lock before the fault_pending_wqh lock */
 	spin_lock_irq(&ctx->fd_wqh.lock);
-	__add_wait_queue(&ctx->fd_wqh, &wait);
+	/*
+	 * Only wake up one exclusive reader each time there's an event.
+	 * Paired with wake_up_poll() when e.g. a new page fault msg generated.
+	 */
+	__add_wait_queue_exclusive(&ctx->fd_wqh, &wait);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		spin_lock(&ctx->fault_pending_wqh.lock);
-- 
2.41.0
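
For readers unfamiliar with exclusive waiters, the behaviour the commit
message relies on can be sketched with a toy userspace model.  This is
not kernel code: every toy_* name below is hypothetical, and the model
only assumes that a wake-up walks the queue from the head and stops once
the requested number of exclusive waiters has been woken.  Unlike the
real read path, the toy waiters are never removed from the queue.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's exclusive-waiter flag. */
#define TOY_FLAG_EXCLUSIVE 0x01

struct toy_waiter {
	int id;
	unsigned int flags;
	int woken;
	struct toy_waiter *next;
};

struct toy_queue {
	struct toy_waiter *head;
};

/* Queue a non-exclusive waiter at the head of the list. */
static void toy_add_wait(struct toy_queue *q, struct toy_waiter *w)
{
	w->flags &= ~TOY_FLAG_EXCLUSIVE;
	w->woken = 0;
	w->next = q->head;
	q->head = w;
}

/* Queue an exclusive waiter at the head, so the newest waiter is first. */
static void toy_add_wait_exclusive(struct toy_queue *q, struct toy_waiter *w)
{
	w->flags |= TOY_FLAG_EXCLUSIVE;
	w->woken = 0;
	w->next = q->head;
	q->head = w;
}

/*
 * Walk the queue from the head and wake each waiter.  Stop as soon as
 * nr_exclusive exclusive waiters have been woken.  Returns the number
 * of waiters woken by this call.
 */
static int toy_wake_up(struct toy_queue *q, int nr_exclusive)
{
	int woken = 0;

	for (struct toy_waiter *w = q->head; w; w = w->next) {
		w->woken = 1;
		woken++;
		if ((w->flags & TOY_FLAG_EXCLUSIVE) && --nr_exclusive == 0)
			break;
	}
	return woken;
}
```

In this model, three readers queued with toy_add_wait() are all woken by
one toy_wake_up(&q, 1) call, while three readers queued with
toy_add_wait_exclusive() produce a single wake-up, and it goes to the
reader that queued last (LIFO).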