On Wed, 21 Jul 2021 12:59:39 +0200 Paolo Bonzini wrote:
>On 21/07/21 12:11, Hillf Danton wrote:
>> On Wed, 21 Jul 2021 09:25:32 +0200 Thomas Gleixner wrote:
>>> On Wed, Jul 21 2021 at 15:04, Hillf Danton wrote:
>>>>
>>>> But the preempting waker can not make sense without the waiter who is bloody
>>>> special. Why is it so in the first place? Or it is not at all but the race
>>>> existing from Monday to Friday.
>>>
>>> See the large comment in eventfd_poll().
>>
>> Is it likely for a reader to make eventfd_poll() return 0?
>>
>>      read                 *      poll                               write
>>      ----                 *      -----------------                  ------------
>>                           *      count = ctx->count (INVALID!)
>>                           *                                         lock ctx->qwh.lock
>>                           *                                         ctx->count += n
>>                           *                                         **waitqueue_active is false**
>>                           *                                         **no wake_up_locked_poll!**
>>                           *                                         unlock ctx->qwh.lock
>>
>>   lock ctx->qwh.lock
>>   *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
>>   ctx->count -= *cnt;
>>   **waitqueue_active is false**
>>   unlock ctx->qwh.lock
>>
>>                           *      lock ctx->wqh.lock (in poll_wait)
>>                           *      __add_wait_queue
>>                           *      unlock ctx->wqh.lock
>>                           *      eventfd_poll returns 0
>>       */
>>      count = READ_ONCE(ctx->count);
>>
>
>No, it's simply impossible.  The same comment explains why: "count =
>ctx->count" cannot move above poll_wait's locking of ctx->wqh.lock.

Detect a concurrent reader or writer by reading the event counter before and
after poll_wait(), and take the case of an unstable counter into account when
deciding what poll mask to report.

Cut the big comment, as the added barriers speak for themselves.

+++ x/fs/eventfd.c
@@ -131,49 +131,20 @@ static __poll_t eventfd_poll(struct file
 {
 	struct eventfd_ctx *ctx = file->private_data;
 	__poll_t events = 0;
-	u64 count;
+	u64 c0, count;
+
+	c0 = ctx->count;
+	smp_rmb();
 
 	poll_wait(file, &ctx->wqh, wait);
 
-	/*
-	 * All writes to ctx->count occur within ctx->wqh.lock.  This read
-	 * can be done outside ctx->wqh.lock because we know that poll_wait
-	 * takes that lock (through add_wait_queue) if our caller will sleep.
-	 *
-	 * The read _can_ therefore seep into add_wait_queue's critical
-	 * section, but cannot move above it!  add_wait_queue's spin_lock acts
-	 * as an acquire barrier and ensures that the read be ordered properly
-	 * against the writes.  The following CAN happen and is safe:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     lock ctx->wqh.lock (in poll_wait)
-	 *     count = ctx->count
-	 *     __add_wait_queue
-	 *     unlock ctx->wqh.lock
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        if (waitqueue_active)
-	 *                                          wake_up_locked_poll
-	 *                                        unlock ctx->qwh.lock
-	 *                                        eventfd_poll returns 0
-	 *
-	 * but the following, which would miss a wakeup, cannot happen:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     count = ctx->count (INVALID!)
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        **waitqueue_active is false**
-	 *                                        **no wake_up_locked_poll!**
-	 *                                        unlock ctx->qwh.lock
-	 *                                        lock ctx->wqh.lock (in poll_wait)
-	 *                                        __add_wait_queue
-	 *                                        unlock ctx->wqh.lock
-	 *                                        eventfd_poll returns 0
-	 */
-	count = READ_ONCE(ctx->count);
+	smp_rmb();
+	count = ctx->count;
+
+	if (c0 < count)
+		return EPOLLIN;
+	if (c0 > count)
+		return EPOLLOUT;
 
 	if (count > 0)
 		events |= EPOLLIN;
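
For readability, here is a sketch of how eventfd_poll() would look with the
hunk above applied. The tail below the two new checks is not part of the
change; it is reproduced from my reading of mainline fs/eventfd.c
(EPOLLIN/EPOLLERR/EPOLLOUT reporting) and is only meant to show where the
before/after counter samples fit, not to be authoritative for your tree.

static __poll_t eventfd_poll(struct file *file, poll_table *wait)
{
	struct eventfd_ctx *ctx = file->private_data;
	__poll_t events = 0;
	u64 c0, count;

	/* first sample of the counter, taken before poll_wait() */
	c0 = ctx->count;
	smp_rmb();

	poll_wait(file, &ctx->wqh, wait);

	/* second sample, taken after poll_wait() */
	smp_rmb();
	count = ctx->count;

	if (c0 < count)		/* a writer ran concurrently: readable */
		return EPOLLIN;
	if (c0 > count)		/* a reader ran concurrently: writable */
		return EPOLLOUT;

	/* counter was stable across poll_wait(): report its current state */
	if (count > 0)
		events |= EPOLLIN;
	if (count == ULLONG_MAX)
		events |= EPOLLERR;
	if (ULLONG_MAX - 1 > count)
		events |= EPOLLOUT;

	return events;
}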