On Thu, May 14, 2015 at 10:49:06AM -0700, Linus Torvalds wrote: > On Thu, May 14, 2015 at 10:31 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote: > > +static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx, > > + struct userfaultfd_wake_range *range) > > +{ > > + if (waitqueue_active(&ctx->fault_wqh)) > > + __wake_userfault(ctx, range); > > +} > > Pretty much every single time people use this "if > (waitqueue_active())" model, it tends to be a bug, because it means > that there is zero serialization with people who are just about to go > to sleep. It's fundamentally racy against all the "wait_event()" loops > that carefully do memory barriers between testing conditions and going > to sleep, because the memory barriers now don't exist on the waking > side. As far as 10/23 is concerned, the __wake_userfault taking the locks would also ignore any "incoming" not yet "read(2)" userfault. "read(2)" as in syscall read. So there was no race to worry about as "incoming" userfaults would be ignored anyway by the wake. The only case that had to be reliable in not missing wakeup events was the "release" file operation. But that doesn't use waitqueue_active and it relies on ctx->released combined with mmap_sem taken for writing. http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/tree/fs/userfaultfd.c?h=userfault&id=2f73ffa8267e41f04fe6b0d93d23feed45c0980a#n267 However later (after I started documenting what userland should do) I started to question this old model of leaving the race handling to userland. It's way too complex to leave the race handling to userland. Furthermore it's inefficient because there would be lots of spurious userfaults that we could have been waken up within the kernel before userland could have a chance to read them. So then I handled the race in the kernel in patch 17/23. That allowed to drop hundres of line of locking code from qemu (two bits per page and a mutex and lots of complexity) and then I didn't need to document the complex rules described in this commit header: http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=5a2b3614e107482da9a1fcfcd120d30eb62d45dc You can see from the patch it complicated the kernel by adding a pagetable walk, but it's worth it because it's simpler than solving it in userland, plus its solved at once for all users. The only cons is that the old model would allow userland to be way more consistent in enforcing asserts as it had the control. With the current model POLLIN may be returned from poll() despite the later read returns -EAGAIN, so it basically requires a non blocking open if people uses poll(). (blocking reads without poll are still fine) Now back to your question: the waitqueue_active is still there in the current model as well (since 17/23). Since 17/23 losing a wakeup (as far as qemu is concerned) would mean that qemu would read the address of the fault that didn't get waken up because of the race, then it would send a page request to the source node (it had no state in the destination node where the userfault runs to know if it was a "dup"), which would discard the request (not sending the page twice) noticing in its simple per-page bitmap that it was already sent. So with the new model after 17/23 qemu, losing a wakeup is a bug. The wait_event is like this: handle_userfault (wait_event) ------- spin_lock(&ctx->fault_pending_wqh.lock); /* * After the __add_wait_queue the uwq is visible to userland * through poll/read(). */ __add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq); /* * The smp_mb() after __set_current_state prevents the reads * following the spin_unlock to happen before the list_add in * __add_wait_queue. */ set_current_state(TASK_KILLABLE); spin_unlock(&ctx->fault_pending_wqh.lock); must_wait = userfaultfd_must_wait(ctx, address, flags, reason); up_read(&mm->mmap_sem); if (likely(must_wait && !ACCESS_ONCE(ctx->released) && !fatal_signal_pending(current))) { wake_up_poll(&ctx->fd_wqh, POLLIN); schedule(); The wakeup side is: userfaultfd_copy ----------- ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src, uffdio_copy.len); if (unlikely(put_user(ret, &user_uffdio_copy->copy))) return -EFAULT; if (ret < 0) goto out; BUG_ON(!ret); /* len == 0 would wake all */ range.len = ret; if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) { range.start = uffdio_copy.dst; wake_userfault(ctx, &range); wake_userfault is the function that is using waitqueue_active so the problem materializes if waitqueue_active moves before mcopy_atomic. Precisely it should move before the set_pte_at below: mcopy_atomic_pte ------ set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); pte_unmap_unlock(dst_pte, ptl); unlock would allow the read to be reordered before it even on x86. There's an up_read as well in between but it has the same problem. So you're right that theoretically we can miss a wakeup. Practically it sounds unlikely because of the sheer size of mcopy_atomic and we never experienced it but it's still a bug. > So I'm making a new rule: if you use waitqueue_active(), I want an > explanation for why it's not racy with the waiter. A big comment > about the memory ordering, or about higher-level locks that are held > by the caller, or something. Linus To fix it I added this along a comment: diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index c89e96f..6be316b 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -593,6 +593,21 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx, static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx, struct userfaultfd_wake_range *range) { + /* + * To be sure waitqueue_active() is not reordered by the CPU + * before the pagetable update, use an explicit SMP memory + * barrier here. PT lock release or up_read(mmap_sem) still + * have release semantics that can allow the + * waitqueue_active() to be reordered before the pte update. + */ + smp_mb(); + + /* + * Use waitqueue_active because it's very frequent to + * change the address space atomically even if there are no + * userfaults yet. So we take the spinlock only when we're + * sure we've userfaults to wake. + */ if (waitqueue_active(&ctx->fault_pending_wqh) || waitqueue_active(&ctx->fault_wqh)) __wake_userfault(ctx, range); The wait_event/handle_userfault side already has a smp_mb() to prevent the lockless pagetable walk to be reordered before the list_add in __add_wait_queue (needed as well precisely because of the waitqueue_active optimization that I'd like to keep). Thanks, Andrea -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html