Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released

Dimitris Siakavaras <jimsiak@xxxxxxxxxxxxxxxxx> · Tue, 18 Jul 2023 17:33:12 +0300

Hi, this is my first bug report, so I apologise in advance for any 
missing information or unclear explanations. I am happy to provide any 
further information or to revise this email as needed.

Problem: Using userfaultfd for a process that uses KVM and triggers the 
asynchronous page fault handling causes processes to hang forever.
Processor: AMD EPYC 7402 24-Core Processor
Kernel version: 5.13 (the problem also occurs on 6.4.3 and 6.5-rc2)

Unfortunately, my execution environment involves a fairly complex set of 
components to set up, so it is not straightforward for me to share code 
that reproduces the issue. I will instead try to explain the problem as 
clearly as possible.

I have two processes:
1. A firecracker VM process (https://firecracker-microvm.github.io/) 
which uses KVM.
2. A second process that handles the userfaults of the firecracker 
process.

The race condition involves the released field of the userfaultfd_ctx 
structure.
More specifically:

* Process 2 invokes the close() system call for the userfaultfd 
descriptor, thus triggering the execution of userfaultfd_release() in 
the kernel.
  userfaultfd_release() contains the following lines of code:

    WRITE_ONCE(ctx->released, true);

    if (!mmget_not_zero(mm))
        goto wakeup;

    /*
     * Flush page faults out of all CPUs. NOTE: all page faults
     * must be retried without returning VM_FAULT_SIGBUS if
     * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
     * changes while handle_userfault released the mmap_lock. So
     * it's critical that released is set to true (above), before
     * taking the mmap_lock for writing.
     */
    mmap_write_lock(mm);

* Process 1 takes a page fault while running inside KVM_ENTRY. This 
triggers the execution of kvm_tdp_page_fault(), and the following 
function call chain is executed:

kvm_tdp_page_fault() -> direct_page_fault() -> try_async_pf() -> 
kvm_arch_setup_async_pf() -> kvm_setup_async_pf()

kvm_setup_async_pf() queues async_pf_execute() on a workqueue:
    INIT_WORK(&work->work, async_pf_execute);

Then, the following function call chain is executed:
async_pf_execute() -> get_user_pages_remote() -> 
__get_user_pages_remote() -> __get_user_pages_locked() -> __get_user_pages()

__get_user_pages() is called with mmap_lock held and contains the 
following code:
retry:
        /*
         * If we have a pending SIGKILL, don't keep faulting pages and
         * potentially allocating memory.
         */
        if (fatal_signal_pending(current)) {
            ret = -EINTR;
            goto out;
        }
        cond_resched();

        page = follow_page_mask(vma, start, foll_flags, &ctx);
        if (!page) {
            ret = faultin_page(vma, start, &foll_flags, locked);
            switch (ret) {
            case 0:
                goto retry;

When faultin_page() is called here it will in turn call the following 
chain of functions:

faultin_page() -> handle_mm_fault() -> __handle_mm_fault() -> 
handle_pte_fault() -> do_anonymous_page() -> handle_userfault()

handle_userfault(), the last function in the chain, is where 
userfaultfd handles the fault. It contains the following code:

if (unlikely(READ_ONCE(ctx->released))) {
        /*
         * Don't return VM_FAULT_SIGBUS in this case, so a non
         * cooperative manager can close the uffd after the
         * last UFFDIO_COPY, without risking to trigger an
         * involuntary SIGBUS if the process was starting the
         * userfaultfd while the userfaultfd was still armed
         * (but after the last UFFDIO_COPY). If the uffd
         * wasn't already closed when the userfault reached
         * this point, that would normally be solved by
         * userfaultfd_must_wait returning 'false'.
         *
         * If we were to return VM_FAULT_SIGBUS here, the non
         * cooperative manager would be instead forced to
         * always call UFFDIO_UNREGISTER before it can safely
         * close the uffd.
         */
        ret = VM_FAULT_NOPAGE;
        goto out;
}

The problem is that once ctx->released has been set to true by 
userfaultfd_release() (called by Process 2), handle_userfault() returns 
VM_FAULT_NOPAGE because of the above if statement.
handle_mm_fault() therefore returns VM_FAULT_NOPAGE to faultin_page(), 
and faultin_page() in turn returns 0.
Back at the call to faultin_page() in __get_user_pages(), the "case 0:" 
statement sends execution back to the retry label. Since ctx->released 
is never reset to false, this loop continues forever: Process 1 keeps 
calling faultin_page(), getting 0 as the return value, going back to 
retry, and so on.
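To make the loop easier to follow, here is a minimal userspace sketch of 
the return-value chain described above. It is not kernel code: every 
model_* name, the MODEL_* constants and the iteration cap are hypothetical 
stand-ins that I introduce only for illustration, and the "not released" 
branch is heavily simplified (the real code would queue the fault and 
wait for the userfaultfd handler).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel fault codes; the values are arbitrary. */
enum { MODEL_VM_FAULT_NOPAGE = 1, MODEL_VM_FAULT_BLOCKED = 2 };

struct model_uffd_ctx {
    bool released; /* mirrors ctx->released: set once, never cleared */
};

/* Models only the early-exit branch of handle_userfault() quoted above. */
static int model_handle_userfault(const struct model_uffd_ctx *ctx)
{
    if (ctx->released)
        return MODEL_VM_FAULT_NOPAGE;
    /* Simplification: the real code would queue the fault and sleep. */
    return MODEL_VM_FAULT_BLOCKED;
}

/* Models faultin_page(): NOPAGE reaches the caller as 0, i.e. "retry". */
static int model_faultin_page(const struct model_uffd_ctx *ctx)
{
    int fault = model_handle_userfault(ctx);

    return fault == MODEL_VM_FAULT_NOPAGE ? 0 : -1;
}

/*
 * Models the retry loop in __get_user_pages(). The page lookup is assumed
 * to fail on every pass because nothing ever installs a page. Returns the
 * number of passes made, capped at max_iters so that the model terminates.
 */
static unsigned long model_gup_retry(const struct model_uffd_ctx *ctx,
                                     unsigned long max_iters)
{
    unsigned long iters = 0;

    assert(ctx != NULL);
    while (iters < max_iters) {
        iters++;
        if (model_faultin_page(ctx) != 0)
            break;      /* any other result leaves the loop */
        /* case 0: goto retry */
    }
    return iters;
}
```

With released set to true, model_gup_retry() always runs until the cap 
is reached, whatever cap is chosen. That is the model's equivalent of the 
endless loop: no pass changes any state that the next pass looks at.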

Given that Process 1 still holds the mmap_lock and will never release 
it, Process 2 will also hang in the call to mmap_write_lock(mm).

This results in both processes being stuck in a deadlock/livelock situation.
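The lock dependency between the two sides can be sketched in the same 
way. This is a toy single-threaded model, not the kernel's rwsem: the 
model_* names and the step-based interleaving are assumptions made for 
illustration, and the loop_terminates flag describes a hypothetical 
situation in which the retry loop would exit and drop the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader/writer lock standing in for mmap_lock. */
struct model_mmap_lock {
    unsigned int readers;
    bool writer;
};

static bool model_read_trylock(struct model_mmap_lock *lock)
{
    if (lock->writer)
        return false;
    lock->readers++;
    return true;
}

static void model_read_unlock(struct model_mmap_lock *lock)
{
    assert(lock->readers > 0);
    lock->readers--;
}

/* A writer may only enter once there are no readers and no other writer. */
static bool model_write_trylock(struct model_mmap_lock *lock)
{
    if (lock->writer || lock->readers > 0)
        return false;
    lock->writer = true;
    return true;
}

/*
 * The faulting side takes the lock for reading and then spins in its
 * retry loop. The closing side tries to take the lock for writing once
 * per step. Returns true if the closing side obtains the lock within
 * `steps` steps.
 */
static bool model_closer_gets_lock(bool loop_terminates, unsigned int steps)
{
    struct model_mmap_lock lock = { 0, false };
    bool faulter_holds = model_read_trylock(&lock);

    for (unsigned int i = 0; i < steps; i++) {
        if (faulter_holds && loop_terminates) {
            /* Hypothetical: the retry loop exits and drops the lock. */
            model_read_unlock(&lock);
            faulter_holds = false;
        }
        if (model_write_trylock(&lock))
            return true;
    }
    return false;
}
```

In this model the closing side obtains the lock only when loop_terminates 
is true. When the reader never leaves its loop, the writer cannot enter no 
matter how many steps are allowed, which corresponds to Process 2 waiting 
in mmap_write_lock(mm).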

Unfortunately, I have only limited knowledge of the mm kernel subsystem, 
so I am not able to propose a fix, but I hope that someone with more 
experience in kernel development can come up with a proper solution.

Thank you very much,
Best Regards,
Dimitris Siakavaras