Re: [RFC] Improving userfaultfd scalability for live migration

David Matlack <dmatlack@xxxxxxxxxx> · Wed, 7 Dec 2022 17:56:43 -0800

On Tue, Dec 6, 2022 at 12:41 PM James Houghton <jthoughton@xxxxxxxxxx> wrote:
> On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > Can you elaborate on what makes it better?  Or maybe generate a list of pros and
> > cons?  I can think of (dis)advantages for both approaches, but I haven't identified
> > anything that would be a blocking issue for either approach.  Doesn't mean there
> > isn't one or more blocking issues, just that I haven't thought of any :-)
>
> Let's see.... so using no-slow-GUP over no UFFD waiting:
> - No need to take mmap_lock in mem fault path.
> - Change the relevant __gfn_to_pfn_memslot callers
> (kvm_faultin_pfn/user_mem_abort/others?) to set `atomic = true` if the
> new CAP is used.
> - No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere
> in/near kvm_faultin_pfn/user_mem_abort).
> - Userspace has to indirectly figure out the state of the page tables
> to know what action to take (which introduces some weirdness, like if
> anyone MADV_DONTNEEDs some guest memory, we need to know).

I'm no expert but I believe a guest access to MADV_DONTNEED'd GFN
would just cause a new page to be allocated by the kernel. So I think
userspace can still blindly do MADV_POPULATE_WRITE in this case. Were
there any other scenarios you had in mind?

> - While userfaultfd is registered (so like during post-copy), any
> hva_to_pfn() calls that were resolvable with slow GUP before (without
> dropping into handle_userfault()) will now need to be resolved by
> userspace manually with a call to MADV_POPULATE_WRITE. This extra trip
> to userspace could slow things down.

Is there any way to enable fast-gup to identify when a PTE is not
present due to userfaultfd specifically without taking the mmap_lock
(e.g. using an unused bit in the PTE)? Then we could avoid extra trips
to userspace for MADV_POPULATE_WRITE.

>
> Both of these seem pretty simple to implement in the kernel; the most
> complicated part is just returning KVM_EXIT_MEMORY_FAULT in more
> places / for other architectures (I care about x86 and arm64).
>
> Right now both approaches seem fine to me. Not having to take the
> mmap_lock in the fault path, while being such a minor difference now,
> could be a huge benefit if we can later get around to making
> UFFDIO_CONTINUE not need the mmap lock. Disregarding that, not
> requiring userspace to guess the state of the page tables seems
> helpful (less bug-prone, I guess).
>
> >
> > > When KVM_RUN exits:
> > > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
> > > - If we have, then something bad has happened. Slow GUP already ran
> > > and failed, so we need to treat this in the same way we treat a
> > > MADV_POPULATE_WRITE failure above: userspace might just want to crash
> > > (or inject a memory error or something).
> > >
> > > - James