On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> >
> > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > What I wanted to do is to understand whether there's still chance to
> > > provide a generic solution. I don't know why you have had a bunch of pmu
> > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > events when doing the test? Let me know if you figure out why it happened
> > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > with such questions.
> > >
> > > The major problem I had with this series is it's definitely not a clean
> > > approach. Say, even if you'll all rely on userapp you'll still need to
> > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > IIUC that's also the concern from Nadav.
> >
> > This is a long thread, so apologies if the following has already been discussed.
> >
> > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > userspace to create a userfaultfd that is tied to a specific task. Any
> > userfaults encountered by that task use that fd, rather than the
> > process-wide fd. I'm making the assumption here that each of these fds
> > would have independent signaling mechanisms/queues and so this would
> > solve the scaling problem.
> >
> > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > vCPU for handling userfault requests. This seems like it'd have
> > roughly the same scalability characteristics as the KVM -EFAULT
> > approach.
>
> I think this would work in principle, but it's significantly different
> from what exists today.
>
> The splitting of userfaultfds Peter is describing is splitting up the
> HVA address space, not splitting per-thread.
>
> I think for this design, we'd need to change UFFD registration so
> multiple UFFDs can register the same VMA, but can be filtered so they
> only receive fault events caused by some particular tid(s).
>
> This might also incur some (small?) overhead, because in the fault
> path we now need to maintain some data structure so we can lookup
> which UFFD to notify based on a combination of the address and our
> tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.

I was (perhaps naively) assuming the lookup would be as simple as:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..e9856e2ba9ef 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
         */
        mmap_assert_locked(mm);

-       ctx = vma->vm_userfaultfd_ctx.ctx;
+       if (current->userfaultfd_ctx)
+               ctx = current->userfaultfd_ctx;
+       else
+               ctx = vma->vm_userfaultfd_ctx.ctx;
        if (!ctx)
                goto out;

>
> I think it's worth keeping in mind that a selling point of Anish's
> approach is that it's a very small change. It's plausible we can come
> up with some alternative way to scale, but it seems to me everything
> suggested so far is likely to require a lot more code, complexity, and
> effort vs. Anish's approach.

Agreed. Mostly I think the per-thread UFFD approach would add complexity
on the userspace side of things. With Anish's approach userspace is able
to trivially re-use the vCPU thread (and its associated pCPU if pinned)
to handle the request. That gets more complicated when juggling the
extra paired threads.
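Roughly, something like the following sketch (illustration only: the
-EFAULT return, KVM_EXIT_MEMORY_FAULT, and the run->memory_fault fields
are taken from Anish's proposal and may not match the final UAPI;
demand_fetch_gpa() and handle_other_exit() are made-up VMM helpers):

#include <err.h>
#include <errno.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Made-up VMM helpers, just for illustration. */
extern void demand_fetch_gpa(__u64 gpa, __u64 size);
extern void handle_other_exit(struct kvm_run *run);

static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                int ret = ioctl(vcpu_fd, KVM_RUN, 0);

                if (ret < 0 && errno != EFAULT)
                        err(1, "KVM_RUN");

                if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                        /*
                         * Resolve the fault right here, on the vCPU
                         * thread (and its pinned pCPU), then re-enter
                         * the guest -- no handoff to a separate
                         * fault-handling thread.
                         */
                        demand_fetch_gpa(run->memory_fault.gpa,
                                         run->memory_fault.size);
                        continue;
                }

                handle_other_exit(run);
        }
}

How the VMM actually populates the page is out of scope here; the point
is just that no second-thread wakeup sits on the critical path.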
The per-thread approach would require a new userfaultfd UAPI change,
which I think is a higher bar than the KVM UAPI change proposed here.

The per-thread approach would also require KVM to call into slow GUP and
take the mmap_lock before contacting userspace. I'm not 100% convinced
that's a bad thing long term (e.g. it avoids the false-positive -EFAULT
exits in Anish's proposal), but it could have performance implications.

Lastly, inter-thread communication is likely slower than returning to
userspace from KVM_RUN, so the per-thread approach might increase the
end-to-end latency of demand fetches.
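For comparison, here is a minimal sketch of the paired handler thread
the per-vCPU userfaultfd approach implies on the userspace side
(assuming such a per-vCPU uffd existed; the read()/UFFDIO_COPY mechanics
below are today's uffd UAPI, fetch_page_into() is a made-up helper, and
a 4KiB page size is assumed). The faulting vCPU thread blocks in the
kernel until this thread has read the event and completed UFFDIO_COPY,
which is exactly the extra wakeup/handoff referred to above:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL        /* assumed 4KiB pages */

/* Made-up helper: fills "dst" with the contents for host VA "hva". */
extern void fetch_page_into(char *dst, unsigned long hva);

/* One of these threads would be paired with each vCPU's userfaultfd. */
static void *uffd_handler_thread(void *arg)
{
        int uffd = (int)(long)arg;
        char *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (page == MAP_FAILED)
                return NULL;

        for (;;) {
                struct uffd_msg msg;
                struct uffdio_copy copy;

                /* Blocks until the paired vCPU thread hits a fault. */
                if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                        continue;
                if (msg.event != UFFD_EVENT_PAGEFAULT)
                        continue;

                fetch_page_into(page, msg.arg.pagefault.address);

                copy.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
                copy.src = (unsigned long)page;
                copy.len = PAGE_SIZE;
                copy.mode = 0;
                /* Wakes the vCPU thread blocked in handle_userfault(). */
                ioctl(uffd, UFFDIO_COPY, &copy);
        }
        return NULL;
}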