Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.

On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > What I wanted to do is understand whether there's still a chance to
> > provide a generic solution.  I don't know why you had a bunch of pmu
> > stack frames showing up in the graph; perhaps you forgot to disable
> > some of the perf events when doing the test?  Let me know if you
> > figure out why it happened like that (so far I haven't), but I feel
> > bad about continuing to overload you with such questions.
> >
> > The major problem I have with this series is that it's definitely not
> > a clean approach.  Say, even if you rely entirely on the userspace
> > app, you'll still need to rely on userfaultfd for kernel traps in
> > corner cases, or it just won't work.  IIUC that's also Nadav's concern.
>
> This is a long thread, so apologies if the following has already been discussed.
>
> Would per-tid userfaultfd support be a generic solution? i.e., allow
> userspace to create a userfaultfd that is tied to a specific task. Any
> userfaults encountered by that task would use that fd rather than the
> process-wide fd. I'm assuming here that each of these fds would have
> independent signaling mechanisms/queues, and so this would solve the
> scaling problem.
>
> A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> vCPU for handling userfault requests. This seems like it'd have
> roughly the same scalability characteristics as the KVM -EFAULT
> approach.

I think this would work in principle, but it's significantly different
from what exists today.

The userfaultfd splitting Peter is describing partitions the HVA
address space; it doesn't split fault delivery per-thread.

I think for this design, we'd need to change UFFD registration so that
multiple UFFDs can register the same VMA, with each one filtered so it
only receives the fault events caused by some particular tid(s).
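
A minimal sketch of what the uAPI might look like, purely for
illustration (UFFDIO_REGISTER_MODE_TID, the tid field, and
struct uffdio_register_tid are all invented here; nothing like them
exists in the uAPI today):

        /* Hypothetical uAPI sketch: invented for illustration only. */
        struct uffdio_register_tid {
                struct uffdio_range range;
                __u64 mode;     /* UFFDIO_REGISTER_MODE_MISSING | ..._MODE_TID */
                __u64 ioctls;   /* out: ioctls valid for this range */
                __u32 tid;      /* only deliver faults hit by this task */
                __u32 pad;
        };

        /* Per-vCPU setup: each vCPU thread opens its own uffd and
         * registers the same guest memory range, filtered to its own tid. */
        int vcpu_uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_register_tid reg = {
                .range = { .start = guest_mem, .len = guest_mem_len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_TID,
                .tid   = vcpu_tid,
        };
        if (ioctl(vcpu_uffd, UFFDIO_REGISTER, &reg))
                err(1, "UFFDIO_REGISTER");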

This might also incur some (small?) overhead, because in the fault
path we'd now need to maintain some data structure so we can look up
which UFFD to notify based on the combination of the faulting address
and our tid. Today, since VMAs and UFFDs are 1:1, this lookup is
trivial.
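
Roughly, and glossing over locking, the fault path would go from
reading vma->vm_userfaultfd_ctx directly to something like the
following (struct uffd_tid_ctx and the per-VMA uffd_tid_ctxs list are
hypothetical, just to show the shape of the lookup):

        /* Hypothetical fault-path lookup: none of these structures or
         * fields exist today. With per-tid registration, the fault
         * handler would search a per-VMA list (or hash) of contexts
         * keyed by tid instead of reading vma->vm_userfaultfd_ctx. */
        struct uffd_tid_ctx {
                struct list_head list;
                pid_t tid;                      /* task this ctx is filtered to */
                struct userfaultfd_ctx *ctx;
        };

        static struct userfaultfd_ctx *
        uffd_ctx_for_fault(struct vm_area_struct *vma, pid_t tid)
        {
                struct uffd_tid_ctx *tc;

                /* hypothetical per-VMA list of (tid, ctx) registrations */
                list_for_each_entry(tc, &vma->uffd_tid_ctxs, list) {
                        if (tc->tid == tid)
                                return tc->ctx;
                }
                return NULL;    /* fall back to the process-wide ctx, if any */
        }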

I think it's worth keeping in mind that a major selling point of
Anish's approach is that it's a very small change. It's plausible we
could come up with some alternative way to scale, but everything
suggested so far seems likely to require a lot more code, complexity,
and effort than Anish's approach does.



