On 2/16/2024 12:53 AM, Anish Moorthy wrote:
This series adds an option to cause stage-2 fault handlers to
KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
the userspace mappings. Doing so allows userspace to receive stage-2
faults directly from KVM_RUN instead of through userfaultfd, which
suffers from serious contention issues as the number of vCPUs scales.
Thanks for your work!
:D
So this is an alternative approach for a userspace VMM like QEMU to do
post-copy live migration using KVM_MEMORY_FAULT_EXIT instead of
userfaultfd, which seems to get slower with more vCPUs.
Maybe I am missing something here; I'm just curious how a userspace VMM,
e.g. QEMU, would copy the memory with this approach once the page is
available from the remote host. Earlier that was done with UFFDIO_COPY.
This new capability is meant to be used *alongside* userfaultfd during
post-copy: it's not a replacement. KVM_RUN can generate page faults
from outside the stage-2 fault handlers (IIUC instruction emulation is
one source), and these paths are unchanged, so it's important that
userspace still UFFDIO_REGISTERs KVM's mapping and reads from the UFFD
to catch these guest accesses. But with the new
KVM_MEM_EXIT_ON_MISSING memslot flag set, the stage-2 handlers will
report needing to fault in memory via KVM_MEMORY_FAULT_EXIT instead of
queuing onto the UFFD.
In the workloads I've tested, the vast majority of guest-generated
page faults (99%+) come from the stage-2 handlers. So this series
"solves" the issue of contention on the UFFD file descriptor by
(mostly) sidestepping it.
As for how userspace actually uses the new functionality: when a vCPU
thread receives a KVM_MEMORY_FAULT_EXIT for an unfetched page during
post-copy, it might
(a) Fetch the page
(b) Install the page into KVM's mapping via UFFDIO_COPY (don't
necessarily need to UFFDIO_WAKE!)
(c) Call KVM_RUN to re-enter the guest and retry the access. The
stage-2 fault handler will fire again but almost certainly won't
KVM_MEMORY_FAULT_EXIT now (since the UFFDIO_COPY will have mapped the
page), so the guest can continue.
and userspace can continue using some thread(s) to
(a) Read page faults from the UFFD.
(b) Install the page using UFFDIO_COPY + UFFDIO_WAKE
(c) goto (a)
to make sure it catches everything. The combination of these two things
adds up to more performant "uffd-based" postcopy.
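
To make the two flows above a bit more concrete, here is a minimal
sketch of the "install the page" step they share. The helper names
(make_copy_req, install_page) are made up for illustration and are not
from the series. It assumes a power-of-two page size, and it assumes
that an EEXIST from UFFDIO_COPY can be treated as "another thread
already installed this page". Only the standard userfaultfd uAPI is
used; the KVM exit handling itself is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Build a UFFDIO_COPY request for a single page (hypothetical helper).
 * wake == false sets UFFDIO_COPY_MODE_DONTWAKE, which is what a vCPU
 * thread could use when nothing is sleeping on the fault. The UFFD
 * reader thread(s) would pass wake == true so blocked faulters resume.
 */
struct uffdio_copy make_copy_req(uint64_t dst, uint64_t src,
                                 uint64_t page_size, bool wake)
{
    struct uffdio_copy req;

    /* Assumption: page_size is a power of two and dst is page aligned. */
    assert((dst & (page_size - 1)) == 0);

    memset(&req, 0, sizeof(req));
    req.dst = dst;
    req.src = src;
    req.len = page_size;
    req.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;
    return req;
}

/*
 * Install one fetched page into the registered mapping.
 * Returns 0 on success, or if the page was already present (EEXIST,
 * assumed here to mean a racing thread won), and -errno otherwise.
 */
int install_page(int uffd, uint64_t dst, uint64_t src,
                 uint64_t page_size, bool wake)
{
    struct uffdio_copy req = make_copy_req(dst, src, page_size, wake);

    if (ioctl(uffd, UFFDIO_COPY, &req) == 0)
        return 0;
    return errno == EEXIST ? 0 : -errno;
}
```

A vCPU thread would call something like install_page(uffd, hva, buf,
page_size, false) after a KVM_MEMORY_FAULT_EXIT and then go back into
KVM_RUN, while the UFFD reader thread would pass wake = true.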
I'm of course skimming over some details (e.g. when two vCPU threads
race to fetch a page, one of them should probably MADV_POPULATE_WRITE
somehow), but I hope this is helpful. My patch to the KVM demand
paging selftest might also clarify things a bit [1].
One other small detail: you can equally use UFFDIO_CONTINUE,
depending on how the rest of the live migration implementation works.
Really briefly, this series should be viewed as an alternate (and more
scalable) mechanism to find out that a fault occurred. The way
userspace then *resolves* the fault (whether via UFFDIO_COPY or
UFFDIO_CONTINUE) can remain the same as before.
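
As a rough illustration of that split between detecting and resolving
a fault, the sketch below keeps the resolution step behind one
function that issues either UFFDIO_COPY or UFFDIO_CONTINUE. The enum
and function names are hypothetical. It assumes kernel headers new
enough to define UFFDIO_CONTINUE, and that for the CONTINUE case the
page contents are already in the page cache (so src is unused).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical selector for how the VMM resolves a missing page. */
enum resolve_kind { RESOLVE_COPY, RESOLVE_CONTINUE };

/*
 * Resolve one fault at dst, however it was discovered (a
 * KVM_MEMORY_FAULT_EXIT on a vCPU thread or a message read from the
 * UFFD). Returns 0 on success and -errno on failure.
 */
int resolve_fault(int uffd, enum resolve_kind kind, uint64_t dst,
                  uint64_t src, uint64_t len, bool wake)
{
    int ret;

    if (kind == RESOLVE_COPY) {
        struct uffdio_copy req;

        memset(&req, 0, sizeof(req));
        req.dst = dst;
        req.src = src;
        req.len = len;
        req.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;
        ret = ioctl(uffd, UFFDIO_COPY, &req);
    } else {
        struct uffdio_continue req;

        /* src is ignored: the data is assumed to be in place already. */
        memset(&req, 0, sizeof(req));
        req.range.start = dst;
        req.range.len = len;
        req.mode = wake ? 0 : UFFDIO_CONTINUE_MODE_DONTWAKE;
        ret = ioctl(uffd, UFFDIO_CONTINUE, &req);
    }
    return ret == 0 ? 0 : -errno;
}
```

The caller only decides whether a wake-up is needed; which ioctl gets
used is a property of the migration implementation, not of how the
fault was reported.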
That clarifies. Thank you!
Best regards,
Pankaj