Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jul 15, 2024 at 8:28 AM Wang, Wei W <wei.w.wang@xxxxxxxxx> wrote:
>
> On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> > This patch series implements the KVM-based demand paging system that was
> > first introduced back in November[1] by David Matlack.
> >
> > The working name for this new system is KVM Userfault, but that name is very
> > confusing so it will not be the final name.
> >
> Hi James,
> I had implemented a similar approach for TDX post-copy migration, there are quite
> some differences though. Got some questions about your design below.

Thanks for the feedback!!

>
> > Problem: post-copy with guest_memfd
> > ===================================
> >
> > Post-copy live migration makes it possible to migrate VMs from one host to
> > another no matter how fast they are writing to memory while keeping the VM
> > paused for a minimal amount of time. For post-copy to work, we
> > need:
> >  1. to be able to prevent KVM from being able to access particular pages
> >     of guest memory until we have populated it  2. for userspace to know when
> > KVM is trying to access a particular
> >     page.
> >  3. a way to allow the access to proceed.
> >
> > Traditionally, post-copy live migration is implemented using userfaultfd, which
> > hooks into the main mm fault path. KVM hits this path when it is doing HVA ->
> > PFN translations (with GUP) or when it itself attempts to access guest memory.
> > Userfaultfd sends a page fault notification to userspace, and KVM goes to sleep.
> >
> > Userfaultfd works well, as it is not specific to KVM; everyone who attempts to
> > access guest memory will block the same way.
> >
> > However, with guest_memfd, we do not use GUP to translate from GFN to HPA
> > (nor is there an intermediate HVA).
> >
> > So userfaultfd in its current form cannot be used to support post-copy live
> > migration with guest_memfd-backed VMs.
> >
> > Solution: hook into the gfn -> pfn translation
> > ==============================================
> >
> > The only way to implement post-copy with a non-KVM-specific userfaultfd-like
> > system would be to introduce the concept of a file-userfault[2] to intercept
> > faults on a guest_memfd.
> >
> > Instead, we take the simpler approach of adding a KVM-specific API, and we
> > hook into the GFN -> HVA or GFN -> PFN translation steps (for traditional
> > memslots and for guest_memfd respectively).
>
>
> Why taking KVM_EXIT_MEMORY_FAULT faults for the traditional shared
> pages (i.e. GFN -> HVA)?
> It seems simpler if we use KVM_EXIT_MEMORY_FAULT for private pages only, leaving
> shared pages to go through the existing userfaultfd mechanism:
> - The need for “asynchronous userfaults,” introduced by patch 14, could be eliminated.
> - The additional support (e.g., KVM_MEMORY_EXIT_FLAG_USERFAULT) for private page
>   faults exiting to userspace for postcopy might not be necessary, because all pages on the
>   destination side are initially “shared,” and the guest’s first access will always cause an
>   exit to userspace for shared->private conversion. So VMM is able to leverage the exit to
>   fetch the page data from the source (VMM can know if a page data has been fetched
>   from the source or not).

You're right that, today, including support for guest-private memory
*only* indeed simplifies things (no async userfaults). I think your
strategy for implementing post-copy would work (so, shared->private
conversion faults for vCPU accesses to private memory, and userfaultfd
for everything else).

I'm not 100% sure what should happen in the case of a non-vCPU access
to should-be-private memory; today it seems like KVM just provides the
shared version of the page, so conventional use of userfaultfd
shouldn't break anything.

But eventually guest_memfd itself will support "shared" memory, and
(IIUC) it won't use VMAs, so userfaultfd won't be usable (without
changes anyway). For a non-confidential VM, all memory will be
"shared", so shared->private conversions can't help us there either.
Starting everything as private almost works (so using private->shared
conversions as a notification mechanism), but if the first time KVM
attempts to use a page is not from a vCPU (and is from a place where
we cannot easily return to userspace), the need for "async userfaults"
comes back.

For this use case, it seems cleaner to have a new interface. (And, as
far as I can tell, we would at least need some kind of "async
userfault"-like mechanism.)

Another reason why, today, KVM Userfault is helpful is that
userfaultfd has a couple drawbacks. Userfaultfd migration with
HugeTLB-1G is basically unusable, as HugeTLB pages cannot be mapped at
PAGE_SIZE. Some discussion here[1][2].

Moving the implementation of post-copy to KVM means that, throughout
post-copy, we can avoid changes to the main mm page tables, and we
only need to modify the second stage page tables. This saves the
memory needed to store the extra set of shattered page tables, and we
save the performance overhead of the page table modifications and
accounting that mm does.

There's some more discussion about these points in David's RFC[3].

[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@xxxxxxxxxx/
[2]: https://lore.kernel.org/linux-mm/ZdcKwK7CXgEsm-Co@x1n/
[3]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@xxxxxxxxxxxxxx/

>
> >

> > I have intentionally added support for traditional memslots, as the complexity
> > that it adds is minimal, and it is useful for some VMMs, as it can be used to
> > fully implement post-copy live migration.
> >
> > Implementation Details
> > ======================
> >
> > Let's break down how KVM implements each of the three core requirements
> > for implementing post-copy as laid out above:
> >
> > --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> >
> > The most straightforward way to inform KVM of userfault-enabled pages is to
> > use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> >
> > There is already infrastructure in place for modifying and checking memory
> > attributes. Using this interface is slightly challenging, as there is no UAPI for
> > setting/clearing particular attributes; we must set the exact attributes we want.
> >
> > The synchronization that is in place for updating memory attributes is not
> > suitable for post-copy live migration either, which will require updating
> > memory attributes (from userfault to no-userfault) very frequently.
> >
> > Another potential interface could be to use something akin to a dirty bitmap,
> > where a bitmap describes which pages within a memslot (or VM) should trigger
> > userfaults. This way, it is straightforward to make updates to the userfault
> > status of a page cheap.
> >
> > When KVM Userfault is enabled, we need to be careful not to map a userfault
> > page in response to a fault on a non-userfault page. In this RFC, I've taken the
> > simplest approach: force new PTEs to be PAGE_SIZE.
> >
> > --- Page fault notifications ---
> >
> > For page faults generated by vCPUs running in guest mode, if the page the
> > vCPU is trying to access is a userfault-enabled page, we use
>
> Why is it necessary to add the per-page control (with uAPIs for VMM to set/clear)?
> Any functional issues if we just have all the page faults exit to userspace during the
> post-copy period?
> - As also mentioned above, userspace can easily know if a page needs to be
>   fetched from the source or not, so upon a fault exit to userspace, VMM can
>   decide to block the faulting vcpu thread or return back to KVM immediately.
> - If improvement is really needed (would need profiling first) to reduce number
>   of exits to userspace, a  KVM internal status (bitmap or xarray) seems sufficient.
>   Each page only needs to exit to userspace once for the purpose of fetching its data
>   from the source in postcopy. It doesn't seem to need userspace to enable the exit
>   again for the page (via a new uAPI), right?

We don't necessarily need a way to go from no-fault -> fault for a
page, that's right[4]. But we do need a way for KVM to be able to
allow the access to proceed (i.e., go from fault -> no-fault). IOW, if
we get a fault and come out to userspace, we need a way to tell KVM
not to do that again. In the case of shared->private conversions, that
mechanism is toggling the memory attributes for a gfn. For
conventional userfaultfd, that's using UFFDIO_COPY/CONTINUE/POISON.
Maybe I'm misunderstanding your question.

[4]: It is helpful for poison emulation for HugeTLB-backed VMs today,
but this is not important.





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux