Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 7 Aug 2024 17:17:45 -0700

On Wed, Aug 07, 2024, James Houghton wrote:
> On Thu, Aug 1, 2024 at 3:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > Early warning for next week's PUCK since there's actually a topic this time.
> > James is going to lead a discussion on KVM userfault[*] (name subject to change).
> 
> Thanks for attending, everyone!
> 
> We seemed to arrive at the following conclusions:
> 
> 1. For guest_memfd, stage 2 mapping installation will never go through
> GUP / virtual addresses to do the GFN --> PFN translation, including
> when it supports non-private memory.
> 2. Something like KVM Userfault is indeed necessary to handle
> post-copy for guest_memfd VMs, especially when guest_memfd supports
> non-private memory.
> 3. We should not hook into the overall GFN --> HVA translation, we
> should only be hooking the GFN --> PFN translation steps to figure out
> how to create stage 2 mappings. That is, KVM's own accesses to guest
> memory should just go through mm/userfaultfd.
> 4. We don't need the concept of "async userfaults" (making KVM block
> when attempting to access userfault memory) in KVM Userfault.
> 
> So I need to think more about what exactly the API should look like
> for controlling if a page should exit to userspace before KVM is
> allowed to map it into stage 2 and if this should apply to all of
> guest memory or only guest_memfd.
> 
> It sounds like it may most likely be something like a per-VM bitmap
> that describes which pages are allowed to be mapped into stage 2,
> applying to all memory, not just guest_memfd memory. Even though it is
> solving a problem for guest_memfd specifically, it is slightly cleaner
> to have it apply to all memory.
> 
> If this per-VM bitmap applies to all memory, then we don't need to
> wait for guest_memfd to support non-private memory before working on a
> full implementation. But if not, perhaps it makes sense to wait.

A per-memslot bitmap likely makes more sense.  Unlike memory attributes, the
bitmap only needs to exist during post-copy, and unless we do something clever,
i.e. use something other than a bitmap, the bitmap needs to be fully allocated,
which would result in unnecessary overhead if there are gaps in guest physical
memory.
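
To put rough numbers on the overhead, here is a small sketch that compares the
two allocation schemes.  It assumes one bit per 4 KiB page and a made-up guest
layout (2 GiB at GPA 0 plus 2 GiB starting at 1 TiB); the granularity, the
layout, and all ex_* names are illustrative assumptions, not anything that was
decided in the discussion.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumption: 4 KiB pages, one bit per page */

/* Bytes needed for a bitmap with one bit per page, rounded up to whole bytes. */
static uint64_t ex_bitmap_bytes(uint64_t npages)
{
	return (npages + 7) / 8;
}

/* Per-VM: a single bitmap spanning GPA 0 up to the end of the highest region. */
static uint64_t ex_per_vm_bytes(uint64_t highest_gpa_end)
{
	return ex_bitmap_bytes(highest_gpa_end >> EX_PAGE_SHIFT);
}

/* Per-memslot: one bitmap per slot, covering only that slot's pages. */
static uint64_t ex_per_slot_bytes(const uint64_t *slot_sizes, int nr_slots)
{
	uint64_t total = 0;

	for (int i = 0; i < nr_slots; i++)
		total += ex_bitmap_bytes(slot_sizes[i] >> EX_PAGE_SHIFT);
	return total;
}
```

For that hypothetical layout, the per-VM bitmap comes to 32 MiB + 64 KiB,
versus 128 KiB in total for the two per-memslot bitmaps.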

The other hiccup with a per-VM bitmap is that it would force us to define ABI
for things we don't care about.  E.g. what happens if the local APIC is in-kernel
and userspace marks the APIC page as USERFAULT?  Ditto for gfns without memslots.

E.g. add a KVM_MEM_USERFAULT flag along with a userfault_bitmap user pointer
that is valid when the flag is set.  Unlike dirty logging, KVM is only a reader
of the bitmap, so I'm pretty sure we don't need a copy in KVM.
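
As a rough illustration of the reader side, the sketch below models a memslot
with the proposed flag and bitmap pointer and checks one gfn against it.  The
struct, the flag's bit position, the "set bit means exit to userspace" polarity,
and the ex_* names are all assumptions made for the example; none of this is
existing KVM uAPI, and real kernel code would have to read the user pointer via
the usual user-access helpers rather than dereferencing it directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value, chosen only for this example. */
#define EX_KVM_MEM_USERFAULT	(1u << 3)

/* Hypothetical, stripped-down view of a memslot. */
struct ex_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	uint32_t flags;
	/* Owned and written by userspace; only ever read here. */
	const unsigned long *userfault_bitmap;
};

/*
 * Return true if an attempt to map @gfn into stage 2 should exit to
 * userspace instead.  The caller must ensure @gfn lies within @slot.
 */
static bool ex_gfn_is_userfault(const struct ex_memslot *slot, uint64_t gfn)
{
	const uint64_t bits_per_long = 8 * sizeof(unsigned long);
	uint64_t idx;

	/* Without the flag the bitmap pointer is not valid; never consult it. */
	if (!(slot->flags & EX_KVM_MEM_USERFAULT))
		return false;

	idx = gfn - slot->base_gfn;
	return (slot->userfault_bitmap[idx / bits_per_long] >>
		(idx % bits_per_long)) & 1;
}
```

Because the check only reads the bitmap, no private copy is needed in this
model, which is the contrast with dirty logging drawn above.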

When userspace creates the VM on the target, it allocates a bitmap for each
memslot and sets KVM_MEM_USERFAULT.  When migration completes, userspace clears
KVM_MEM_USERFAULT for each memslot, and then deletes the associated bitmap.
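
The userspace lifecycle described above might look roughly like the following
sketch.  It only models the bookkeeping: the struct and the ex2_* helpers are
hypothetical, the flag value is made up, and in a real VMM setting or clearing
the flag would presumably go through a memslot update rather than a plain
field write.  The "all bits set at start" convention is likewise an assumption.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EX2_KVM_MEM_USERFAULT	(1u << 3)	/* hypothetical flag value */
#define EX2_BITS_PER_LONG	(8 * sizeof(unsigned long))

/* Hypothetical per-memslot state tracked by the VMM on the target. */
struct ex2_slot {
	uint64_t npages;
	uint32_t flags;
	unsigned long *userfault_bitmap;
};

/* Target-side setup: every page starts out as "not yet copied". */
static int ex2_slot_start_postcopy(struct ex2_slot *slot)
{
	size_t nlongs = (slot->npages + EX2_BITS_PER_LONG - 1) / EX2_BITS_PER_LONG;
	size_t bytes = nlongs * sizeof(unsigned long);

	slot->userfault_bitmap = malloc(bytes);
	if (!slot->userfault_bitmap)
		return -1;
	memset(slot->userfault_bitmap, 0xff, bytes);
	slot->flags |= EX2_KVM_MEM_USERFAULT;
	return 0;
}

/*
 * A page's contents have arrived: allow it to be mapped.  A real VMM with
 * running vCPUs would likely need an atomic bit clear here.
 */
static void ex2_slot_page_arrived(struct ex2_slot *slot, uint64_t idx)
{
	slot->userfault_bitmap[idx / EX2_BITS_PER_LONG] &=
		~(1ul << (idx % EX2_BITS_PER_LONG));
}

/* Migration complete: clear the flag first, then free the bitmap. */
static void ex2_slot_finish_postcopy(struct ex2_slot *slot)
{
	slot->flags &= ~EX2_KVM_MEM_USERFAULT;
	free(slot->userfault_bitmap);
	slot->userfault_bitmap = NULL;
}
```

Clearing the flag before freeing matters in this model because the bitmap
pointer is only valid while the flag is set.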