On Wed, Aug 07, 2024, James Houghton wrote:
> On Thu, Aug 1, 2024 at 3:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > Early warning for next week's PUCK since there's actually a topic this time.
> > James is going to lead a discussion on KVM userfault[*] (name subject to change).
>
> Thanks for attending, everyone!
>
> We seemed to arrive at the following conclusions:
>
> 1. For guest_memfd, stage 2 mapping installation will never go through
> GUP / virtual addresses to do the GFN --> PFN translation, including
> when it supports non-private memory.
> 2. Something like KVM Userfault is indeed necessary to handle
> post-copy for guest_memfd VMs, especially when guest_memfd supports
> non-private memory.
> 3. We should not hook into the overall GFN --> HVA translation, we
> should only be hooking the GFN --> PFN translation steps to figure out
> how to create stage 2 mappings. That is, KVM's own accesses to guest
> memory should just go through mm/userfaultfd.
> 4. We don't need the concept of "async userfaults" (making KVM block
> when attempting to access userfault memory) in KVM Userfault.
>
> So I need to think more about what exactly the API should look like
> for controlling if a page should exit to userspace before KVM is
> allowed to map it into stage 2 and if this should apply to all of
> guest memory or only guest_memfd.
>
> It sounds like it may most likely be something like a per-VM bitmap
> that describes which pages are allowed to be mapped into stage 2,
> applying to all memory, not just guest_memfd memory. Even though it is
> solving a problem for guest_memfd specifically, it is slightly cleaner
> to have it apply to all memory.
>
> If this per-VM bitmap applies to all memory, then we don't need to
> wait for guest_memfd to support non-private memory before working on a
> full implementation. But if not, perhaps it makes sense to wait.

Per-memslot likely makes more sense.
Unlike attributes, the bitmap only needs to exist during post-copy, and
unless we do something clever, i.e. use something other than a bitmap, the
bitmap needs to be fully allocated, which would result in unnecessary overhead
if there are gaps in guest physical memory.

The other hiccup with a per-VM bitmap is that it would force us to define ABI
for things we don't care about.  E.g. what happens if the local APIC is
in-kernel and userspace marks the APIC page as USERFAULT?  Ditto for gfns
without memslots.

E.g. add a KVM_MEM_USERFAULT flag along with a userfault_bitmap user pointer
that is valid when the flag is set.  Unlike dirty logging, KVM is only a reader
of the bitmap, so I'm pretty sure we don't need a copy in KVM.

When userspace creates the VM on the target, it allocates a bitmap for each
memslot and sets KVM_MEM_USERFAULT.  When migration completes, userspace clears
KVM_MEM_USERFAULT for each memslot, and then deletes the associated bitmap.
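
To make the lifecycle concrete, here is a rough userspace-only sketch of the per-memslot bookkeeping.  Everything in it is an assumption for illustration: the struct, the helper names, and the flag value are hypothetical and are not KVM uAPI, and it assumes a set bit means "this page is still userfault".  It also only models the state; in a real flow the flag and pointer would be handed to KVM through whatever memslot ABI ends up being defined, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag value; no real bit is defined in this thread. */
#define SKETCH_KVM_MEM_USERFAULT (1u << 3)
#define SKETCH_BITS_PER_LONG     (8 * sizeof(unsigned long))

/* Userspace model of one memslot; NOT a real KVM structure. */
struct sketch_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	uint32_t flags;
	unsigned long *userfault_bitmap; /* one bit per page, valid iff flag set */
};

static size_t sketch_bitmap_longs(uint64_t npages)
{
	return (npages + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* Target, before post-copy: allocate the bitmap, mark all pages, set the flag. */
static int slot_enable_userfault(struct sketch_memslot *s)
{
	size_t bytes = sketch_bitmap_longs(s->npages) * sizeof(unsigned long);

	s->userfault_bitmap = malloc(bytes);
	if (!s->userfault_bitmap)
		return -1;
	memset(s->userfault_bitmap, 0xff, bytes);
	s->flags |= SKETCH_KVM_MEM_USERFAULT;
	return 0;
}

/* Userspace has fetched the page contents: clear its bit so it can be mapped. */
static void slot_page_fetched(struct sketch_memslot *s, uint64_t gfn)
{
	uint64_t i = gfn - s->base_gfn;

	assert(gfn >= s->base_gfn && i < s->npages);
	s->userfault_bitmap[i / SKETCH_BITS_PER_LONG] &=
		~(1UL << (i % SKETCH_BITS_PER_LONG));
}

/* Read-only check, modeling what a reader of the bitmap would do on a fault. */
static bool slot_gfn_is_userfault(const struct sketch_memslot *s, uint64_t gfn)
{
	uint64_t i;

	if (!(s->flags & SKETCH_KVM_MEM_USERFAULT))
		return false;
	if (gfn < s->base_gfn || gfn - s->base_gfn >= s->npages)
		return false;
	i = gfn - s->base_gfn;
	return (s->userfault_bitmap[i / SKETCH_BITS_PER_LONG] >>
		(i % SKETCH_BITS_PER_LONG)) & 1UL;
}

/* Migration complete: clear the flag first, then delete the bitmap. */
static void slot_disable_userfault(struct sketch_memslot *s)
{
	s->flags &= ~SKETCH_KVM_MEM_USERFAULT;
	free(s->userfault_bitmap);
	s->userfault_bitmap = NULL;
}
```

Because each bitmap is sized from its own memslot's npages, gaps in guest physical memory cost nothing, and gfns without a memslot never hit the bitmap at all.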