This patch series implements the KVM-based demand paging system that was
first introduced back in November[1] by David Matlack.

The working name for this new system is KVM Userfault, but that name is
very confusing so it will not be the final name.

Problem: post-copy with guest_memfd
===================================

Post-copy live migration makes it possible to migrate VMs from one host
to another no matter how fast they are writing to memory while keeping
the VM paused for a minimal amount of time. For post-copy to work, we
need:
 1. a way to prevent KVM from accessing particular pages of guest
    memory until we have populated them,
 2. a way for userspace to know when KVM is trying to access a
    particular page, and
 3. a way to allow the access to proceed.

Traditionally, post-copy live migration is implemented using
userfaultfd, which hooks into the main mm fault path. KVM hits this path
when it is doing HVA -> PFN translations (with GUP) or when it itself
attempts to access guest memory. Userfaultfd sends a page fault
notification to userspace, and KVM goes to sleep.

Userfaultfd works well, as it is not specific to KVM; everyone who
attempts to access guest memory will block the same way.

However, with guest_memfd, we do not use GUP to translate from GFN to
HPA (nor is there an intermediate HVA). So userfaultfd in its current
form cannot be used to support post-copy live migration with
guest_memfd-backed VMs.

Solution: hook into the gfn -> pfn translation
==============================================

The only way to implement post-copy with a non-KVM-specific
userfaultfd-like system would be to introduce the concept of a
file-userfault[2] to intercept faults on a guest_memfd.

Instead, we take the simpler approach of adding a KVM-specific API, and
we hook into the GFN -> HVA or GFN -> PFN translation steps (for
traditional memslots and for guest_memfd respectively).
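
As a rough illustration of what hooking the translation step means, the
toy model below shows the three post-copy requirements as a check at the
point where a GFN is turned into a PFN. It is not code from this series:
every name in it (toy_memslot, toy_gfn_to_pfn, toy_resolve,
TOY_PFN_ERR_USERFAULT) is hypothetical, and a flat per-page bool array
stands in for whatever KVM ends up using to track userfault pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names exist in KVM. */
#define TOY_NPAGES            8u
#define TOY_PFN_ERR_USERFAULT UINT64_MAX /* stand-in for an error PFN */

struct toy_memslot {
	uint64_t base_pfn;          /* PFN backing gfn 0 of the slot */
	bool userfault[TOY_NPAGES]; /* true: page not yet populated */
};

/*
 * Requirement 1: refuse to translate a userfault-enabled gfn.
 * Requirement 2: the error value lets the caller notify userspace.
 */
static uint64_t toy_gfn_to_pfn(const struct toy_memslot *slot, uint64_t gfn)
{
	assert(gfn < TOY_NPAGES);
	if (slot->userfault[gfn])
		return TOY_PFN_ERR_USERFAULT;
	return slot->base_pfn + gfn;
}

/* Requirement 3: userspace populated the page, so let accesses proceed. */
static void toy_resolve(struct toy_memslot *slot, uint64_t gfn)
{
	assert(gfn < TOY_NPAGES);
	slot->userfault[gfn] = false;
}
```

Because the check sits in the translation itself, it does not matter
whether the page would have been reached through an HVA or through
guest_memfd; both paths see the same refusal until the page is resolved.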

I have intentionally added support for traditional memslots, as the
complexity that it adds is minimal, and it is useful for some VMMs, as
it can be used to fully implement post-copy live migration.

Implementation Details
======================

Let's break down how KVM implements each of the three core requirements
for implementing post-copy as laid out above:

--- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---

The most straightforward way to inform KVM of userfault-enabled pages is
to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.

There is already infrastructure in place for modifying and checking
memory attributes. Using this interface is slightly challenging, as
there is no UAPI for setting/clearing particular attributes; we must set
the exact attributes we want.

The synchronization that is in place for updating memory attributes is
not suitable for post-copy live migration either, which will require
updating memory attributes (from userfault to no-userfault) very
frequently.

Another potential interface could be to use something akin to a dirty
bitmap, where a bitmap describes which pages within a memslot (or VM)
should trigger userfaults. This way, it is straightforward to make
updates to the userfault status of a page cheap.

When KVM Userfault is enabled, we need to be careful not to map a
userfault page in response to a fault on a non-userfault page. In this
RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.

--- Page fault notifications ---

For page faults generated by vCPUs running in guest mode, if the page
the vCPU is trying to access is a userfault-enabled page, we use
KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.

For arm64, I believe this is actually all we need, provided we handle
steal_time properly.
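
To make the vCPU exit path above more concrete, here is a minimal sketch
of how a VMM might react to such an exit. It is a self-contained toy
rather than real KVM UAPI usage: struct toy_exit only imitates the
flags/gpa/size fields reported with a memory fault exit, the
TOY_MEMORY_FLAG_USERFAULT bit value is made up, and "fetching" a page is
reduced to incrementing a counter. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real exit reason and flag value differ. */
#define TOY_EXIT_MEMORY_FAULT     1u
#define TOY_MEMORY_FLAG_USERFAULT (1ull << 0)
#define TOY_PAGE_SIZE             4096ull
#define TOY_NPAGES                16u

struct toy_exit {
	uint32_t reason;
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

struct toy_vmm {
	bool userfault[TOY_NPAGES]; /* pages not yet fetched from the source */
	unsigned int fetched;       /* number of pages fetched so far */
};

/*
 * Returns true if the exit was a userfault that has now been resolved,
 * meaning the vCPU can simply be re-entered.
 */
static bool toy_handle_exit(struct toy_vmm *vmm, const struct toy_exit *ex)
{
	if (ex->reason != TOY_EXIT_MEMORY_FAULT ||
	    !(ex->flags & TOY_MEMORY_FLAG_USERFAULT))
		return false;

	assert(ex->size > 0);
	uint64_t first = ex->gpa / TOY_PAGE_SIZE;
	uint64_t last = (ex->gpa + ex->size - 1) / TOY_PAGE_SIZE;
	assert(last < TOY_NPAGES);

	for (uint64_t gfn = first; gfn <= last; gfn++) {
		if (!vmm->userfault[gfn])
			continue;            /* already fetched earlier */
		vmm->fetched++;              /* placeholder: fetch page contents */
		vmm->userfault[gfn] = false; /* placeholder: clear userfault */
	}
	return true;
}
```

The handler is idempotent on purpose: if two vCPUs fault on the same
page, the second exit finds the page already resolved and just resumes.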

For x86, where returning from deep within the instruction emulator (or
other non-trivial execution paths) is infeasible, being able to pause
execution while userspace fetches the page, just as userfaultfd would
do, is necessary. Let's call these "asynchronous userfaults."

A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
userfaults, and an eventfd is used to signal that new faults are
available for reading.

Today, we busy-wait for a gfn to have userfault disabled. This will
change in the future.

--- Fault resolution ---

Resolving userfaults today is as simple as removing the USERFAULT memory
attribute on the faulting gfn. This will change if we do not end up
using memory attributes for KVM Userfault. Having a range-based wake-up
like userfaultfd (see UFFDIO_WAKE) might also be helpful for
performance.

Problems with this series
=========================
- This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
- Memory attribute modification doesn't scale well.
- We busy-wait for pages to not be userfault-enabled.
- gfn_to_hva and gfn_to_pfn caches are not invalidated.
- Page tables are not collapsed when KVM Userfault is disabled.
- There is no self-test for asynchronous userfaults.
- Asynchronous page faults can be dropped if KVM_READ_USERFAULT fails.
- Supports only x86 and arm64.
- Probably many more!

Thanks!

[1]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@xxxxxxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/CADrL8HVwBjLpWDM9i9Co1puFWmJshZOKVu727fMPJUAbD+XX5g@xxxxxxxxxxxxxx/

James Houghton (18):
  KVM: Add KVM_USERFAULT build option
  KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT
  KVM: Put struct kvm pointer in memslot
  KVM: Fail __gfn_to_hva_many for userfault gfns.
  KVM: Add KVM_PFN_ERR_USERFAULT
  KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
  KVM: Provide attributes to kvm_arch_pre_set_memory_attributes
  KVM: x86: Add KVM Userfault support
  KVM: x86: Add vCPU fault fast-path for Userfault
  KVM: arm64: Add KVM Userfault support
  KVM: arm64: Add vCPU memory fault fast-path for Userfault
  KVM: arm64: Add userfault support for steal-time
  KVM: Add atomic parameter to __gfn_to_hva_many
  KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
  KVM: guest_memfd: Add KVM Userfault support
  KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
  KVM: selftests: Add KVM Userfault mode to demand_paging_test
  KVM: selftests: Remove restriction in vm_set_memory_attributes

 Documentation/virt/kvm/api.rst                |  23 ++
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   8 +-
 arch/arm64/kvm/mmu.c                          |  45 +++-
 arch/arm64/kvm/pvtime.c                       |  11 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  67 +++++-
 arch/x86/kvm/mmu/mmu_internal.h               |   3 +-
 include/linux/kvm_host.h                      |  41 +++-
 include/uapi/linux/kvm.h                      |  13 ++
 .../selftests/kvm/demand_paging_test.c        |  46 +++-
 .../testing/selftests/kvm/include/kvm_util.h  |   7 -
 virt/kvm/Kconfig                              |   4 +
 virt/kvm/guest_memfd.c                        |  16 +-
 virt/kvm/kvm_main.c                           | 213 +++++++++++++++++-
 16 files changed, 457 insertions(+), 44 deletions(-)

base-commit: 02b0d3b9d4dd1ef76b3e8c63175f1ae9ff392313
-- 
2.45.2.993.g49e7a77208-goog