Hi Paolo,

I'd like your feedback on whether you would merge a KVM-specific
alternative to UserfaultFD. Within Google we have a feature called
"KVM Demand Paging" that we have been using for post-copy live
migration since 2014 and, more recently, for memory poisoning
emulation.

The high-level design is:

 (a) A bitmap that tracks which GFNs are present, along with a UAPI to
     enable/disable the present bitmap.
 (b) UAPIs for marking GFNs present and non-present.
 (c) KVM_RUN support for returning to userspace on guest page faults
     to non-present GFNs.
 (d) A notification mechanism and wait queue to coordinate KVM
     accesses to non-present GFNs.
 (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
     memory becomes present.

(A rough sketch of what a UAPI for (a)-(c) could look like is appended
at the end of this mail.)

The actual implementation within Google has a lot of warts that I
won't get into... but I think we could have a pretty clean upstream
solution. In fact, a lot of the infrastructure needed to support this
design is already in-flight upstream, e.g. (a) and (b) could be built
on top of the new memory attributes (although I have concerns about
the performance of using an xarray vs. bitmaps), and (c) can be built
on top of memory-fault exiting.

The most complex piece of new code would be the notification mechanism
for (d). Within Google we've been using a netlink socket, but I think
we should use a custom file descriptor instead.

If we do it right, almost no architecture-specific support is needed:
just a small bit in the page fault path (for (c), and to account for
the present bitmap when determining what (huge)page size to map).

The most painful part of carrying KVM Demand Paging out-of-tree has
been maintaining the hooks for (d). But this has been mostly
self-inflicted. We started out by manually annotating all of the code
where KVM reads/writes guest memory. But there are core routines that
all guest-memory accesses go through (e.g. __gfn_to_hva_many()) where
we could put a single hook, and then KVM just has to make sure to
invalidate any gfn-to-hva/pfn caches and SPTEs when a page becomes
non-present (which is rare and typically only happens before a vCPU
starts running). (A sketch of what such a hook could look like is also
appended at the end of this mail.) And hooking KVM accesses to guest
memory isn't exactly new: KVM already manually tracks all writes to
keep the dirty log up to date.

So why merge a KVM-specific alternative to UserfaultFD? Taking a step
back, let's look at what UserfaultFD is actually providing for KVM
VMs:

 1. Coordination of userspace accesses to guest memory.
 2. Coordination of KVM+guest accesses to guest memory.

(1) technically does not need kernel support. It's possible to solve
this problem in userspace, and doing so is likely more efficient
because userspace has more flexibility and can avoid bouncing through
the kernel page fault handler. And it's not unreasonable to expect
VMMs to support this: VMMs already need to manually intercept
userspace _writes_ to guest memory to implement dirty tracking
efficiently. It's a small step beyond that to intercept both reads and
writes for post-copy. Moreover, VMMs are increasingly multi-process.
UserfaultFD only provides coordination within a process, so VMMs
already need to deal with coordinating across processes on their own,
i.e. UserfaultFD is only solving part of the problem for (1).

The KVM-specific approach is basically to provide kernel support for
(2) and let userspace solve (1) however it likes. But if UserfaultFD
solves (1) and (2), why introduce a KVM feature that solves only (2)?
Well, UserfaultFD has some very real downsides:

 * Lack of sufficient HugeTLB support: The most recent and glaring
   problem is upstream's NACK of HugeTLB High Granularity Mapping [1].
   Without HGM, UserfaultFD can only handle HugeTLB faults at huge
   page granularity, i.e. if a VM is backed with 1GiB HugeTLB pages,
   then UserfaultFD can only handle 1GiB faults. Demand-fetching 1GiB
   of memory from a remote host during the post-copy phase of live
   migration is untenable. Even 2MiB fetches are painful with most
   current NICs. In effect, there is no line of sight to an upstream
   solution for post-copy live migration of VMs backed with HugeTLB.

 * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
   guest memory for the userspace page table entries (roughly 2GiB of
   page tables per 1TiB of guest memory at 4KiB granularity).

 * CPU Overhead: UserfaultFD has to manipulate userspace page tables
   to split mappings down to PAGE_SIZE, handle PAGE_SIZE'd faults,
   and, later, collapse mappings back into huge pages. These
   manipulations take locks such as mmap_lock, page locks, and page
   table locks.

 * Complexity: UserfaultFD-based demand paging depends on
   functionality across multiple subsystems in the kernel, including
   core MM, KVM, and each of the memory filesystems (tmpfs, HugeTLB,
   and eventually guest_memfd). Debugging problems requires knowledge
   across many domains that many engineers do not have. And solving
   problems requires getting buy-in from multiple subsystem
   maintainers that may not all be aligned (see: HGM).

All of these are addressed with a KVM-specific solution, which can
have:

 * Transparent support for any backing memory subsystem (tmpfs,
   HugeTLB, and even guest_memfd).
 * Only 1 bit of overhead per page of guest memory (~32MiB per 1TiB of
   guest memory at 4KiB granularity).
 * No need to modify host page tables.
 * All code contained within KVM.
 * Significantly fewer LOC than UserfaultFD.

Ok, that's the pitch. What are your thoughts?

[1] https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@xxxxxxxxxx/
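
Appendix (referenced above): a rough userspace-side sketch of what the
UAPI for (a)-(c) could look like. To be clear, this is only to
illustrate the shape of the interface: the ioctl names and numbers,
the range struct, and the exit handling below are all invented for
this sketch, and an upstream version would more likely be built on
memory attributes and memory-fault exiting as mentioned above.

/*
 * Illustrative only: none of these ioctls or structs exist; they just
 * sketch the shape of (a), (b), and (c) from the list above.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* (a) Enable the present bitmap for the VM (hypothetical ioctl). */
#define KVM_ENABLE_DEMAND_PAGING	_IO(KVMIO, 0xf0)

/* (b) Mark a range of GFNs present or non-present (hypothetical). */
struct kvm_demand_paging_range {
	uint64_t start_gfn;
	uint64_t nr_pages;
};
#define KVM_MARK_GFNS_PRESENT		_IOW(KVMIO, 0xf1, struct kvm_demand_paging_range)
#define KVM_MARK_GFNS_NONPRESENT	_IOW(KVMIO, 0xf2, struct kvm_demand_paging_range)

/* VMM-specific: fetch the contents of @gfn, e.g. over the network from
 * the source host, and copy them into guest memory. */
void fetch_page_from_source(uint64_t gfn)
{
	(void)gfn;
}

/*
 * (c) A guest access to a non-present GFN returns from KVM_RUN to
 * userspace with an exit that carries the faulting GFN. This handler
 * assumes the GFN has already been pulled out of that (hypothetical)
 * exit payload.
 */
void handle_demand_paging_exit(int vm_fd, uint64_t gfn)
{
	struct kvm_demand_paging_range range = {
		.start_gfn = gfn,
		.nr_pages  = 1,
	};

	fetch_page_from_source(gfn);

	/* Mark the GFN present; this is also what would wake any vCPU
	 * or KVM thread blocked on the (d) wait queue for this GFN. */
	ioctl(vm_fd, KVM_MARK_GFNS_PRESENT, &range);
}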
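
And an equally rough kernel-side sketch of the single hook plus wait
queue for (d), the sort of thing that could sit in a core translation
path like __gfn_to_hva_many(). The struct, the notification helper,
and the flat GFN-indexed bitmap are all invented for illustration (a
real implementation would presumably hang this state off struct kvm
and index the bitmap relative to the memslot); only the pattern of
checking the bitmap, notifying userspace, and waiting is the point.

#include <linux/kvm_host.h>
#include <linux/wait.h>
#include <linux/bitops.h>

/* Hypothetical per-VM demand paging state; in a real implementation
 * this would live in struct kvm (or per memslot). */
struct kvm_demand_paging {
	bool			enabled;
	unsigned long		*present_bitmap;	/* 1 bit per GFN */
	wait_queue_head_t	wq;			/* the (d) wait queue */
};

/* Hypothetical helper: poke the demand paging file descriptor so that
 * userspace knows which GFN it needs to fetch and mark present. */
void kvm_notify_demand_fault(struct kvm_demand_paging *dp, gfn_t gfn);

/*
 * Called from a core guest-memory access path so that all KVM-internal
 * reads/writes of guest memory funnel through one check. Returns 0
 * once the GFN is present, or -ERESTARTSYS if interrupted by a signal.
 */
int kvm_await_gfn_present(struct kvm_demand_paging *dp, gfn_t gfn)
{
	if (!READ_ONCE(dp->enabled))
		return 0;

	if (test_bit(gfn, dp->present_bitmap))
		return 0;

	kvm_notify_demand_fault(dp, gfn);

	/* Woken by the "mark GFNs present" UAPI once it sets the bit in
	 * the present bitmap. */
	return wait_event_interruptible(dp->wq,
					test_bit(gfn, dp->present_bitmap));
}

In this sketch the notification would go out over the custom file
descriptor mentioned above (rather than the netlink socket we use
internally), and the wait queue is what lets KVM's own accesses to
guest memory block until userspace marks the page present.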