Hey all,

This RFC series is a rough draft adding support for running
non-confidential compute VMs in guest_memfd, based on prior discussions
with Sean [1]. Our specific use case for this is the ability to unmap
guest memory from the host kernel's direct map, as a mitigation against
a large class of speculative execution issues.

=== Implementation ===

This patch series introduces a new flag for the `KVM_CREATE_GUEST_MEMFD`
ioctl that removes the guest_memfd's pages from the direct map when they
are allocated. When trying to run a guest from such a guest_memfd, we
now face the problem that without either userspace or kernelspace
mappings of guest_memfd, KVM cannot access guest memory to, for example,
do MMIO emulation or access memory used for guest/host communication.
We have multiple options for solving this when running non-CoCo VMs:

(1) implement a TDX-light solution, where the guest shares memory that
    KVM needs to access, and relies on paravirtual solutions where this
    is not possible (e.g. MMIO),
(2) have KVM use userspace mappings of guest_memfd (e.g. a
    memfd_secret-style solution), or
(3) dynamically reinsert pages into the direct map whenever KVM wants
    to access them.

This RFC goes for option (3). Option (1) is a lot of overhead for very
little gain, since we are not actually constrained by a physical
inability to access guest memory (e.g. we are not in a TDX context where
accesses to guest memory cause a #MC). Option (2) has previously been
rejected [1].

In this patch series, we make sufficient parts of KVM gmem-aware to be
able to boot a Linux initrd from private memory on x86. These include
KVM's MMIO emulation (including guest page table walking) and kvm-clock.
For VM types which do not allow accessing gmem, we return -EFAULT and
attempt to prepare a KVM_EXIT_MEMORY_FAULT.

Additionally, this patch series adds support for "restricted" userspace
mappings of guest_memfd, which work similarly to memfd_secret (e.g.
disallow get_user_pages), which allows handling I/O and loading the
guest kernel in a simple way. Support for this is completely independent
of the rest of the functionality introduced in this patch series.
However, it is required to build a minimal hypervisor PoC that actually
allows booting a VM from a disk.

=== Performance ===

We have run some preliminary performance benchmarks to assess the impact
of on-the-fly direct map manipulations. We were mainly interested in the
impact of manipulating the direct map for MMIO emulation on virtio-mmio.
In particular, we were worried about the impact of the TLB and L1/L2/L3
cache flushes that set_memory_[n]p entails.

In our setup, we took a modified Firecracker VMM, spawned a Linux guest
with 1 vCPU, and used fio to stress a virtio_blk device. We found that
the cache flushes caused throughput to drop from around 600MB/s to
~50MB/s (a drop of ~90%) for both reads and writes (on an Intel(R)
Xeon(R) Platinum 8375C CPU with 64 cores). We then converted our
prototype to use set_direct_map_{invalid,default}_noflush instead of
set_memory_[n]p and found that, without cache flushes, the pure impact
of the direct map manipulation is indistinguishable from noise. This is
why we use set_direct_map_{invalid,default}_noflush instead of
set_memory_[n]p in this RFC.

Note that in this comparison, both the baseline and the
guest_memfd-supporting version of Firecracker were made to bounce I/O
buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
the virtio stack cannot directly pass guest buffers to read/write
syscalls.

=== Security ===

We want to use unmapping guest memory from the host kernel as a security
mitigation against transient execution attacks. Temporarily restoring
direct map entries whenever KVM requires access to guest memory leaves a
gap in this mitigation. We believe this to be acceptable for the above
cases, since pages used for paravirtual guest/host communication (e.g.
kvm-clock) and guest page tables do not contain sensitive data. MMIO
emulation will only end up reading pages containing privileged
instructions (e.g. guest kernel code).

=== Summary ===

Patches 1-4 are about patching various points inside of KVM that access
guest memory to correctly handle the case where memory happens to be
guest-private. This means either handling the access as a memory error,
or, if the VM type allows this kind of access, simply accessing the
memslot's guest_memfd instead of looking at the userspace-provided VMA.

Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will make it
remove its pages from the kernel's direct map. Whenever KVM wants to
access guest-private memory, it will temporarily re-insert the relevant
pages.

Patches 7-8 allow for restricted userspace mappings of guest_memfd (e.g.
get_user_pages paths are disabled, like for memfd_secret), so that
userspace has an easy path for loading the guest kernel and handling
I/O buffers.

=== ToDos / Limitations ===

There are still a few rough edges that need to be addressed before
dropping the "RFC" tag, e.g.

* Handle errors of set_direct_map_default_noflush in
  kvm_gmem_invalidate_folio instead of calling BUG_ON
* Lift the limitation of "at most one gfn_to_pfn_cache for each gfn/pfn"
  in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map entries
  when needed"). It currently means that guests with more than 1 vcpu
  fail to boot, because multiple vcpus can put their kvm-clock PV
  structures into the same page (gfn)
* Write selftests, particularly around hole punching, direct map
  removal, and mmap.

Lastly, there's the question of nested virtualization which Sean brought
up in previous discussions, which runs into similar problems as MMIO. I
have looked at it very briefly. On Intel, KVM uses various gfn->uhva
caches, which run into similar problems as the gfn_to_hva_caches dealt
with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
is private").
However, previous attempts at just converting these to gfn_to_pfn_cache
(which would make them work with guest_memfd) proved complicated [2]. I
suppose initially, we should probably disallow nested virtualization in
VMs that have their memory removed from the direct map.

Best,
Patrick

[1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@xxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@xxxxxxxxxx/

Patrick Roy (8):
  kvm: Allow reading/writing gmem using kvm_{read,write}_guest
  kvm: use slowpath in gfn_to_hva_cache if memory is private
  kvm: pfncache: enlighten about gmem
  kvm: x86: support walking guest page tables in gmem
  kvm: gmem: add option to remove guest private memory from direct map
  kvm: gmem: Temporarily restore direct map entries when needed
  mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
  kvm: gmem: Allow restricted userspace mappings

 arch/x86/kvm/mmu/paging_tmpl.h |  94 +++++++++++++++++++-----
 include/linux/kvm_host.h       |   5 ++
 include/linux/kvm_types.h      |   1 +
 include/linux/secretmem.h      |  13 +++-
 include/uapi/linux/kvm.h       |   2 +
 mm/secretmem.c                 |   6 +-
 virt/kvm/guest_memfd.c         |  83 +++++++++++++++++++--
 virt/kvm/kvm_main.c            | 112 +++++++++++++++++++++++++++-
 virt/kvm/pfncache.c            | 130 +++++++++++++++++++++++++++++----
 9 files changed, 399 insertions(+), 47 deletions(-)

base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84
-- 
2.45.2