Hey all,

This is an overhaul of my RFC [1] for removing guest_memfd folios from
the direct map. The goal is to back non-confidential VMs with
guest_memfd and to protect their memory from a large class of
speculative execution issues [2, Table 1]. This RFC series is also the
basis of my LPC submission [3].

=== Changes since v1 ===

- Drop patches related to userspace mappings to only focus on direct
  map removal.
- Use a refcount to track temporary direct map reinsertions
  (Paolo/Elliot)
- Implement invalidation of gfn_to_pfn_caches holding gmem pfns
  (David W.)
- Do not assume folios have only a single page (Mike R.)

=== Implementation ===

This patch series extends guest_memfd to run "non-confidential" VMs
that still have their guest memory removed from the host's direct map.
"non-confidential" here means that we wish to treat the VM pretty much
the same as a VM with traditional, non-guest_memfd memslots: KVM should
be able to access non-sensitive parts of guest memory such as page
tables and MMIO instructions for MMIO emulation, or things like the
kvm-clock page, without requiring the guest's cooperation.

This patch series thus does two things: First, introduce a new
`KVM_GMEM_NO_DIRECT_MAP` flag to the `KVM_CREATE_GUEST_MEMFD` ioctl
that causes guest_memfd to remove its folios from the direct map
immediately after allocation. Then, teach key parts of KVM how to
access guest_memfd memory (if the VM type allows it) via temporary
restoration of direct map entries. The parts of KVM which we enlighten
like this are

- kvm_{read,write}_guest and friends (needed for instruction fetch
  during MMIO emulation)
- paging64_walk_addr_generic (needed to resolve GPAs during MMIO
  emulation)
- pfncache.c (needed for kvm-clock)

This is the minimal set needed to boot a non-confidential initrd from
guest_memfd (provided one finds a way to load such a thing into
guest_memfd.
My testing was done with an additional commit on top of this series
that allows unconditional userspace mappings of guest_memfd).

Instruction fetch for MMIO emulation is special here in the sense that
it cannot be solved by having the guest explicitly share memory ahead
of time, since such conversions in guest_memfd are destructive, and the
guest cannot know ahead of time which instructions will trigger MMIO
(TDX for example has a special paravirtual solution for this case). It
is thus the original motivation for the approach in this patch series.

In terms of the proposed framework for allowing both "shared" and
"private" memory inside guest_memfd (with in-place conversions
supported) [4], memory with its direct map entries removed can be
considered "private", while gmem memory with direct map entries can be
considered "shared" (I'm afraid this patch series also hasn't found
better names than the horribly overloaded "shared" and "private" for
this).

Implementing support for accessing guest_memfd in
kvm_{read,write}_guest is fairly straightforward, as the operation is a
simple "remap->access->unmap" sequence that can be done completely
while holding the folio lock. However, "long term" accesses such as
gfn_to_pfn_caches, which reinsert parts of gmem into the direct map for
long periods of time, proved to be tricky to implement, due to the need
to respond to gmem invalidation events (to, for example, avoid
modifying direct map entries after a folio has been fallocated away and
freed). This part is why this series is still an RFC, because my
confidence in getting those patches right is fairly low.
For what's implemented here, an alternative would be to just have the
guest share page tables and the kvm-clock pages ahead of time (while
keeping the changes to kvm_{read,write}_guest to handle instruction
emulation). However, I'm not sure this would work for potential future
use cases such as nested virtualization (where the L1 guest cannot know
ahead of time where the L2 will place page tables, and thus cannot mark
them as shared).

=== Security ===

We want to use unmapping guest memory from the host kernel as a
security mitigation against transient execution attacks. Temporarily
restoring direct map entries whenever KVM requires access to guest
memory leaves a gap in this mitigation. We believe this to be
acceptable for the above cases, since pages used for paravirtual
guest/host communication (e.g. kvm-clock) and guest page tables do not
contain sensitive data. MMIO emulation will only end up reading pages
containing privileged instructions (e.g. guest kernel code).

=== Summary ===

Patches 1-3 are about adding the KVM_GMEM_NO_DIRECT_MAP flag, and
providing the basic functions needed to remap guest_memfd folios on
demand. Patch 4 deals with kvm_{read,write}_guest. Patches 5 and 6 are
about adding the "sharing refcount" framework for supporting long-term
direct map restoration of gmem folios. Patches 7-9 integrate
guest_memfd with the pfncache machinery. Patch 10 teaches the guest
page table walker about guest_memfd.

The first few patches are very similar to parts of Elliot's "mm:
Introduce guest_memfd library" RFC series [5]. This series focuses on
the non-confidential use case, while Elliot's series focuses more on
userspace mappings for confidential VMs. We've had some discussions on
how to marry these two in [6].
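
To make the "sharing refcount" idea a bit more concrete, here is a
minimal userspace model in C. It is only a sketch and not the code from
the series: `struct model_folio`, `model_share()`, `model_unshare()` and
`model_read_byte()` are hypothetical names made up for illustration, the
direct map is reduced to a single boolean, and all locking and
invalidation handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of a single gmem folio. None of these
 * names come from the series; the direct map is reduced to one flag.
 */
struct model_folio {
	bool in_direct_map;      /* are direct map entries present? */
	unsigned int sharing;    /* KVM-internal users currently mapped */
	unsigned char data[16];  /* stand-in for guest memory contents */
};

/* Take a sharing reference; the first user restores the direct map. */
static void model_share(struct model_folio *f)
{
	if (f->sharing++ == 0)
		f->in_direct_map = true;
}

/* Drop a sharing reference; the last user removes the entries again. */
static void model_unshare(struct model_folio *f)
{
	assert(f->sharing > 0);
	if (--f->sharing == 0)
		f->in_direct_map = false;
}

/* Short-lived access in the "remap->access->unmap" style. */
static unsigned char model_read_byte(struct model_folio *f, size_t off)
{
	unsigned char val;

	assert(off < sizeof(f->data));
	model_share(f);
	assert(f->in_direct_map);  /* only touch data while mapped */
	val = f->data[off];
	model_unshare(f);
	return val;
}
```

In this model, a long-term user (think of a pfn cache) would call
model_share() once and hold the reference across many accesses. A
short-lived model_read_byte() nested inside that window then neither
restores nor removes the entries; only the final model_unshare() does.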

=== ToDos / Limitations ===

The main question I am looking for feedback on is whether I am on the
right path with the "sharing refcount" idea for long-term,
KVM-initiated sharing of gmem folios at all, or whether the last few
patches look so horrendous that a completely different solution is
needed. Other than that, the patches are of course still missing
selftests.

Best,
Patrick

[1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@xxxxxxxxxxxx/T/#mf6eb2d36bab802da411505f46ba154885cb207e6
[2]: https://download.vusec.net/papers/quarantine_raid23.pdf
[3]: https://lpc.events/event/18/contributions/1763/
[4]: https://lore.kernel.org/linux-mm/BN9PR11MB5276D7FAC258CFC02F75D0648CB32@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
[5]: https://lore.kernel.org/kvm/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@xxxxxxxxxxx/T/#mbcf942dcccc3726921743251d07b1a3a7e711d3f
[6]: https://lore.kernel.org/kvm/20240805-guest-memfd-lib-v1-0-e5a29a4ff5d7@xxxxxxxxxxx/T/#m785c2c1731be216fd0f6aa4c22d8b4aab146f3c1

Patrick Roy (10):
  kvm: gmem: Add option to remove gmem from direct map
  kvm: gmem: Add KVM_GMEM_GET_PFN_SHARED
  kvm: gmem: Add KVM_GMEM_GET_PFN_LOCKED
  kvm: Allow reading/writing gmem using kvm_{read,write}_guest
  kvm: gmem: Refcount internal accesses to gmem
  kvm: gmem: add tracepoints for gmem share/unshare
  kvm: pfncache: invalidate when memory attributes change
  kvm: pfncache: Support caching gmem pfns
  kvm: pfncache: hook up to gmem invalidation
  kvm: x86: support walking guest page tables in gmem

 arch/x86/kvm/mmu/mmu.c         |   2 +-
 arch/x86/kvm/mmu/paging_tmpl.h |  95 ++++++++++++---
 include/linux/kvm_host.h       |  17 ++-
 include/linux/kvm_types.h      |   2 +
 include/trace/events/kvm.h     |  43 +++++++
 include/uapi/linux/kvm.h       |   2 +
 virt/kvm/guest_memfd.c         | 216 ++++++++++++++++++++++++++++++---
 virt/kvm/kvm_main.c            |  91 ++++++++++++++
 virt/kvm/kvm_mm.h              |  12 ++
 virt/kvm/pfncache.c            | 144 ++++++++++++++++++++--
 10 files changed, 579 insertions(+), 45 deletions(-)


base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
--
2.46.0