Hi all,

On 3/9/24 11:14, Mike Rapoport wrote:
>>> With this in mind, what’s the best way to solve getting guest RAM out of
>>> the direct map? Is memfd_secret integration with KVM the way to go, or
>>> should we build a solution on top of guest_memfd, for example via some
>>> flag that causes it to leave memory in the host userspace’s page tables,
>>> but removes it from the direct map?
>>
>> memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
>> sure you'll be fighting memfd_secret all the way. E.g. it's not dumpable, it
>> deliberately allocates at 4KiB granularity (though I suspect the bug you found
>> means that it can be inadvertently mapped with 2MiB hugepages), it has no line
>> of sight to taking userspace out of the equation, etc.
>>
>> With guest_memfd on the other hand, everyone contributing to and maintaining it
>> has goals that are *very* closely aligned with what you want to do.
>
> I agree with Sean, guest_memfd seems a better interface to use. It's
> integrated by design with KVM and removing guest memory from the direct map
> looks like a natural enhancement to guest_memfd.
>
> Unless I'm missing something, for fast-and-dirty POC it'll be a oneliner
> that adds set_memory_np() to kvm_gmem_get_folio() and then figuring out
> what to do with virtio :)

We’ve been playing around with extending guest_memfd to remove guest memory
from the direct map. The direct map removal itself is indeed fairly
straightforward (a rough sketch of what it looks like is below): since
guest_memfd cannot be mapped into userspace, we don’t need to worry about
folios without direct map entries ending up in places where they would cause
kernel panics.

However, we ran into problems running non-CoCo VMs with guest_memfd as guest
memory, independent of whether direct map entries are available or not.
There’s a handful of places where a traditional KVM / userspace setup
currently touches guest memory:

* Loading the guest kernel into guest-owned memory
* Instruction fetch from arbitrary guest addresses and guest page table
  walks for MMIO emulation (for example for IOAPIC accesses)
* kvm-clock
* I/O devices

With guest_memfd, if the guest is running from guest-private memory, all of
these need to be rethought, since the memory is now unavailable to userspace,
and KVM is not enlightened about guest_memfd’s existence everywhere (when I
was experimenting with this, KVM generally read garbage data from the shared
VMA, but I think I’ve since seen some patches floating around that would make
it return -EFAULT instead).

CoCo VMs have various methods for working around these: the guest kernel is
loaded using some “populate on first access” mechanism [1], kvm-clock and I/O
are solved by having the guest mark the relevant address ranges as “shared”
ahead of time [2] and bounce buffering via swiotlb [4], and Intel TDX solves
the instruction emulation problem for MMIO by injecting a #VE and having the
guest do the emulation itself [3].

For non-CoCo VMs, where memory is not encrypted and the threat model assumes
a trusted host userspace, we would like to avoid changing the VM model so
completely. If we adopted CoCo’s approaches for all the places where KVM /
userspace touches guest memory, we would get all of the complexity, yet none
of the encryption. Particularly the complexity on the MMIO path seems nasty,
but since x86 does not pre-decode instructions on MMIO exits (which are just
EPT_VIOLATIONs) the way it does for PIO exits, I don’t really see a way
around it in the guest_memfd model.
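To make the direct map removal part concrete: below is a minimal sketch of
Mike’s suggested one-liner, dropped into a simplified version of
kvm_gmem_get_folio() from virt/kvm/guest_memfd.c. This is illustrative only,
not the actual patch; the real function also handles the uptodate/zeroing
logic, which I’m eliding here.

	/*
	 * Sketch only: kvm_gmem_get_folio() with Mike's set_memory_np()
	 * one-liner added. set_memory_np() comes from <asm/set_memory.h>
	 * on x86.
	 */
	static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
	{
		struct folio *folio;

		folio = filemap_grab_folio(inode->i_mapping, index);
		if (IS_ERR_OR_NULL(folio))
			return NULL;

		/*
		 * Zap the folio's direct map entries; any kernel access
		 * through the direct map alias will fault from here on.
		 */
		set_memory_np((unsigned long)folio_address(folio),
			      folio_nr_pages(folio));

		return folio;
	}

The removal really is that simple; the hard part is everything around it,
i.e. all the places listed above that can no longer reach the memory.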
We’ve played around a lot with allowing userspace mappings of guest_memfd,
and then having KVM internally access guest_memfd via userspace page tables
(and came up with multiple hacky ways to boot simple Linux initrds from
guest_memfd), but this is fairly awkward for two reasons:

1. Lots of code paths in KVM now end up accessing guest_memfd, which from my
   understanding goes against the guest_memfd goal of making machine checks
   caused by incorrect accesses to TDX memory impossible, and
2. We need to somehow get a userspace mapping of guest_memfd into KVM. A
   hacky way I could make this work was setting up
   kvm_userspace_memory_region2 with userspace_addr pointing at a mmap of the
   guest memory (a sketch is at the end of this mail), which actually "works"
   for everything but kvm-clock, but I later realized that this is just
   memfd_secret with extra steps.

We also played around with having KVM access guest_memfd through the direct
map (by temporarily reinserting pages into it when needed), but this again
means lots of KVM code learns how to access guest RAM via guest_memfd.

There are a few other features we need to support, such as serving page
faults using UFFD, which we are not too sure how to realize with guest_memfd,
since UFFD is VMA-based (although to me some sort of “UFFD-for-FD” sounds
like something that’d be useful even outside of our guest_memfd use case).

With these challenges in mind, some variant of memfd_secret continues to look
attractive for the non-CoCo case: perhaps a variant that supports in-kernel
faults and provides some way for gfn_to_pfn_cache users like kvm-clock to
restore the direct map entries.

Sean, you mentioned that you envision guest_memfd also supporting non-CoCo
VMs. Do you have some thoughts about how to make the above cases work in the
guest_memfd context?

> --
> Sincerely yours,
> Mike.

Best,
Patrick

[1]: https://lore.kernel.org/kvm/20240404185034.3184582-1-pbonzini@xxxxxxxxxx/T/#m4cc08ce3142a313d96951c2b1286eb290c7d1dac
[2]: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/kvmclock.c#L227
[3]: https://www.kernel.org/doc/html/next/x86/tdx.html#mmio-handling
[4]: https://www.kernel.org/doc/html/next/x86/tdx.html#shared-memory-conversions
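P.S.: For concreteness, the “userspace_addr pointing at a mmap of
guest_memfd” hack mentioned above looked roughly like the sketch below. The
mmap() call assumes a guest_memfd patched to allow shared mappings, which
upstream deliberately rejects; vm_fd and mem_size are assumed to be set up
elsewhere, and the struct/ioctl names are the 6.8 uAPI ones.

	/* Create the guest_memfd backing the memslot. */
	struct kvm_create_guest_memfd gmem = { .size = mem_size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/*
	 * The non-upstream part: a host userspace mapping of guest_memfd
	 * (stock guest_memfd has no ->mmap, so this fails with ENODEV).
	 */
	void *uaddr = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, gmem_fd, 0);

	/* Point the "shared" userspace_addr view at the same pages. */
	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = mem_size,
		.userspace_addr = (__u64)uaddr,
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);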