Hi all,

On 3/9/24 11:14, Mike Rapoport wrote:
>>> With this in mind, what’s the best way to solve getting guest RAM out of
>>> the direct map? Is memfd_secret integration with KVM the way to go, or
>>> should we build a solution on top of guest_memfd, for example via some
>>> flag that causes it to leave memory in the host userspace’s page tables,
>>> but removes it from the direct map?
>>
>> memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
>> sure you'll be fighting memfd_secret all the way. E.g. it's not dumpable, it
>> deliberately allocates at 4KiB granularity (though I suspect the bug you found
>> means that it can be inadvertently mapped with 2MiB hugepages), it has no line
>> of sight to taking userspace out of the equation, etc.
>>
>> With guest_memfd on the other hand, everyone contributing to and maintaining it
>> has goals that are *very* closely aligned with what you want to do.
>
> I agree with Sean, guest_memfd seems a better interface to use. It's
> integrated by design with KVM and removing guest memory from the direct map
> looks like a natural enhancement to guest_memfd.
>
> Unless I'm missing something, for fast-and-dirty POC it'll be a oneliner
> that adds set_memory_np() to kvm_gmem_get_folio() and then figuring out
> what to do with virtio :)

We’ve been playing around with extending guest_memfd to remove guest memory
from the direct map. The direct map removal itself is indeed fairly
straightforward (a rough sketch of what it looks like is below): since
guest_memfd cannot be mapped into userspace, we don’t need to worry about
folios without direct map entries ending up in places where they would cause
kernel panics.

However, we ran into problems running non-CoCo VMs with guest_memfd as guest
memory, independent of whether direct map entries are available or not.
There’s a handful of places where a traditional KVM / userspace setup
currently touches guest memory:

* Loading the guest kernel into guest-owned memory
* Instruction fetch from arbitrary guest addresses and guest page table
  walks for MMIO emulation (for example for IOAPIC accesses)
* kvm-clock
* I/O devices

With guest_memfd, if the guest is running from guest-private memory, all of
these need to be rethought, since the memory is now unavailable to userspace,
and KVM is not enlightened about guest_memfd’s existence everywhere (when I
was experimenting with this, KVM generally read garbage data from the shared
VMA, but I think I’ve since seen some patches floating around that would make
it return -EFAULT instead).

CoCo VMs have various methods for working around these: the guest kernel is
loaded using some “populate on first access” mechanism [1], kvm-clock and I/O
are solved by having the guest mark the relevant address ranges as “shared”
ahead of time [2] and bounce buffering via swiotlb [4], and Intel TDX solves
the instruction emulation problem for MMIO by injecting a #VE and having the
guest do the emulation itself [3].

For non-CoCo VMs, where memory is not encrypted and the threat model assumes
a trusted host userspace, we would like to avoid changing the VM model so
completely. If we adopted CoCo’s approaches for all the places where KVM /
userspace touches guest memory, we would get all of the complexity, yet none
of the encryption. Particularly the complexity on the MMIO path seems nasty,
but since x86 does not pre-decode instructions on MMIO exits (which are just
EPT_VIOLATIONs) the way it does for PIO exits, I don’t really see a way
around it in the guest_memfd model.
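To make the direct map removal part concrete: below is a minimal sketch of
Mike’s suggested one-liner, dropped into a simplified version of
kvm_gmem_get_folio() from virt/kvm/guest_memfd.c. This is illustrative only,
not the actual patch; the real function also handles the uptodate/zeroing
logic, which I’m eliding here.

	/*
	 * Sketch only: kvm_gmem_get_folio() with Mike's set_memory_np()
	 * one-liner added. set_memory_np() comes from <asm/set_memory.h>
	 * on x86.
	 */
	static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
	{
		struct folio *folio;

		folio = filemap_grab_folio(inode->i_mapping, index);
		if (IS_ERR_OR_NULL(folio))
			return NULL;

		/*
		 * Zap the folio's direct map entries; any kernel access
		 * through the direct map alias will fault from here on.
		 */
		set_memory_np((unsigned long)folio_address(folio),
			      folio_nr_pages(folio));

		return folio;
	}

The removal really is that simple; the hard part is everything around it,
i.e. all the places listed above that can no longer reach the memory.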
We’ve played around a lot with allowing userspace mappings of guest_memfd,
and then having KVM internally access guest_memfd via userspace page tables
(and came up with multiple hacky ways to boot simple Linux initrds from
guest_memfd), but this is fairly awkward for two reasons:

1. Lots of code paths in KVM now end up accessing guest_memfd, which from my
   understanding goes against the guest_memfd goal of making machine checks
   caused by incorrect accesses to TDX memory impossible, and
2. We need to somehow get a userspace mapping of guest_memfd into KVM. A
   hacky way I could make this work was setting up
   kvm_userspace_memory_region2 with userspace_addr pointing at a mmap of the
   guest memory (a sketch is at the end of this mail), which actually "works"
   for everything but kvm-clock, but I later realized that this is just
   memfd_secret with extra steps.

We also played around with having KVM access guest_memfd through the direct
map (by temporarily reinserting pages into it when needed), but this again
means lots of KVM code learns how to access guest RAM via guest_memfd.

There are a few other features we need to support, such as serving page
faults using UFFD, which we are not too sure how to realize with guest_memfd,
since UFFD is VMA-based (although to me some sort of “UFFD-for-FD” sounds
like something that’d be useful even outside of our guest_memfd use case).

With these challenges in mind, some variant of memfd_secret continues to look
attractive for the non-CoCo case: perhaps a variant that supports in-kernel
faults and provides some way for gfn_to_pfn_cache users like kvm-clock to
restore the direct map entries.

Sean, you mentioned that you envision guest_memfd also supporting non-CoCo
VMs. Do you have some thoughts about how to make the above cases work in the
guest_memfd context?

> --
> Sincerely yours,
> Mike.

Best,
Patrick

[1]: https://lore.kernel.org/kvm/20240404185034.3184582-1-pbonzini@xxxxxxxxxx/T/#m4cc08ce3142a313d96951c2b1286eb290c7d1dac
[2]: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/kvmclock.c#L227
[3]: https://www.kernel.org/doc/html/next/x86/tdx.html#mmio-handling
[4]: https://www.kernel.org/doc/html/next/x86/tdx.html#shared-memory-conversions
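P.S.: For concreteness, the “userspace_addr pointing at a mmap of
guest_memfd” hack mentioned above looked roughly like the sketch below. The
mmap() call assumes a guest_memfd patched to allow shared mappings, which
upstream deliberately rejects; vm_fd and mem_size are assumed to be set up
elsewhere, and the struct/ioctl names are the 6.8 uAPI ones.

	/* Create the guest_memfd backing the memslot. */
	struct kvm_create_guest_memfd gmem = { .size = mem_size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/*
	 * The non-upstream part: a host userspace mapping of guest_memfd
	 * (stock guest_memfd has no ->mmap, so this fails with ENODEV).
	 */
	void *uaddr = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, gmem_fd, 0);

	/* Point the "shared" userspace_addr view at the same pages. */
	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = mem_size,
		.userspace_addr = (__u64)uaddr,
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);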