Re: Unmapping KVM Guest Memory from Host Kernel

On Fri, Mar 08, 2024, James Gowans wrote:
> However, memfd_secret doesn’t work out of the box for KVM guest memory; the
> main reason seems to be that the GUP path is intentionally disabled for
> memfd_secret, so if we use a memfd_secret backed VMA for a memslot then
> KVM is not able to fault the memory in. If it’s been pre-faulted in by
> userspace then it seems to work.

Huh, that _shouldn't_ work.  The folio_is_secretmem() in gup_pte_range() is
supposed to prevent the "fast gup" path from getting secretmem pages.
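
For anyone following along, the check in question looks roughly like this in
upstream mm/gup.c (paraphrased from memory, so treat it as a sketch rather than
an exact excerpt; it's kernel-internal code, not a standalone example):

```c
/* gup_pte_range(), abridged: the fast-gup walker is supposed to notice
 * a secretmem folio, drop its reference, and bail out, which forces a
 * fallback to the slow GUP path -- where secretmem is rejected outright. */
if (unlikely(folio_is_secretmem(folio))) {
	gup_put_folio(folio, 1, flags);
	goto pte_unmap;
}
```

If pre-faulted secretmem pages are getting through fast gup, either this check
is being skipped or the folio test is misfiring.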

Is this on an upstream kernel?  If so, and if you have bandwidth, can you figure
out why that isn't working?  At the very least, I suspect the memfd_secret
maintainers would be very interested to know that it's possible to fast gup
secretmem.

> There are a few other issues around when KVM accesses the guest memory.
> For example the KVM PV clock code goes directly to the PFN via the
> pfncache, and that also breaks if the PFN is not in the direct map, so
> we’d need to change that sort of thing, perhaps going via userspace
> addresses.
> 
> If we remove the memfd_secret check from the GUP path, and disable KVM’s
> pvclock from userspace via KVM_CPUID_FEATURES, we are able to boot a
> simple Linux initrd using a Firecracker VMM modified to use
> memfd_secret.
> 
> We are also aware of ongoing work on guest_memfd. The current
> implementation unmaps guest memory from VMM address space, but leaves it
> in the kernel’s direct map. We’re not looking at unmapping from VMM
> userspace yet; we still need guest RAM there for PV drivers like virtio
> to continue to work. So KVM’s gmem doesn’t seem like the right solution?

We (and by "we", I really mean the pKVM folks) are also working on allowing
userspace to mmap() guest_memfd[*].  pKVM aside, the long term vision I have for
guest_memfd is to be able to use it for non-CoCo VMs, precisely for the security
and robustness benefits it can bring.

What I am hoping to do with guest_memfd is to get userspace to map only the
memory it needs, e.g. for emulated/synthetic devices, and to do so on-demand,
i.e. to get to a state where guest memory is mapped into userspace only when it
needs to be.  More below.

> With this in mind, what’s the best way to solve getting guest RAM out of
> the direct map? Is memfd_secret integration with KVM the way to go, or
> should we build a solution on top of guest_memfd, for example via some
> flag that causes it to leave memory in the host userspace’s page tables,
> but removes it from the direct map? 

100% enhance guest_memfd.  If you're willing to wait long enough, pKVM might even
do all the work for you. :-)

The killer feature of guest_memfd is that it allows the guest mappings to be a
superset of the host userspace mappings.  Most obviously, it allows mapping memory
into the guest without first mapping the memory into the userspace page
tables.  More subtly, it also makes it easier (in theory) to do things like map
the memory with 1GiB hugepages for the guest, but selectively map at 4KiB granularity
in the host.  Or map memory as RWX in the guest, but RO in the host (I don't have
a concrete use case for this, just pointing out it'll be trivial to do once
guest_memfd supports mmap()).

Every attempt to allow mapping VMA-based memory into a guest without it being
accessible by host userspace has failed; it's literally why we ended up
implementing guest_memfd.  We could teach KVM to do the same with memfd_secret,
but we'd just end up re-implementing guest_memfd.

memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
sure you'll be fighting memfd_secret all the way.  E.g. it's not dumpable, it
deliberately allocates at 4KiB granularity (though I suspect the bug you found
means that it can be inadvertently mapped with 2MiB hugepages), it has no line
of sight to taking userspace out of the equation, etc.

With guest_memfd on the other hand, everyone contributing to and maintaining it
has goals that are *very* closely aligned with what you want to do.

[*] https://lore.kernel.org/all/20240222161047.402609-1-tabba@xxxxxxxxxx




