Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 01.09.21 17:54, Andy Lutomirski wrote:
On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote:
Do we have to protect from that? How would KVM protect from user space
replacing private pages by shared pages in any of the models we discuss?

The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
as both private and shared, where "shared" also incorporates any mapping from the
host.  Essentially it boils down to the kernel ensuring that a pfn is unmapped
before it's converted to/from private, and KVM ensuring that it honors any
unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
as proposed in this RFC.

Okay, so the fallocate(PUNCHHOLE) from user space could trigger the
respective unmapping and freeing of backing storage.


As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
backing-store is destroyed; the backing-store needs to notify downstream MMUs
(a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.

Right.


[*] Whether or not the kernel's direct mapping needs to be removed is debatable,
      but my argument is that that behavior is not visible to userspace and thus
      out of scope for this discussion, e.g. zapping/restoring the direct map can
      be added/removed without impacting the userspace ABI.

Right. Removing it shouldn't also be requited IMHO. There are other ways
to teach the kernel to not read/write some online pages (filter
/proc/kcore, disable hibernation, strict access checks for /dev/mem ...).


Define "ordinary" user memory slots as overlay on top of "encrypted" memory
slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
user memory slot. When creating a "encryped" user memory slot, require that
the whole VMA is covered at creation time. You know the VMA can't change
later.

This can work for the basic use cases, but even then I'd strongly prefer not to
tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
the virtual address of a memslot, and when it does care, it tends to do poorly,
e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
to handle mprotect()/munmap()/etc...

Right, and for the existing use cases this worked. But encrypted memory
breaks many assumptions we once made ...

I have somewhat mixed feelings about pages that are mapped into $WHATEVER
page tables but not actually mapped into user space page tables. There is no
way to reach these via the rmap.

We have something like that already via vfio. And that is fundamentally
broken when it comes to mmu notifiers, page pinning, page migration, ...

I'm not super familiar with VFIO internals, but the idea with the fd-based
approach is that the backing-store would be in direct communication with KVM and
would handle those operations through that direct channel.

Right. The problem I am seeing is that e.g., try_to_unmap() might not be
able to actually fully unmap a page, because some non-synchronized KVM
MMU still maps a page. It would be great to evaluate how the fd
callbacks would fit into the whole picture, including the current rmap.

I guess I'm missing the bigger picture how it all fits together on the
!KVM side.

The big picture is fundamentally a bit nasty.  Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm).  Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back.

With guest-private memory, this doesn't work.  Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes.  TDX does not have that capability.  (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.)  SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea.

So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap().

Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap().  As I understand it, it's more of:

look up the one-and-only guest and GPA at which this page is mapped.
tear down the NPT PTE.  (SEV, unlike TDX, uses paging structures in normal memory.)
Ask the magic SEV mechanism to turn the page into swappable data
Go to town.

This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store.

If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency.

If you want to swap a page on TDX, you can't.  Sorry, go directly to jail, do not collect $200.

So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX.  If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option.

try_to_unmap() would actually do the right thing I think. It's the users that access page content (migration, swapping) that need additional care to special-case these pages. Completely agree that these would be broken.


--
Thanks,

David / dhildenb





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux