On Fri, Feb 09, 2024, Steven Price wrote:
> On 16/10/2023 12:50, Michael Roth wrote:
> > In some cases, like with SEV-SNP, guest memory needs to be updated in a
> > platform-specific manner before it can be safely freed back to the host.
> > Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> > allow for special handling of this sort when freeing memory in response
> > to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> > ahead and define an arch-specific hook for x86 since it will be needed
> > for handling memory used for SEV-SNP guests.
>
> Hi all,
>
> Arm CCA has a similar need to prepare/unprepare memory (granule
> delegate/undelegate in our terminology) before it is used for protected
> memory.
>
> However, I see a problem with the current gmem implementation: the
> "invalidations" are not precise enough for our RMI API.  When punching a
> hole in the memfd, the code currently hits the same path (ending in
> kvm_unmap_gfn_range()) as when a VMA is modified in the same range (for
> the shared version).
>
> The Arm CCA architecture doesn't allow protected memory to be removed and
> refaulted without the permission of the guest (the memory contents would
> be wiped in this case).

TDX behaves almost exactly like CCA.  Well, strictly speaking that's not quite
true, as there are TDX APIs that do allow for *temporarily* marking mappings
!PRESENT, but those aren't in play for invalidation events like this.

SNP does allow zapping page table mappings, but fully removing a page, as
PUNCH_HOLE would do, is destructive, so SNP also behaves the same way for all
intents and purposes.

> One option that I've considered is to implement a separate CCA ioctl to
> notify KVM whether the memory should be mapped protected.

That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?

> The invalidations would then be ignored on ranges that are currently
> protected for this guest.

That's backwards.  Invalidations on a guest_memfd should affect only *protected*
mappings.  And for that, the plan/proposal is to plumb only_{shared,private}
flags into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap
shared mappings, and mmu_notifier invalidations don't zap private mappings.

Sample usage in the TDX context[2] (disclaimer: I'm pretty sure I didn't write
most of that patch; I only provided a rough sketch).

[1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@xxxxxxxxxx
[2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@xxxxxxxxx

> This 'solves' the problem nicely except for the case where the VMM
> deliberately punches holes in memory which the guest is using.

I don't see what problem there is to solve in this case.  PUNCH_HOLE is
destructive, so don't do that.

> The issue in this case is that there's no way of failing the punch hole
> operation - we can detect that the memory is in use and shouldn't be
> freed, but this callback doesn't give the opportunity to actually block
> the freeing of the memory.

Why is this KVM's problem?  E.g. the same exact thing happens without
guest_memfd if userspace munmap()s memory the guest is using.

> Sadly there's no easy way to map from a physical page in a gmem back to
> which VM (and where in the VM) the page is mapped. So actually ripping
> the page out of the appropriate VM isn't really possible in this case.

I don't follow.  guest_memfd has a 1:1 binding with a VM *and* a gfn, how can
you not know what exactly needs to be invalidated?
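To make that concrete, here's roughly what the existing invalidation path does
to resolve a punched range back to gfns.  This is a simplified sketch of what
kvm_gmem_invalidate_begin() does in Linus' tree, not the verbatim code; names
are approximate, and the locking and mmu_invalidate_{begin,end} bookkeeping are
omitted:

/*
 * Sketch only: each binding in the gmem file's xarray records the memslot
 * it backs, and the memslot records the file offset (slot->gmem.pgoff) it
 * is bound at, so a hole punched over [start, end) in the file resolves
 * directly to a gfn range in exactly one VM.
 */
static void gmem_invalidate_range(struct kvm_gmem *gmem, pgoff_t start,
				  pgoff_t end)
{
	struct kvm_memory_slot *slot;
	unsigned long index;

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		pgoff_t pgoff = slot->gmem.pgoff;
		struct kvm_gfn_range gfn_range = {
			.slot		= slot,
			.start		= slot->base_gfn + max(start, pgoff) - pgoff,
			.end		= slot->base_gfn + min(end, pgoff + slot->npages) - pgoff,
			.may_block	= true,
		};

		/* Upstream batches the TLB flush; done per-slot here for brevity. */
		if (kvm_mmu_unmap_gfn_range(gmem->kvm, &gfn_range))
			kvm_flush_remote_tlbs(gmem->kvm);
	}
}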
> How is this situation handled on x86? Is it possible to invalidate and
> then refault a protected page without affecting the memory contents? My
> guess is yes, and that this is a CCA-specific problem - is my
> understanding correct?
>
> My current thoughts for CCA are one of three options:
>
> 1. Represent shared and protected memory as two separate memslots. This
> matches the underlying architecture more closely (the top address bit is
> repurposed as a 'shared' flag), but I don't like it because it's a
> deviation from other CoCo architectures (notably pKVM).
>
> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
> guest's protected space. Again, this is CCA being different, and it also
> creates nasty corner cases where the gmem descriptor could have to
> outlive the VMM - so it looks like a potential source of memory leaks.
>
> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
> yet prototyped it, but it might be possible to simply provide a flag from
> kvm_gmem_invalidate_begin specifying that the invalidation is for the
> protected memory. KVM would then only unmap the protected memory when
> this flag is set (avoiding issues with VMA updates causing spurious
> unmaps).
>
> Fairly obviously (3) is my preferred option, but it relies on the
> guarantee that the "invalidation" covers precisely the set of addresses
> where the memory is actually being freed.

#3 is what we are planning for x86, and except for the only_{shared,private}
flags, the requisite functionality should already be in Linus' tree, though it
does need to be wired up for ARM.
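To make the shape of #3 concrete, a minimal sketch of the missing piece,
assuming the flag names from the only_{shared,private} proposal in [1]; the
exact field layout and plumbing may well differ from what eventually lands:

/*
 * Illustrative only: the two proposed flags grafted onto kvm_gfn_range
 * (other existing fields omitted).  guest_memfd invalidations would set
 * only_private so that only the protected mappings are zapped, while
 * mmu_notifier/VMA-driven invalidations would set only_shared so that the
 * protected mappings are left untouched.
 */
struct kvm_gfn_range {
	struct kvm_memory_slot *slot;
	gfn_t start;
	gfn_t end;
	bool only_private;	/* set by the guest_memfd invalidation path */
	bool only_shared;	/* set by the mmu_notifier invalidation path */
	bool may_block;
};

Arch code (e.g. the RMM-backed stage-2 handling on ARM) would then honor the
flags and skip whichever half of the mappings the invalidation excludes.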