On Fri, Feb 09, 2024, Steven Price wrote:
> On 16/10/2023 12:50, Michael Roth wrote:
> > In some cases, like with SEV-SNP, guest memory needs to be updated in a
> > platform-specific manner before it can be safely freed back to the host.
> > Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> > allow for special handling of this sort when freeing memory in response
> > to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> > ahead and define an arch-specific hook for x86 since it will be needed
> > for handling memory used for SEV-SNP guests.
>
> Hi all,
>
> Arm CCA has a similar need to prepare/unprepare memory (granule
> delegate/undelegate in our terminology) before it is used for protected
> memory.
>
> However, I see a problem with the current gmem implementation: the
> "invalidations" are not precise enough for our RMI API.  When punching a
> hole in the memfd, the code currently hits the same path (ending in
> kvm_unmap_gfn_range()) as when a VMA is modified in the same range (for
> the shared version).
>
> The Arm CCA architecture doesn't allow protected memory to be removed and
> refaulted without the permission of the guest (the memory contents would
> be wiped in this case).

TDX behaves almost exactly like CCA.  Well, strictly speaking that's not quite
true, as there are TDX APIs that do allow for *temporarily* marking mappings
!PRESENT, but those aren't in play for invalidation events like this.

SNP does allow zapping page table mappings, but fully removing a page, as
PUNCH_HOLE would do, is destructive, so SNP also behaves the same way for all
intents and purposes.

> One option that I've considered is to implement a separate CCA ioctl to
> notify KVM whether the memory should be mapped protected.

That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?

> The invalidations would then be ignored on ranges that are currently
> protected for this guest.

That's backwards.  Invalidations on a guest_memfd should affect only *protected*
mappings.  And for that, the plan/proposal is to plumb only_{shared,private}
flags into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap
shared mappings, and mmu_notifier invalidations don't zap private mappings.

Sample usage in the TDX context[2] (disclaimer: I'm pretty sure I didn't write
most of that patch; I only provided a rough sketch).

[1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@xxxxxxxxxx
[2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@xxxxxxxxx

> This 'solves' the problem nicely except for the case where the VMM
> deliberately punches holes in memory which the guest is using.

I don't see what problem there is to solve in this case.  PUNCH_HOLE is
destructive, so don't do that.

> The issue in this case is that there's no way of failing the punch hole
> operation - we can detect that the memory is in use and shouldn't be
> freed, but this callback doesn't give the opportunity to actually block
> the freeing of the memory.

Why is this KVM's problem?  E.g. the same exact thing happens without
guest_memfd if userspace munmap()s memory the guest is using.

> Sadly there's no easy way to map from a physical page in a gmem back to
> which VM (and where in the VM) the page is mapped. So actually ripping
> the page out of the appropriate VM isn't really possible in this case.

I don't follow.  guest_memfd has a 1:1 binding with a VM *and* a gfn, how can
you not know what exactly needs to be invalidated?
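To make that concrete, here's roughly what the existing invalidation path does
to resolve a punched range back to gfns.  This is a simplified sketch of what
kvm_gmem_invalidate_begin() does in Linus' tree, not the verbatim code; names
are approximate, and the locking and mmu_invalidate_{begin,end} bookkeeping are
omitted:

/*
 * Sketch only: each binding in the gmem file's xarray records the memslot
 * it backs, and the memslot records the file offset (slot->gmem.pgoff) it
 * is bound at, so a hole punched over [start, end) in the file resolves
 * directly to a gfn range in exactly one VM.
 */
static void gmem_invalidate_range(struct kvm_gmem *gmem, pgoff_t start,
				  pgoff_t end)
{
	struct kvm_memory_slot *slot;
	unsigned long index;

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		pgoff_t pgoff = slot->gmem.pgoff;
		struct kvm_gfn_range gfn_range = {
			.slot		= slot,
			.start		= slot->base_gfn + max(start, pgoff) - pgoff,
			.end		= slot->base_gfn + min(end, pgoff + slot->npages) - pgoff,
			.may_block	= true,
		};

		/* Upstream batches the TLB flush; done per-slot here for brevity. */
		if (kvm_mmu_unmap_gfn_range(gmem->kvm, &gfn_range))
			kvm_flush_remote_tlbs(gmem->kvm);
	}
}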
> How is this situation handled on x86? Is it possible to invalidate and
> then refault a protected page without affecting the memory contents? My
> guess is yes, and that this is a CCA-specific problem - is my
> understanding correct?
>
> My current thoughts for CCA are one of three options:
>
> 1. Represent shared and protected memory as two separate memslots. This
> matches the underlying architecture more closely (the top address bit is
> repurposed as a 'shared' flag), but I don't like it because it's a
> deviation from other CoCo architectures (notably pKVM).
>
> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
> guest's protected space. Again, this is CCA being different, and it also
> creates nasty corner cases where the gmem descriptor could have to
> outlive the VMM - so it looks like a potential source of memory leaks.
>
> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
> yet prototyped it, but it might be possible to simply provide a flag from
> kvm_gmem_invalidate_begin specifying that the invalidation is for the
> protected memory. KVM would then only unmap the protected memory when
> this flag is set (avoiding issues with VMA updates causing spurious
> unmaps).
>
> Fairly obviously (3) is my preferred option, but it relies on the
> guarantee that the "invalidation" covers precisely the set of addresses
> where the memory is actually being freed.

#3 is what we are planning for x86, and except for the only_{shared,private}
flags, the requisite functionality should already be in Linus' tree, though it
does need to be wired up for ARM.
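To make the shape of #3 concrete, a minimal sketch of the missing piece,
assuming the flag names from the only_{shared,private} proposal in [1]; the
exact field layout and plumbing may well differ from what eventually lands:

/*
 * Illustrative only: the two proposed flags grafted onto kvm_gfn_range
 * (other existing fields omitted).  guest_memfd invalidations would set
 * only_private so that only the protected mappings are zapped, while
 * mmu_notifier/VMA-driven invalidations would set only_shared so that the
 * protected mappings are left untouched.
 */
struct kvm_gfn_range {
	struct kvm_memory_slot *slot;
	gfn_t start;
	gfn_t end;
	bool only_private;	/* set by the guest_memfd invalidation path */
	bool only_shared;	/* set by the mmu_notifier invalidation path */
	bool may_block;
};

Arch code (e.g. the RMM-backed stage-2 handling on ARM) would then honor the
flags and skip whichever half of the mappings the invalidation excludes.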