On Thu, Jun 15, 2023 at 01:12:18PM -0700, isaku.yamahata@xxxxxxxxx wrote: > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx> > > TDX and SEV-SNP need to know the reason for a callback by > kvm_unmap_gfn_range(). mmu notifier, set memory attributes ioctl or KVM > gmem callback. The callback handler changes the behavior or does the > additional housekeeping operation. For mmu notifier, it's zapping shared > PTE. For set memory attributes, it's the conversion of memory attributes > (private <=> shared). For KVM gmem, it's punching a hole in the range, and > releasing the file. I think it's still an open topic that we need to hear more from Sean about: https://lore.kernel.org/lkml/20230522235838.ov3722lcusotzlvo@xxxxxxx/ but I *think* we were leaning toward decoupling the act of invalidating GFNs, vs. the act of invalidating/free'ing gmem pages. One concrete example of where this seperation makes sense if with hole-punching. SNP has unique platform-specific stuff it has to do before free'ing that gmem page back to the host. If we try to plumb this through kvm_unmap_gfn_range() via a special flag then it's a little awkward because: a) Presumably that hole-punch would have occurred after a preceeding KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared state in the xarray. So all it should really need to do is handle that platform-specific behavior, like updating RMP table in case of SNP. But none of the other details like GFN ranges are relevant in that case, RMP updates here only need the PFN, so we end up walking memslots to do GFN->PFN translations, when it would actually be much more efficient to do these translations by translating the hole-punched FD offset range to the corresponding folio()'s backing those ranges b) It places an unecessary dependency on having an active memslot to do those translations. This ends up not being an issue with current version of gmem patchset because the release() happens *before* gmem_unbind(), so there is a memslot associated with the ranges at gmem_release() time, but in the initial version of gmem it was the reverse, so if things ever changed again in this regard we'd once again have to completely rework how to issue these platform-specific invalidation callbacks. I really *really* like having a separate, simple invalidation mechanism in place that just converts FD offsets to PFNs and then passes those on to platform-defined handlers to clean up pages before free'ing them back to the system. It's versatile in that it can be called pretty much anywhere regardless of where we are in KVM lifecycle, it's robust in that it doesn't rely on unecessary outside dependencies, and it avoids added uneeded complexity to paths like kvm_unmap_gfn_range(). That's the approach taken with SNP hypervisor v9 series, with the gmem hook being introduced here: https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@xxxxxxx/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043 and the SEV-SNP implementation of that hook being here: https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@xxxxxxx/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e Would a similar approach work for TDX? At least WRT cleaning up pages before returning them back to the host? If we could isolate that requirement/handling from all the other aspects of invalidations it really seems like it would cause us less headaches down the road. > > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx> > --- > include/linux/kvm_host.h | 11 ++++++++++- > virt/kvm/guest_mem.c | 10 +++++++--- > virt/kvm/kvm_main.c | 4 +++- > 3 files changed, 20 insertions(+), 5 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 1a47cedae8a1..c049c0aa44d6 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > #endif > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > + > +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR BIT(0) Can you go into more detail on why special handling is needed for SET_MEM_ATTR? > +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE BIT(1) > +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE BIT(2) Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the TDX case if you take the above approach? For SNP, the answer is yes. If that's also the case for TDX I think that would be another argument in favor of decoupling these from existing KVM MMU invalidation path. -Mike > + > struct kvm_gfn_range { > struct kvm_memory_slot *slot; > gfn_t start; > gfn_t end; > - pte_t pte; > + union { > + pte_t pte; > + u64 attrs; > + }; > bool may_block; > + unsigned int flags; > }; > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c > index cdf2d84683c8..30b8f66784d4 100644 > --- a/virt/kvm/guest_mem.c > +++ b/virt/kvm/guest_mem.c > @@ -99,7 +99,8 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index) > } > > static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem, > - pgoff_t start, pgoff_t end) > + pgoff_t start, pgoff_t end, > + unsigned int flags) > { > struct kvm_memory_slot *slot; > unsigned long index; > @@ -118,6 +119,7 @@ static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem, > .slot = slot, > .pte = __pte(0), > .may_block = true, > + .flags = flags, > }; > > kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end); > @@ -156,7 +158,8 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len) > */ > filemap_invalidate_lock(file->f_mapping); > > - kvm_gmem_invalidate_begin(kvm, gmem, start, end); > + kvm_gmem_invalidate_begin(kvm, gmem, start, end, > + KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE); > > truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1); > > @@ -263,7 +266,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file) > * Free the backing memory, and more importantly, zap all SPTEs that > * pointed at this file. > */ > - kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul); > + kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul, > + KVM_GFN_RANGE_FLAGS_GMEM_RELEASE); > truncate_inode_pages_final(file->f_mapping); > kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul); > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 422d49634c56..9cdfa2fb675f 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -613,6 +613,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, > gfn_range.start = hva_to_gfn_memslot(hva_start, slot); > gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot); > gfn_range.slot = slot; > + gfn_range.flags = 0; > > if (!locked) { > locked = true; > @@ -2391,8 +2392,9 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs, > bool flush = false; > int i; > > - gfn_range.pte = __pte(0); > + gfn_range.attrs = attrs; > gfn_range.may_block = true; > + gfn_range.flags = KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR; > > for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { > slots = __kvm_memslots(kvm, i); > -- > 2.25.1 >