On Mon, Jul 15, 2024 at 10:57:35PM +0000, Edgecombe, Rick P wrote: > On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote: > > Makes sense. > > > > If we document mutual exclusion between ranges touched by > > gmem_populate() vs ranges touched by actual userspace issuance of > > KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users > > don't abide by the documentation), then I think most problems go away... > > > > But there is still at least one awkward constraint for SNP: > > KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after > > SNP_LAUNCH_START is called. This is true even if the GPA range is not > > one of the ranges that will get passed to > > gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when > > binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware > > will perform checks to make sure that ASID is not already being used in > > the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered > > for a private page before calling SNP_LAUNCH_START. > > > > So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued > > before SNP_LAUNCH_START. So it makes me wonder if we should just broaden > > that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to > > finalizing a guest, since it'll be easier to lift that restriction later > > versus discovering some other sort of edge case and need to > > retroactively place restrictions. > > > > I've taken Isaku's original pre_fault_memory_test and added a new > > x86-specific coco_pre_fault_memory_test to try to better document and > > exercise these corner cases for SEV and SNP, but I'm hoping it could > > also be useful for TDX (hence the generic name). These are based on > > Pratik's initial SNP selftests (which are in turn based on kvm/queue + > > these patches): > > > > https://github.com/mdroth/linux/commits/snp-uptodate0-kst/ > > > > https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74 > > > > > > From the TDX side it wouldn't be horrible to not have to worry about userspace > mucking around with the mirrored page tables in unexpected ways during the > special period. TDX already has its own "finalized" state in kvm_tdx, is there > something similar on the SEV side we could unify with? While trying to doing pre-mapping in sev_gmem_post_populate() like you've done below for TDX, I hit another issue that I think would be avoided by enforcing finalization (or otherwise needs some alternative fix within this series). I'd already mentioned the issue with gmem_prepare() putting pages in an unexpected RMP state if called prior to initializing the same GPA range in gmem_populate(). This get handled gracefully however when the firmware call is issued to encrypt/measure the pages and it re-checks the page's RMP state. However, if another thread is issuing KVM_PRE_FAULT_MEMORY and triggers a gmem_prepare() just after gmem_populate() places the pages in a private RMP state, then gmem_prepare()->kvm_gmem_get_pfn() will trigger the clear_highpage() call in kvm_gmem_prepare_folio() because the uptodate flag isn't set until after sev_gmem_post_populate() returns. So I ended up just calling kvm_arch_vcpu_pre_fault_memory() on the full GPA range after the post-populate callback returns: https://github.com/mdroth/linux/commit/da4e7465ced1a708ff1c5e9ab27c570de7e8974e ... But a much larger concern is that even without this series, but just plain kvm/queue, KVM_PRE_FAULT_MEMORY can still race with sev_gmem_post_populate() in bad ways, so I think for 6.11 we should consider locking this finalization requirement in and applying something like the below fix: https://github.com/mdroth/linux/commit/7095ba6ee8f050af11620daaf6c2219fd0bbb1c3 Or if that's potentially too big a chance to decide right now, we could just make KVM_PRE_FAULT_MEMORY return EOPNOTSUPP for kvm_arch_has_private_memory() until we've had more time to work out the missing bits for CoCo there. -Mike > > I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly calling > kvm_tdp_map_page(), so we could potentially put whatever check in > kvm_arch_vcpu_pre_fault_memory(). It required a couple exports: > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 03737f3aaeeb..9004ac597a85 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled; > #endif > > int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t > *pfn); > +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 > *level); > > static inline bool kvm_memslots_have_rmaps(struct kvm *kvm) > { > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 7bb6b17b455f..4a3e471ec9fe 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct > kvm_page_fault *fault) > return direct_page_fault(vcpu, fault); > } > > -static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, > - u8 *level) > +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 > *level) > { > int r; > > @@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t > gpa, u64 error_code, > return -EIO; > } > } > +EXPORT_SYMBOL_GPL(kvm_tdp_map_page); > > long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu, > struct kvm_pre_fault_memory *range) > @@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) > out: > return r; > } > +EXPORT_SYMBOL_GPL(kvm_mmu_load); > > void kvm_mmu_unload(struct kvm_vcpu *vcpu) > { > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 9ac0821eb44b..7161ef68f3da 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg { > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > void __user *src, int order, void *_arg) > { > + u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_vcpu *vcpu = arg->vcpu; > struct kvm_memory_slot *slot; > gpa_t gpa = gfn_to_gpa(gfn); > + u8 level = PG_LEVEL_4K; > struct page *page; > kvm_pfn_t mmu_pfn; > int ret, i; > @@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t > gfn, kvm_pfn_t pfn, > goto out_put_page; > } > > + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level); > + if (ret < 0) > + goto out_put_page; > + > read_lock(&kvm->mmu_lock); > > ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn); > @@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, > struct kvm_tdx_cmd *c > mutex_lock(&kvm->slots_lock); > idx = srcu_read_lock(&kvm->srcu); > > + kvm_mmu_reload(vcpu); > ret = 0; > while (region.nr_pages) { > if (signal_pending(current)) { >