On Mon, Jul 11, 2022 at 04:28:18PM +0800, Yuan Yao <yuan.yao@xxxxxxxxxxxxxxx> wrote:
> On Mon, Jun 27, 2022 at 02:53:38PM -0700, isaku.yamahata@xxxxxxxxx wrote:
> > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > Allocate a mirrored private page table for the private page table, and add
> > hooks to operate on the mirrored private page table. This patch adds only
> > hooks. As kvm_gfn_shared_mask() always returns false, those hooks aren't
> > called yet.
> >
> > Because the private guest page is protected, page copy via mmu_notifier to
> > migrate the page doesn't work. A callback from the backing store is needed.
> >
> > When the faulting GPA is private, the KVM fault is also called private.
> > When resolving the private KVM page fault, allocate a mirrored private page
> > table and call hooks to operate on the mirrored private page table. On a
> > change of the private PTE entry, invoke the kvm_x86_ops hook in
> > __handle_changed_spte() to propagate the change to the mirrored private
> > page table. The following depicts the relationship.
> >
> >   private KVM page fault    |
> >       |                     |
> >       V                     |
> >   private GPA               |
> >       |                     |
> >       V                     |
> >   KVM private PT root       |  CPU private PT root
> >       |                     |        |
> >       V                     |        V
> >    private PT ---hook to mirror--->mirrored private PT
> >       |                     |        |
> >       \---------------------+------\ |
> >                             |      | |
> >                             |      V V
> >                             |   private guest page
> >                             |
> >                             |
> >   non-encrypted memory      |    encrypted memory
> >                             |
> > PT: page table
> >
> > The existing KVM TDP MMU code uses atomic updates of the SPTE. On
> > populating the EPT entry, the entry is set atomically. However, zapping an
> > SPTE requires a TLB shootdown. To address it, the entry is frozen with a
> > special SPTE value that clears the present bit. After the TLB shootdown,
> > the entry is set to the eventual value (unfreeze).
> >
> > For the mirrored private page table, hooks are called to update the
> > mirrored private page table in addition to the direct access to the
> > private SPTE. For the zapping case, freezing the SPTE works: the hooks can
> > be called in addition to the TLB shootdown. For populating the private
> > SPTE entry, there can be a race condition without further protection:
> >
> >   vcpu 1: populating 2M private SPTE
> >   vcpu 2: populating 4K private SPTE
> >   vcpu 2: TDX SEAMCALL to update 4K mirrored private SPTE => error
> >   vcpu 1: TDX SEAMCALL to update 2M mirrored private SPTE
> >
> > To avoid the race, the frozen SPTE is utilized. Instead of an atomic
> > update of the private entry, freeze the entry, call the hook that updates
> > the mirrored private SPTE, then set the entry to the final value.
> >
> > Support 4K pages only at this stage. 2M page support can be done in
> > future patches.
> >
> > Add an is_private member to kvm_page_fault to indicate the fault is
> > private. Also add an is_private member to struct tdp_iter to propagate it.
> >
> > Co-developed-by: Kai Huang <kai.huang@xxxxxxxxx>
> > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> >  arch/x86/include/asm/kvm-x86-ops.h |   2 +
> >  arch/x86/include/asm/kvm_host.h    |  20 +++
> >  arch/x86/kvm/mmu/mmu.c             |  86 +++++++++-
> >  arch/x86/kvm/mmu/mmu_internal.h    |  37 +++++
> >  arch/x86/kvm/mmu/paging_tmpl.h     |   2 +-
> >  arch/x86/kvm/mmu/tdp_iter.c        |   1 +
> >  arch/x86/kvm/mmu/tdp_iter.h        |   5 +-
> >  arch/x86/kvm/mmu/tdp_mmu.c         | 247 +++++++++++++++++++++++------
> >  arch/x86/kvm/mmu/tdp_mmu.h         |   7 +-
> >  virt/kvm/kvm_main.c                |   1 +
> >  10 files changed, 346 insertions(+), 62 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 32a6df784ea6..6982d57e4518 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -93,6 +93,8 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> >  KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> >  KVM_X86_OP(get_mt_mask)
> >  KVM_X86_OP(load_mmu_pgd)
> > +KVM_X86_OP_OPTIONAL(free_private_sp)
> > +KVM_X86_OP_OPTIONAL(handle_changed_private_spte)
> >  KVM_X86_OP(has_wbinvd_exit)
> >  KVM_X86_OP(get_l2_tsc_offset)
> >  KVM_X86_OP(get_l2_tsc_multiplier)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index bfc934dc9a33..f2a4d5a18851 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -440,6 +440,7 @@ struct kvm_mmu {
> >                          struct kvm_mmu_page *sp);
> >          void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
> >          struct kvm_mmu_root_info root;
> > +        hpa_t private_root_hpa;
> >          union kvm_cpu_role cpu_role;
> >          union kvm_mmu_page_role root_role;
> >
> > @@ -1435,6 +1436,20 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
> >          return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> >  }
> >
> > +struct kvm_spte {
> > +        kvm_pfn_t pfn;
> > +        bool is_present;
> > +        bool is_leaf;
> > +};
> > +
> > +struct kvm_spte_change {
> > +        gfn_t gfn;
> > +        enum pg_level level;
> > +        struct kvm_spte old;
> > +        struct kvm_spte new;
> > +        void *sept_page;
> > +};
> > +
> >  struct kvm_x86_ops {
> >          const char *name;
> >
> > @@ -1547,6 +1562,11 @@ struct kvm_x86_ops {
> >          void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> >                               int root_level);
> >
> > +        int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +                               void *private_sp);
> > +        void (*handle_changed_private_spte)(
> > +                struct kvm *kvm, const struct kvm_spte_change *change);
> > +
> >          bool (*has_wbinvd_exit)(void);
> >
> >          u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a5bf3e40e209..ef925722ee28 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1577,7 +1577,11 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >                  flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp);
> >
> >          if (is_tdp_mmu_enabled(kvm))
> > -                flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
> > +                /*
> > +                 * private page needs to be kept and handle page migration
> > +                 * on next EPT violation.
> > +                 */
> > +                flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush, false);
> >
> >          return flush;
> >  }
> > @@ -3082,7 +3086,8 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
> >                   * SPTE value without #VE suppress bit cleared
> >                   * (kvm->arch.shadow_mmio_value = 0).
> >                   */
> > -                if (unlikely(!vcpu->kvm->arch.enable_mmio_caching) ||
> > +                if (unlikely(!vcpu->kvm->arch.enable_mmio_caching &&
> > +                             !kvm_gfn_shared_mask(vcpu->kvm)) ||
> >                      unlikely(fault->gfn > kvm_mmu_max_gfn()))
> >                          return RET_PF_EMULATE;
> >          }
> > @@ -3454,7 +3459,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >                          goto out_unlock;
> >
> >          if (is_tdp_mmu_enabled(vcpu->kvm)) {
> > -                root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > +                if (kvm_gfn_shared_mask(vcpu->kvm) &&
> > +                    !VALID_PAGE(mmu->private_root_hpa)) {
> > +                        root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> > +                        mmu->private_root_hpa = root;
> > +                }
> > +                root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
> >                  mmu->root.hpa = root;
> >          } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> >                  root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> > @@ -4026,6 +4036,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >          kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >
> > +/*
> > + * Private page can't be release on mmu_notifier without losing page contents.
> > + * The help, callback, from backing store is needed to allow page migration.
> > + * For now, pin the page.
> > + */
> > +static int kvm_faultin_pfn_private_mapped(struct kvm_vcpu *vcpu,
> > +                                          struct kvm_page_fault *fault)
> > +{
> > +        hva_t hva = gfn_to_hva_memslot(fault->slot, fault->gfn);
> > +        struct page *page[1];
> > +
> > +        fault->map_writable = false;
> > +        fault->pfn = KVM_PFN_ERR_FAULT;
> > +        if (hva == KVM_HVA_ERR_RO_BAD || hva == KVM_HVA_ERR_BAD)
> > +                return RET_PF_CONTINUE;
> > +
> > +        /* TDX allows only RWX. Read-only isn't supported. */
> > +        WARN_ON_ONCE(!fault->write);
> > +        if (pin_user_pages_fast(hva, 1, FOLL_WRITE, page) != 1)
> > +                return RET_PF_INVALID;
> > +
> > +        fault->map_writable = true;
> > +        fault->pfn = page_to_pfn(page[0]);
> > +        return RET_PF_CONTINUE;
> > +}
> > +
> >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >          struct kvm_memory_slot *slot = fault->slot;
> > @@ -4058,6 +4094,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                  return RET_PF_EMULATE;
> >          }
> >
> > +        if (fault->is_private)
> > +                return kvm_faultin_pfn_private_mapped(vcpu, fault);
> > +
> >          async = false;
> >          fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> >                                            fault->write, &fault->map_writable,
> > @@ -4110,6 +4149,17 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> >                  mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> >  }
> >
> > +void kvm_mmu_release_fault(struct kvm *kvm, struct kvm_page_fault *fault, int r)
> > +{
> > +        if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn))
> > +                return;
> > +
> > +        if (fault->is_private)
> > +                put_page(pfn_to_page(fault->pfn));
>
> pin_user_pages_fast() is used above, which sets FOLL_PIN internally, so
> should we use unpin_user_page() here? FOLL_PIN means the unpin should be
> done by unpin_user_page(), not put_page(); please see
> /Documentation/core-api/pin_user_pages.rst and the comments on FOLL_PIN.

To align with large page support, I'll make it use get_user_pages_fast()
and put_page(), along the lines of the sketch below.
--
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
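A minimal, untested sketch of that direction, for illustration only: it keeps
the flow of the quoted kvm_faultin_pfn_private_mapped() and only swaps
pin_user_pages_fast() for get_user_pages_fast(); the actual follow-up patch
may differ.

/*
 * Illustrative sketch, not the posted patch: take a normal page reference
 * (FOLL_GET semantics) at fault time so the release side can keep using
 * put_page().
 */
static int kvm_faultin_pfn_private_mapped(struct kvm_vcpu *vcpu,
                                          struct kvm_page_fault *fault)
{
        hva_t hva = gfn_to_hva_memslot(fault->slot, fault->gfn);
        struct page *page[1];

        fault->map_writable = false;
        fault->pfn = KVM_PFN_ERR_FAULT;
        if (hva == KVM_HVA_ERR_RO_BAD || hva == KVM_HVA_ERR_BAD)
                return RET_PF_CONTINUE;

        /* TDX allows only RWX. Read-only isn't supported. */
        WARN_ON_ONCE(!fault->write);

        /* get_user_pages_fast() grabs a page reference, not a FOLL_PIN pin. */
        if (get_user_pages_fast(hva, 1, FOLL_WRITE, page) != 1)
                return RET_PF_INVALID;

        fault->map_writable = true;
        fault->pfn = page_to_pfn(page[0]);
        return RET_PF_CONTINUE;
}

With the reference taken by get_user_pages_fast(), the put_page() in
kvm_mmu_release_fault() becomes the matching release, keeping the get/put
pairing that Yuan points out consistent.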