On Wed, 2024-05-15 at 18:48 -0700, Isaku Yamahata wrote: > On Thu, May 16, 2024 at 12:52:32PM +1200, > "Huang, Kai" <kai.huang@xxxxxxxxx> wrote: > > > On 15/05/2024 12:59 pm, Rick Edgecombe wrote: > > > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx> > > > > > > Allocate mirrored page table for the private page table and implement MMU > > > hooks to operate on the private page table. > > > > > > To handle page fault to a private GPA, KVM walks the mirrored page table > > > in > > > unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate > > > changes from the mirrored page table to private page table. > > > > > > private KVM page fault | > > > | | > > > V | > > > private GPA | CPU protected EPTP > > > | | | > > > V | V > > > mirrored PT root | private PT root > > > | | | > > > V | V > > > mirrored PT --hook to propagate-->private PT > > > | | | > > > \--------------------+------\ | > > > | | | > > > | V V > > > | private guest page > > > | > > > | > > > non-encrypted memory | encrypted memory > > > | > > > > > > PT: page table > > > Private PT: the CPU uses it, but it is invisible to KVM. TDX module > > > manages > > > this table to map private guest pages. > > > Mirrored PT:It is visible to KVM, but the CPU doesn't use it. KVM uses it > > > to propagate PT change to the actual private PT. > > > > > > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter) > > > can be modified atomically with mmu_lock held for read, however, the MMU > > > hooks to private page table are not atomical operations. > > > > > > To address it, a special REMOVED_SPTE is introduced and below sequence is > > > used when mirrored SPTEs are updated atomically. > > > > > > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE. > > > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the > > > following steps. > > > 3. Invoke MMU hooks to modify private page table with the target value. > > > 4. (a) On hook succeeds, update mirrored SPTE to target value. > > > (b) On hook failure, restore mirrored SPTE to original value. > > > > > > KVM TDP MMU ensures other threads will not overrite REMOVED_SPTE. > > > > > > This sequence also applies when SPTEs are atomiclly updated from > > > non-present to present in order to prevent potential conflicts when > > > multiple vCPUs attempt to set private SPTEs to a different page size > > > simultaneously, though 4K page size is only supported for private page > > > table currently. > > > > > > 2M page support can be done in future patches. > > > > > > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx> > > > Co-developed-by: Kai Huang <kai.huang@xxxxxxxxx> > > > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx> > > > Co-developed-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> > > > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> > > > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx> > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx> > > > --- > > > TDX MMU Part 1: > > > - Remove unnecessary gfn, access twist in > > > tdp_mmu_map_handle_target_level(). (Chao Gao) > > > - Open code call to kvm_mmu_alloc_private_spt() instead oCf doing it in > > > tdp_mmu_alloc_sp() > > > - Update comment in set_private_spte_present() (Yan) > > > - Open code call to kvm_mmu_init_private_spt() (Yan) > > > - Add comments on TDX MMU hooks (Yan) > > > - Fix various whitespace alignment (Yan) > > > - Remove pointless warnings and conditionals in > > > handle_removed_private_spte() (Yan) > > > - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan) > > > - Remove incorrect comment in handle_changed_spte() (Yan) > > > - Remove unneeded kvm_pfn_to_refcounted_page() and > > > is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan) > > > - Do kvm_gfn_for_root() branchless (Rick) > > > - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick) > > > - Add comment for stripping shared bit for fault.gfn (Chao) > > > > > > v19: > > > - drop CONFIG_KVM_MMU_PRIVATE > > > > > > v18: > > > - Rename freezed => frozen > > > > > > v14 -> v15: > > > - Refined is_private condition check in kvm_tdp_mmu_map(). > > > Add kvm_gfn_shared_mask() check. > > > - catch up for struct kvm_range change > > > --- > > > arch/x86/include/asm/kvm-x86-ops.h | 5 + > > > arch/x86/include/asm/kvm_host.h | 25 +++ > > > arch/x86/kvm/mmu/mmu.c | 13 +- > > > arch/x86/kvm/mmu/mmu_internal.h | 19 +- > > > arch/x86/kvm/mmu/tdp_iter.h | 2 +- > > > arch/x86/kvm/mmu/tdp_mmu.c | 269 +++++++++++++++++++++++++---- > > > arch/x86/kvm/mmu/tdp_mmu.h | 2 +- > > > 7 files changed, 293 insertions(+), 42 deletions(-) > > > > > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h > > > b/arch/x86/include/asm/kvm-x86-ops.h > > > index 566d19b02483..d13cb4b8fce6 100644 > > > --- a/arch/x86/include/asm/kvm-x86-ops.h > > > +++ b/arch/x86/include/asm/kvm-x86-ops.h > > > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr) > > > KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr) > > > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask) > > > KVM_X86_OP(load_mmu_pgd) > > > +KVM_X86_OP_OPTIONAL(link_private_spt) > > > +KVM_X86_OP_OPTIONAL(free_private_spt) > > > +KVM_X86_OP_OPTIONAL(set_private_spte) > > > +KVM_X86_OP_OPTIONAL(remove_private_spte) > > > +KVM_X86_OP_OPTIONAL(zap_private_spte) > > > KVM_X86_OP(has_wbinvd_exit) > > > KVM_X86_OP(get_l2_tsc_offset) > > > KVM_X86_OP(get_l2_tsc_multiplier) > > > diff --git a/arch/x86/include/asm/kvm_host.h > > > b/arch/x86/include/asm/kvm_host.h > > > index d010ca5c7f44..20fa8fa58692 100644 > > > --- a/arch/x86/include/asm/kvm_host.h > > > +++ b/arch/x86/include/asm/kvm_host.h > > > @@ -470,6 +470,7 @@ struct kvm_mmu { > > > int (*sync_spte)(struct kvm_vcpu *vcpu, > > > struct kvm_mmu_page *sp, int i); > > > struct kvm_mmu_root_info root; > > > + hpa_t private_root_hpa; > > > > Should we have > > > > struct kvm_mmu_root_info private_root; > > > > instead? > > Yes. And the private root allocation can be pushed down into TDP MMU. Why? > [snip] > > > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void) > > > void kvm_mmu_destroy(struct kvm_vcpu *vcpu) > > > { > > > kvm_mmu_unload(vcpu); > > > + if (tdp_mmu_enabled) { > > > + read_lock(&vcpu->kvm->mmu_lock); > > > + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu- > > > >private_root_hpa, > > > + NULL); > > > + read_unlock(&vcpu->kvm->mmu_lock); > > > + } > > > > Hmm.. I don't quite like this, but sorry I kinda forgot why we need to to > > this here. > > > > Could you elaborate? > > > > Anyway, from common code's perspective, we need to have some clarification > > why we design to do it here. > > This should be cleaned up. It can be pushed down into > kvm_tdp_mmu_alloc_root(). > > void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) > allocate shared root > if (has_mirrort_pt) > allocate private root > Huh? This is kvm_mmu_destroy()... > > > > free_mmu_pages(&vcpu->arch.root_mmu); > > > free_mmu_pages(&vcpu->arch.guest_mmu); > > > mmu_free_memory_caches(vcpu); > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h > > > b/arch/x86/kvm/mmu/mmu_internal.h > > > index 0f1a9d733d9e..3a7fe9261e23 100644 > > > --- a/arch/x86/kvm/mmu/mmu_internal.h > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h > > > @@ -6,6 +6,8 @@ > > > #include <linux/kvm_host.h> > > > #include <asm/kvm_host.h> > > > +#include "mmu.h" > > > + > > > #ifdef CONFIG_KVM_PROVE_MMU > > > #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x) > > > #else > > > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct > > > kvm_vcpu *vcpu, struct kvm_m > > > sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu- > > > >arch.mmu_private_spt_cache); > > > } > > > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page > > > *root, > > > + gfn_t gfn) > > > +{ > > > + gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn); > > > + > > > + /* Set shared bit if not private */ > > > + gfn_for_root |= -(gfn_t)!is_private_sp(root) & > > > kvm_gfn_shared_mask(kvm); > > > + return gfn_for_root; > > > +} > > > + > > > static inline bool kvm_mmu_page_ad_need_write_protect(struct > > > kvm_mmu_page *sp) > > > { > > > /* > > > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct > > > kvm_vcpu *vcpu, gpa_t cr2_or_gp > > > int r; > > > if (vcpu->arch.mmu->root_role.direct) { > > > - fault.gfn = fault.addr >> PAGE_SHIFT; > > > + /* > > > + * Things like memslots don't understand the concept of a > > > shared > > > + * bit. Strip it so that the GFN can be used like normal, > > > and the > > > + * fault.addr can be used when the shared bit is needed. > > > + */ > > > + fault.gfn = gpa_to_gfn(fault.addr) & > > > ~kvm_gfn_shared_mask(vcpu->kvm); > > > fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn); > > > > Again, I don't think it's nessary for fault.gfn to still have the shared bit > > here? > > > > This kinda usage is pretty much the reason I want to get rid of > > kvm_gfn_shared_mask(). > > We are going to flags like has_mirrored_pt and we have root page table > iterator > with types specified. I'll investigate how we can reduce (or eliminate) > those helper functions. Let's transition the abusers off and see whats left. I'm still waiting for an explanation of why they are bad when uses properly. [snip] > > > > /* The level of the root page given to the iterator */ > > > int root_level; > > > > [...] > > > > > for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) { > > > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct > > > kvm_vcpu *vcpu, > > > new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); > > > else > > > wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter- > > > >gfn, > > > - fault->pfn, iter->old_spte, > > > fault->prefetch, true, > > > - fault->map_writable, &new_spte); > > > + fault->pfn, iter->old_spte, fault- > > > >prefetch, true, > > > + fault->map_writable, &new_spte); > > > if (new_spte == iter->old_spte) > > > ret = RET_PF_SPURIOUS; > > > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct > > > kvm_page_fault *fault) > > > struct kvm *kvm = vcpu->kvm; > > > struct tdp_iter iter; > > > struct kvm_mmu_page *sp; > > > + gfn_t raw_gfn; > > > + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm); > > > > Ditto. I wish we can have 'has_mirrored_private_pt'. > > Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt? Why not helpers that wrap vm_type like: https://lore.kernel.org/kvm/d4c96caffd2633a70a140861d91794cdb54c7655.camel@xxxxxxxxx/