On Mon, 2024-03-25 at 13:01 -0700, Isaku Yamahata wrote:
> > Also, kvm_tdp_mmu_alloc_root() never returns non-zero, even though
> > mmu_alloc_direct_roots() does.
> > 
> > Probably today, when there is one caller, it makes mmu_alloc_direct_roots()
> > cleaner to just have it return the always-zero value from
> > kvm_tdp_mmu_alloc_root(). Now that there are two calls, I think we should
> > refactor kvm_tdp_mmu_alloc_root() to return void, and have
> > mmu_alloc_direct_roots() return 0 manually in this case.
> > 
> > Or maybe instead change it back to returning an hpa_t, and then
> > kvm_tdp_mmu_alloc_root() can lose the "if (private)" logic at the end too.
> 
> Probably we can make kvm_tdp_mmu_alloc_root() return void instead of always
> returning zero, as a cleanup.

Why is that better than returning an hpa_t, now that we are calling it twice,
for the mirror and shared roots?

> > 
> > >  	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > >  		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > > @@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> > >  		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> > >  			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > > -			gfn_t base = gfn_round_for_level(fault->gfn,
> > > +			gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> > >  							 fault->max_level);
> > > 
> > >  			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> > > @@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > >  	};
> > > 
> > >  	WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> > > +	fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> > >  	fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > > 
> > >  	r = mmu_topup_memory_caches(vcpu, false);
> > > @@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > > 
> > >  	mmu->root.hpa = INVALID_PAGE;
> > >  	mmu->root.pgd = 0;
> > > +	mmu->private_root_hpa = INVALID_PAGE;
> > >  	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > >  		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> > > 
> > > @@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
> > >  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > >  {
> > >  	kvm_mmu_unload(vcpu);
> > > +	if (tdp_mmu_enabled) {
> > > +		write_lock(&vcpu->kvm->mmu_lock);
> > > +		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > > +				   NULL);
> > > +		write_unlock(&vcpu->kvm->mmu_lock);
> > 
> > What is the reason for the special treatment of private_root_hpa here? The
> > rest of the roots are freed in kvm_mmu_unload(). I think it is because we
> > don't want the mirror to get freed during kvm_mmu_reset_context()?
> 
> It reflects that we don't free Secure-EPT pages during runtime, and free them
> when destroying the guest.

Right. It would be great if we could do something like warn on freeing
role.private = 1 SPs during runtime. It could cover several of the cases that
we worry about in other patches.

While looking at how we could do this, I noticed that kvm_arch_vcpu_create()
calls kvm_mmu_destroy() in an error path. So this could end up zapping/freeing
a private root. It should be bad userspace behavior too, I guess. But the
number of edge cases makes me think the case of zapping a private SP while a
guest is running is something that deserves a VM_BUG_ON().

> > 
> > Oof. For the sake of trying to justify the code, I'm trying to keep track
> > of the pros and cons of treating the mirror/private root like a normal one
> > with just a different role bit.
> > 
> > The whole "list of roots" thing seems to date from shadow paging, where it
> > is critical to keep multiple cached shared roots for the different CPU
> > modes of the same shadowed page tables. Today, with non-nested TDP, AFAICT,
> > the only different root is for SMM.
> > I guess since the machinery for managing multiple roots in a list already
> > exists, it makes sense to use it for both.
> > 
> > For TDX there are also only two, but the difference is that things need to
> > be done in special ways for the two roots. You end up with a bunch of loops
> > (for_each_*tdp_mmu_root(), etc.) that essentially process a list of two
> > different roots, but with the inner logic tortured to work for the
> > peculiarities of both private and shared. An easier-to-read alternative
> > could be to open code both cases.
> > 
> > I guess the major benefit is to keep one set of logic for shadow paging,
> > normal TDP and TDX, but it makes the logic a bit difficult to follow for
> > TDX compared to looking at it from the normal guest perspective. So I
> > wonder if making special versions of the TDX root traversing operations
> > might make the code a little easier to follow. I'm not advocating for it
> > at this point, just still working on an opinion. Is there any history
> > around this design point?
> 
> The original desire was to keep the modification contained, and not
> introduce a function for population and zap. With the open coding, do you
> want something like the following? We can try it and compare the outcome.
> 
> For zapping:
> 	if (private) {
> 		__for_each_tdp_mmu_root_yield_safe_private()
> 			private case
> 	} else {
> 		__for_each_tdp_mmu_root_yield_safe()
> 			shared case
> 	}
> 
> For fault:
> 	kvm_tdp_mmu_map()
> 		if (private) {
> 			tdp_mmu_for_each_pte_private(iter, mmu, raw_gfn, raw_gfn + 1)
> 				private case
> 		} else {
> 			tdp_mmu_for_each_pte(iter, mmu, raw_gfn, raw_gfn + 1)
> 				shared case
> 		}

I was wondering about something limited to the operations that iterate over
the roots. So not keeping private_root_hpa in the list of roots, where it has
to be carefully protected from getting zapped or having its gfn adjusted, and
instead open coding the private case in the higher-level zapping operations.
For normal VMs, the private case would be a NOP.
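To make that concrete, here is a rough stand-alone sketch of the shape I mean. This is not the real KVM code: struct kvm, zap_root(), and the roots array are all stand-ins, just to show where the private special casing would live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hpa_t;
#define INVALID_PAGE ((hpa_t)-1)
#define MAX_ROOTS 4

/* Stand-in for struct kvm: shared roots live on the common list, the
 * private (mirror) root is kept separately, off the list. */
struct kvm {
	hpa_t roots[MAX_ROOTS];		/* shared roots */
	size_t nr_roots;
	hpa_t private_root_hpa;		/* INVALID_PAGE for normal VMs */
};

static void zap_root(hpa_t *root)
{
	*root = INVALID_PAGE;		/* pretend to zap and free the root */
}

/* Higher-level zap: iterate the shared roots as today, then open code
 * the private case. For a normal VM the private branch is a NOP. */
static void kvm_zap_all_roots(struct kvm *kvm)
{
	for (size_t i = 0; i < kvm->nr_roots; i++)
		zap_root(&kvm->roots[i]);

	if (kvm->private_root_hpa != INVALID_PAGE)
		zap_root(&kvm->private_root_hpa);
}
```

The point being that the iterator-based loops stay shared-only, and the private special casing sits in one obvious place instead of inside the loop bodies.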
Since kvm_tdp_mmu_map() already grabs private_root_hpa manually, it wouldn't
change in this idea. I don't know how much better it would be, though. I
think you are right that we would have to create them both and compare.
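Going back to the earlier point about warning when a role.private = 1 SP gets freed at runtime, the kind of guard I'm imagining is roughly the below. Again a stand-alone sketch with mocked-up types: warn_on() stands in for a VM_BUG_ON()/KVM_BUG_ON()-style check, and vm_dead stands in for "VM teardown in progress".

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in state: whether the VM is in the middle of being destroyed. */
struct kvm {
	bool vm_dead;
};

/* Stand-in for struct kvm_mmu_page with a role.is_private bit. */
struct kvm_mmu_page {
	bool is_private;
};

static int warn_count;	/* counts guard hits, in place of a kernel WARN */

static void warn_on(bool cond)
{
	if (cond)
		warn_count++;
}

/* Free-path guard: private (mirror) SPs are only ever supposed to be
 * freed when the guest is being torn down, never at runtime. */
static void tdp_mmu_free_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	warn_on(sp->is_private && !kvm->vm_dead);
	/* ... actual freeing of the SP would go here ... */
}
```

An error path like the kvm_arch_vcpu_create() one mentioned above would trip this guard unless the teardown state is set first, which is exactly the kind of edge case it is meant to flush out.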