Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

Kai Huang <kai.huang@xxxxxxxxx> · Sat, 7 Aug 2021 10:00:06 +1200

On Fri, 6 Aug 2021 19:02:39 +0000 Sean Christopherson wrote:
> On Fri, Aug 06, 2021, Kai Huang wrote:
> > On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> > > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > > On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@xxxxxxxxx wrote:
> > > > > From: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> > > > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > >  	sp = kvm_mmu_alloc_page(vcpu, direct);
> > > > >  
> > > > >  	sp->gfn = gfn;
> > > > > +	sp->gfn_stolen_bits = gfn_stolen_bits;
> > > > >  	sp->role = role;
> > > > >  	hlist_add_head(&sp->hash_link, sp_list);
> > > > >  	if (!direct) {
> > > > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > >  	return sp;
> > > > >  }
> > > > 
> > > > 
> > > > Sorry for replying old thread,
> > > 
> > > Ha, one month isn't old, it's barely even mature.
> > > 
> > > > but to me it looks weird to have gfn_stolen_bits
> > > > in 'struct kvm_mmu_page'.  If I understand correctly, above code basically
> > > > means that GFN with different stolen bit will have different 'struct
> > > > kvm_mmu_page', but in the context of this patch, mappings with different
> > > > stolen bits still use the same root,
> > > 
> > > You're conflating "mapping" with "PTE".  The GFN is a per-PTE value.  Yes, there
> > > is a final GFN that is representative of the mapping, but more directly the final
> > > GFN is associated with the leaf PTE.
> > > 
> > > TDX effectively adds the restriction that all PTEs used for a mapping must have
> > > the same shared/private status, so mapping and PTE are somewhat interchangeable
> > > when talking about stolen bits (the shared bit), but in the context of this patch,
> > > the stolen bits are a property of the PTE.
> > 
> > Yes it is a property of PTE, this is the reason that I think it's weird to have
> > stolen bits in 'struct kvm_mmu_page'. Shouldn't stolen bits in 'struct
> > kvm_mmu_page' imply that all PTEs (whether leaf or not) share the same
> > stolen bit?
> 
> No, the stolen bits are the property of the shadow page.  I'm using "PTE" above
> to mean "PTE for this shadow page", not PTEs within the shadow page, if that makes
> sense.

I see.

> 
> > > Back to your statement, it's incorrect.  PTEs (effectively mappings in TDX) with
> > > different stolen bits will _not_ use the same root.  kvm_mmu_get_page() includes
> > > the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> > > stolen bits when looking for an existing shadow page at the target GFN.
> > > 
> > > @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >                 role.quadrant = quadrant;
> > >         }
> > > 
> > > -       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > > +       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> > >         for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > > -               if (sp->gfn != gfn) {
> > > +               if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> > >                         collisions++;
> > >                         continue;
> > >                 }
> > > 
> > 
> > This only works for non-root table, but there's only one single
> > vcpu->arch.mmu->root_hpa, we don't have an array to have one root for each
> > stolen bit, i.e. do a loop in mmu_alloc_direct_roots(), so effectively all
> > stolen bits share one single root.
> 
> Yes, and that's absolutely the required behavior for everything except for TDX
> with its two EPTPs.  E.g. any other implement _must_ reject CR3s that set stolen
> gfn bits.

OK.  I was thinking gfn_stolen_bits for 'struct kvm_mmu_page' for the table
pointed by CR3 should still make sense.

> 
> > > > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > > > page table. 
> > > 
> > > It does make sense, even without a follow-up patch.  In Rick's original series,
> > > stealing a bit for execute-only guest memory, there was only a single root.  And
> > > except for TDX, there can only ever be a single root because the shared EPTP isn't
> > > usable, i.e. there's only the regular/private EPTP.
> > > 
> > > > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > > > the context of TDX, since TDX requires two separate roots for private and
> > > > shared mappings.
> > > 
> > > > So given we cannot tell whether the same root, or different roots should be
> > > > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > > > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > > > for the same GFN, but should use a new role (i.e role.private) to determine.
> > > 
> > > A new role would work, too, but it has the disadvantage of not automagically
> > > working for all uses of stolen bits, e.g. XO support would have to add another
> > > role bit.
> > 
> > For each purpose of particular stolen bit, a new role can be defined.  For
> > instance, in __direct_map(), if you see stolen bit is TDX shared bit, you don't
> > set role.private (otherwise set role.private).  For XO, if you see the stolen
> > bit is XO, you set role.xo.
> > 
> > We already have info of 'gfn_stolen_mask' in vcpu, so we just need to make sure
> > all code paths can find the actual stolen bit based on sp->role and vcpu (I
> > haven't gone through all though, assuming the annoying part is rmap).
> 
> Yes, and I'm not totally against the idea, but I'm also not 100% sold on it either,
> yet...  The idea of a 'private' flag is growing on me.
> 
> If we're treating the shared bit as an attribute bit, which we are, then it's
> effectively an extension of role.access.  Ditto for XO.
> 
> And looking at the code, I think it would be an improvement for TDX, as all of
> the is_private_gfn() calls that operate on a shadow page would be simplified and
> optimized as they wouldn't have to lookup both gfn_stolen_bits and the vcpu->kvm
> mask of the shared bit.
> 
> Actually, the more I think about it, the more I like it.  For TDX, there's no
> risk of increased hash collisions, as we've already done messed up if there's a
> shared vs. private collision.
> 
> And for XO, if it ever makes it way upstream, I think we should flat out disallow
> referencing XO addresses in non-leaf PTEs, i.e. make the XO permission bit reserved
> in non-leaf PTEs.  That would avoid any theoretical problems with the guest doing
> something stupid by polluting all its upper level PxEs with XO.  Collisions would
> be purely limited to the case where the guest is intentionally creating an alternate
> mapping, which should be a rare event (or the guest is comprosied, which is also
> hopefully a rare event).
> 
> 

My main motivation is 'gfn_stolen_bits' doesn't quite make sense for 'struct
kvm_mmu_page' for root, plus it seems it's a little bit redundant at first
glance.

So could we have your final suggestion? :)

Thanks,
-Kai