Re: [RFC PATCH 19/32] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization

On Wed, Sep 26, 2018 at 09:16:09PM +1000, Paul Mackerras wrote:
> On Wed, Sep 26, 2018 at 03:19:00PM +1000, David Gibson wrote:
> > On Fri, Sep 21, 2018 at 08:01:50PM +1000, Paul Mackerras wrote:
> > >  
> > > +/* Platform-specific hcalls used for nested HV KVM */
> > > +#define H_SET_PARTITION_TABLE	0xF800
> > > +#define H_ENTER_NESTED		0xF804
> > 
> > So, these are in the platform specific hypercall range.  Do we expect
> > these to ever be PAPR standardized, or will they always be a
> > "vendor-specific" extension?
> > 
> > If the latter it might be more sensible to put them next to the
> > existing KVM/qemu defined hypercalls (e.g. H_RTAS) rather than closer
> > to the vendor-specific-but-implemented-by-phyp ones.
> 
> Interesting question.  It's possible they could be added to PAPR (and
> even not-impossibly implemented by pHyp one day, maybe).  I should
> ping the PAPR maintainer and see what he thinks.

Based on our discussion on the call today, on reflection I think
where they are is fine.

> 
> > > @@ -24,6 +24,23 @@
> > >  #include <asm/bitops.h>
> > >  #include <asm/book3s/64/mmu-hash.h>
> > >  
> > > +/* Structure for a nested guest */
> > 
> > Might make it easier to read that this represents a nested guest from
> > the PoV of the L0 hypervisor, rather than the L1 hypervisor.
> > 
> > Also, do these exist only in the true L0 host, or in any level of host
> > which has guests more than one level below itself?
> 
> The latter.  With multiple nesting, each level of hypervisor has a
> representation of every guest underneath it (which amounts to a struct
> kvm_nested_guest and a shadow page table).
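Ok.  So, if I'm reading that right, each level of hypervisor keeps,
per guest visible below it, something roughly like this (only "parent"
and the refcounting are actually visible in this patch; the other
field names are my guesses at what you've described):

	struct kvm_nested_guest {
		struct kvm *parent;	/* L1 VM that owns this nested guest */
		pgd_t *shadow_pgtable;	/* this guest's shadow page table */
		long refcnt;		/* number of users, see get/put below */
		struct kvm_nested_guest *next;	/* free list only, see below */
	};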

> > > +struct kvm_nested_guest {
> > > +	struct kvm *parent;		/* L1 VM that owns this nested guest */
> > 
> > "parent" might not be the best name.  That suggests it represents the
> > hypervisor immediately above this nested guest.  But AFAICT, if this
> > is a multiply nested guest, then this will be the immediate guest of
> > *this* host which indirectly owns the nested guest.  Maybe "l1_host" ?
> 
> Well, since every Ln guest is a guest of Lm for all 0 <= m < n, that
> means that L0 only sees L1 and L2.  If there is an L3 or L4 they look
> identical to L2 from L0's point of view.  In other words, L0 does
> regard L1 as owning all the Ln guests under it for n >= 2.  But we can
> change the name if you think it's clearer.

Ok.  I feel like we need better terminology for "one level below us",
"one level above us" etc. that distinguishes them from absolute levels
like L0, L1 etc.  Good options don't immediately occur to me.
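
So, just to check I have the picture right for, say, L0 -> L1 -> L2 -> L3:

	L0 sees: L1 as its direct guest, plus a kvm_nested_guest (and
	         shadow page table) for each of L2 and L3, both "owned"
	         by L1 from L0's point of view
	L1 sees: L2 as its direct guest, plus a kvm_nested_guest for L3

i.e. everything more than one level down looks the same from any
given level.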

> > > +struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int lpid,
> > 
> > Might be good to rename the 'lpid' parameter to make it clearer if
> > this takes the L1 or L0 value of the lpid.
> 
> OK.  It's the l1 lpid.
> 
> > > +	u64 l1_ptcr;
> > > +	int max_nested_lpid;
> > > +	struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
> > 
> > kvm_nested_guest includes a next pointer.  Is this intended to be an
> > array, a linked list, or something more complex?
> 
> It's a temporary linked list used while freeing all guests to avoid
> dropping and re-taking the mmu_lock.

Ok, maybe a comment on the next pointer saying it's just used for the
free list.
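
For what it's worth, the pattern I'm picturing from your description
is roughly this (just a sketch; the kvm->arch placement of the array
is my guess):

	struct kvm_nested_guest *gp, *freelist = NULL;
	int i;

	spin_lock(&kvm->mmu_lock);
	for (i = 0; i <= kvm->arch.max_nested_lpid; i++) {
		gp = kvm->arch.nested_guests[i];
		if (!gp)
			continue;
		kvm->arch.nested_guests[i] = NULL;
		gp->next = freelist;	/* chain onto the temporary free list */
		freelist = gp;
	}
	spin_unlock(&kvm->mmu_lock);

	while ((gp = freelist)) {	/* now release without mmu_lock held */
		freelist = gp->next;
		kvmhv_release_nested(gp);
	}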

> > > +/* Only called when we're not in hypervisor mode */
> > > +bool kvmhv_nested_init(void)
> > > +{
> > > +	long int ptb_order;
> > > +	unsigned long ptcr;
> > > +	long rc;
> > > +
> > > +	if (!radix_enabled())
> > > +		return false;
> > > +
> > > +	/* find log base 2 of KVMPPC_NR_LPIDS, rounding up */
> > > +	ptb_order = __ilog2(KVMPPC_NR_LPIDS - 1) + 1;
> > > +	if (ptb_order < 8)
> > > +		ptb_order = 8;
> > > +	pseries_partition_tb = kmalloc(sizeof(struct patb_entry) << ptb_order,
> > > +				       GFP_KERNEL);
> > > +	if (!pseries_partition_tb) {
> > > +		pr_err("kvm-hv: failed to allocate nested partition table\n");
> > > +		return false;
> > > +	}
> > 
> > Would it make sense to have a knob allowing the L0 to limit how many
> > nested guests each L1 can have (rather than just "0" or "some")?  If
> > so, would it then also make sense to advertise that to the L1 and have
> > it allocate its partition table accordingly?
> 
> Maybe? Probably? Though then we'd need to make that a capability
> with a minimum required value across a migration domain, so that an
> L1 guest running on one host doesn't suddenly die, or have some of
> its nested guests die, when it gets migrated.
> 
> We do already have a limit introduced in this patch of
> KVMPPC_NR_LPIDS, which is currently 1024.  If we ignore POWER7 then we
> can support up to 4095 guests at L0.
> 
> Let me know if you think we want a user-controllable limit on nested
> guests per L1 guest, and if so how that would work in a migration
> domain.

As discussed on the call, we can look at this as a later refinement.
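
(For what it's worth, I worked through the arithmetic above: with
KVMPPC_NR_LPIDS = 1024, __ilog2(1023) + 1 = 10, so ptb_order = 10 and
the allocation is sizeof(struct patb_entry) << 10 = 16 * 1024 = 16KiB,
assuming patb_entry is still a pair of doublewords.  Since each entry
is 2^4 bytes, the table is 2^(ptb_order + 4) bytes, which makes the
(ptb_order - 8) encoded into the PTCR below line up with the
architected log2(table size) - 12 size field.)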

> > > +
> > > +	ptcr = __pa(pseries_partition_tb) | (ptb_order - 8);
> > > +	rc = plpar_hcall_norets(H_SET_PARTITION_TABLE, ptcr);
> > > +	if (rc != H_SUCCESS) {
> > > +		pr_err("kvm-hv: hypervisor does not support nesting (rc=%ld)\n",
> > > +		       rc);
> > 
> > Might want to make it clearer that this is about the hypervisor
> > *above* this kernel not supporting nesting, rather than the hypervisor
> > *in* this kernel not supporting nesting.
> 
> OK.  "our hypervisor..." perhaps? or what?

Yeah, now I come to think of it, making this clearer is kind of hard.

"Parent hypervisor..." maybe?

> > > +void kvmhv_put_nested(struct kvm_nested_guest *gp)
> > > +{
> > > +	struct kvm *kvm = gp->parent;
> > > +	long ref;
> > > +
> > > +	spin_lock(&kvm->mmu_lock);
> > > +	ref = --gp->refcnt;
> > > +	spin_unlock(&kvm->mmu_lock);
> > > +	if (ref == 0)
> > > +		kvmhv_release_nested(gp);
> > > +}
> > > +
> > > +struct kvm_nested_guest *kvmhv_find_nested(struct kvm *kvm, int lpid)
> > 
> > Why do you need this as well as get_nested()?  Plus this is a
> > non-static function that hasn't been added to any headers.
> 
> It's for use in the TLB invalidation functions that get added later,
> and we add "static" at that point.  I could just add the function with
> its first use.

Sounds like an idea.
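
For reference, I'd guess the find variant ends up as just the bare
lookup with no refcounting, something like (again, the kvm->arch
placement is my guess):

	struct kvm_nested_guest *kvmhv_find_nested(struct kvm *kvm, int lpid)
	{
		if (lpid > kvm->arch.max_nested_lpid)
			return NULL;
		return kvm->arch.nested_guests[lpid];
	}

which presumably only makes sense for callers that already hold a
reference or the mmu_lock.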

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
