Re: [RFC PATCH 19/32] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization


On Wed, Sep 26, 2018 at 03:19:00PM +1000, David Gibson wrote:
> On Fri, Sep 21, 2018 at 08:01:50PM +1000, Paul Mackerras wrote:
> >  
> > +/* Platform-specific hcalls used for nested HV KVM */
> > +#define H_SET_PARTITION_TABLE	0xF800
> > +#define H_ENTER_NESTED		0xF804
> 
> So, these are in the platform specific hypercall range.  Do we expect
> these to ever be PAPR standardized, or will they always be a
> "vendor-specific" extension?
> 
> If the latter it might be more sensible to put them next to the
> existing KVM/qemu defined hypercalls (e.g. H_RTAS) rather than closer
> to the vendor-specific-but-implemented-by-phyp ones.

Interesting question.  It's possible they could be added to PAPR (and
perhaps even implemented by pHyp one day).  I should ping the PAPR
maintainer and see what he thinks.

> > @@ -24,6 +24,23 @@
> >  #include <asm/bitops.h>
> >  #include <asm/book3s/64/mmu-hash.h>
> >  
> > +/* Structure for a nested guest */
> 
> Might make it easier to read that this represents a nested guest from
> the PoV of the L0 hypervisor, rather than the L1 hypervisor.
> 
> Also, do these exist only in the true L0 host, or in any level of host
> which has guest more than one level below itself?

The latter.  With multiple levels of nesting, each level of hypervisor
has a representation of every guest more than one level beneath it;
each such representation amounts to a struct kvm_nested_guest plus a
shadow page table.
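
For concreteness, that per-nested-guest state looks roughly like the
sketch below.  The parent, refcnt and next fields are the ones discussed
elsewhere in this thread; the remaining field names are only indicative,
not necessarily what the patch ends up with:

    struct kvm_nested_guest {
            struct kvm *parent;             /* L1 VM that owns this nested guest */
            int l1_lpid;                    /* lpid that L1 uses for this guest */
            int shadow_lpid;                /* real lpid assigned by this host */
            pgd_t *shadow_pgtable;          /* shadow page table for this guest */
            long refcnt;                    /* number of users of this struct */
            struct kvm_nested_guest *next;  /* only used when freeing all guests */
    };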

> > +struct kvm_nested_guest {
> > +	struct kvm *parent;		/* L1 VM that owns this nested guest */
> 
> "parent" might not be the best name.  That suggests it represents the
> hypervisor immediately above this nested guest.  But AFAICT, if this
> is a multiply nested guest, then this will be the immediate guest of
> *this* host which indirectly owns the nested guest.  Maybe "l1_host" ?

Well, since every Ln guest is also a guest of Lm for all 0 <= m < n,
L0 effectively only sees L1 and L2: if there is an L3 or L4, it looks
identical to an L2 from L0's point of view.  In other words, L0 does
regard L1 as owning all of the Ln guests under it for n >= 2.  But we
can change the name if you think that would be clearer.

> > +struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int lpid,
> 
> Might be good to rename the 'lpid' parameter to make it clearer if
> this takes the L1 or L0 value of the lpid.

OK.  It's the L1 lpid, i.e. the value that the L1 hypervisor uses to
identify this nested guest.
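
To illustrate, the lookup is keyed by that L1 lpid into the
nested_guests[] array quoted just below.  This is only a sketch of the
fast path (the quoted prototype is truncated, so the 'create' flag and
the creation path here are my assumptions, and the real code has more
locking detail):

    struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
                                              bool create)
    {
            struct kvm_nested_guest *gp;

            if (l1_lpid >= KVM_MAX_NESTED_GUESTS)
                    return NULL;

            spin_lock(&kvm->mmu_lock);
            gp = kvm->arch.nested_guests[l1_lpid];
            if (gp)
                    ++gp->refcnt;
            spin_unlock(&kvm->mmu_lock);

            if (!gp && create) {
                    /* allocate the struct, a real (shadow) lpid and a shadow
                     * page table, then re-take mmu_lock and insert the new
                     * entry into kvm->arch.nested_guests[l1_lpid] */
            }
            return gp;
    }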

> > +	u64 l1_ptcr;
> > +	int max_nested_lpid;
> > +	struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
> 
> kvm_nested_guest includes a next pointer.  Is this intended to be an
> array, a linked list, or something more complex?

The nested_guests[] array is the primary structure; the next pointer is
only used to build a temporary linked list while freeing all of a VM's
nested guests, so that we don't have to drop and re-take the mmu_lock
for each one.
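
The pattern is something like this (just a sketch; apart from
kvmhv_release_nested() and the fields quoted above, the names are
illustrative):

    static void kvmhv_release_all_nested(struct kvm *kvm)
    {
            struct kvm_nested_guest *gp, *freelist = NULL;
            int i;

            /* unhook every guest from the array under one hold of mmu_lock */
            spin_lock(&kvm->mmu_lock);
            for (i = 0; i <= kvm->arch.max_nested_lpid; i++) {
                    gp = kvm->arch.nested_guests[i];
                    if (!gp)
                            continue;
                    kvm->arch.nested_guests[i] = NULL;
                    gp->next = freelist;
                    freelist = gp;
            }
            spin_unlock(&kvm->mmu_lock);

            /* now tear them down without the lock held */
            while ((gp = freelist) != NULL) {
                    freelist = gp->next;
                    kvmhv_release_nested(gp);
            }
    }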

> > +/* Only called when we're not in hypervisor mode */
> > +bool kvmhv_nested_init(void)
> > +{
> > +	long int ptb_order;
> > +	unsigned long ptcr;
> > +	long rc;
> > +
> > +	if (!radix_enabled())
> > +		return false;
> > +
> > +	/* find log base 2 of KVMPPC_NR_LPIDS, rounding up */
> > +	ptb_order = __ilog2(KVMPPC_NR_LPIDS - 1) + 1;
> > +	if (ptb_order < 8)
> > +		ptb_order = 8;
> > +	pseries_partition_tb = kmalloc(sizeof(struct patb_entry) << ptb_order,
> > +				       GFP_KERNEL);
> > +	if (!pseries_partition_tb) {
> > +		pr_err("kvm-hv: failed to allocated nested partition table\n");
> > +		return false;
> > +	}
> 
> Would it make sense to have a knob allowing the L0 to limit how many
> nested guests each L1 can have (rather than just "0" or "some")?  If
> so, would it then also make sense to advertise that to the L1 and have
> it allocate its partition table accordingly?

Maybe?  Probably?  Though then it would have to be a capability with a
minimum required value across a migration domain, so that an L1 guest
running on one host doesn't suddenly die, or have some of its nested
guests die, when it gets migrated.

This patch does already introduce a limit, KVMPPC_NR_LPIDS, which is
currently 1024.  If we ignore POWER7 then we can support up to 4095
guests at L0.
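
To put numbers on that: a partition-table entry is two doublewords
(16 bytes), so with KVMPPC_NR_LPIDS = 1024 the allocation in
kvmhv_nested_init() above comes to

    ptb_order = __ilog2(1024 - 1) + 1;              /* = 9 + 1 = 10 */
    size = sizeof(struct patb_entry) << ptb_order;  /* 16 bytes << 10 = 16kB */

i.e. a single 16kB allocation per L1 hypervisor.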

Let me know if you think we want a user-controllable limit on nested
guests per L1 guest, and if so how that would work in a migration
domain.

> > +
> > +	ptcr = __pa(pseries_partition_tb) | (ptb_order - 8);
> > +	rc = plpar_hcall_norets(H_SET_PARTITION_TABLE, ptcr);
> > +	if (rc != H_SUCCESS) {
> > +		pr_err("kvm-hv: hypervisor does not support nesting (rc=%ld)\n",
> > +		       rc);
> 
> Might want to make it clearer that this is about the hypervisor
> *above* this kernel not supporting nesting, rather than the hypervisor
> *in* this kernel not supporting nesting.

OK.  "our hypervisor..." perhaps?  Or did you have a better wording in
mind?

> > +void kvmhv_put_nested(struct kvm_nested_guest *gp)
> > +{
> > +	struct kvm *kvm = gp->parent;
> > +	long ref;
> > +
> > +	spin_lock(&kvm->mmu_lock);
> > +	ref = --gp->refcnt;
> > +	spin_unlock(&kvm->mmu_lock);
> > +	if (ref == 0)
> > +		kvmhv_release_nested(gp);
> > +}
> > +
> > +struct kvm_nested_guest *kvmhv_find_nested(struct kvm *kvm, int lpid)
> 
> Why do you need this as well as get_nested()?  Plus this is a
> non-static function that hasn't been added to any headers.

It's for use in the TLB invalidation functions that get added later,
and we add "static" at that point.  I could just add the function with
its first use.
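
To make the split explicit, _find_ is roughly just the bare lookup
below (exact body illustrative, and the locking expectation is my
reading of it):

    /* bare lookup: no reference is taken, so the caller is expected to
     * hold kvm->mmu_lock (or otherwise know the guest can't go away) */
    struct kvm_nested_guest *kvmhv_find_nested(struct kvm *kvm, int lpid)
    {
            if (lpid > kvm->arch.max_nested_lpid)
                    return NULL;
            return kvm->arch.nested_guests[lpid];
    }

whereas kvmhv_get_nested() takes the mmu_lock and a reference, which is
dropped later via kvmhv_put_nested().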

Paul.


