Re: [PATCH v7 23/27] KVM: arm64/sve: Add pseudo-register for the guest's vector lengths

Dave Martin <Dave.Martin@xxxxxxx> · Fri, 5 Apr 2019 13:54:13 +0100

On Fri, Apr 05, 2019 at 12:14:51PM +0200, Andrew Jones wrote:
> On Fri, Apr 05, 2019 at 10:36:03AM +0100, Dave Martin wrote:
> > On Thu, Apr 04, 2019 at 10:18:54PM +0200, Andrew Jones wrote:
> > > On Fri, Mar 29, 2019 at 01:00:48PM +0000, Dave Martin wrote:
> > > > This patch adds a new pseudo-register KVM_REG_ARM64_SVE_VLS to
> > > > allow userspace to set and query the set of vector lengths visible
> > > > to the guest.
> > > > 
> > > > In the future, multiple register slices per SVE register may be
> > > > visible through the ioctl interface.  Once the set of slices has
> > > > been determined we would not be able to allow the vector length set
> > > > to be changed any more, in order to avoid userspace seeing
> > > > inconsistent sets of registers.  For this reason, this patch adds
> > > > support for explicit finalization of the SVE configuration via the
> > > > KVM_ARM_VCPU_FINALIZE ioctl.
> > > > 
> > > > Finalization is the proper place to allocate the SVE register state
> > > > storage in vcpu->arch.sve_state, so this patch adds that as
> > > > appropriate.  The data is freed via kvm_arch_vcpu_uninit(), which
> > > > was previously a no-op on arm64.
> > > > 
> > > > To simplify the logic for determining what vector lengths can be
> > > > supported, some code is added to KVM init to work this out, in the
> > > > kvm_arm_init_arch_resources() hook.
> > > > 
> > > > The KVM_REG_ARM64_SVE_VLS pseudo-register is not exposed yet.
> > > > Subsequent patches will allow SVE to be turned on for guest vcpus,
> > > > making it visible.
> > > > 
> > > > Signed-off-by: Dave Martin <Dave.Martin@xxxxxxx>
> > > > Reviewed-by: Julien Thierry <julien.thierry@xxxxxxx>
> > > > Tested-by: zhang.lei <zhang.lei@xxxxxxxxxxxxxx>
> > > > 
> > > > ---
> > 
> > [...]
> > 
> > > > diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> > > > index 2aa80a5..086ab05 100644
> > > > --- a/arch/arm64/kvm/guest.c
> > > > +++ b/arch/arm64/kvm/guest.c
> > > > @@ -206,6 +206,73 @@ static int set_core_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
> > > >  	return err;
> > > >  }
> > > >  
> > > > +#define vq_word(vq) (((vq) - SVE_VQ_MIN) / 64)
> > > > +#define vq_mask(vq) ((u64)1 << ((vq) - SVE_VQ_MIN) % 64)
> > > > +
> > > > +static bool vq_present(
> > > > +	const u64 (*const vqs)[DIV_ROUND_UP(SVE_VQ_MAX - SVE_VQ_MIN + 1, 64)],
> > > > +	unsigned int vq)
> > > > +{
> > > > +	return (*vqs)[vq_word(vq)] & vq_mask(vq);
> > > > +}
> > > > +
> > > > +static int get_sve_vls(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
> > > > +{
> > > > +	unsigned int max_vq, vq;
> > > > +	u64 vqs[DIV_ROUND_UP(SVE_VQ_MAX - SVE_VQ_MIN + 1, 64)];
> > > > +
> > > > +	if (WARN_ON(!sve_vl_valid(vcpu->arch.sve_max_vl)))
> > > > +		return -EINVAL;
> > > > +
> > > > +	memset(vqs, 0, sizeof(vqs));
> > > > +
> > > > +	max_vq = sve_vq_from_vl(vcpu->arch.sve_max_vl);
> > > > +	for (vq = SVE_VQ_MIN; vq <= max_vq; ++vq)
> > > > +		if (sve_vq_available(vq))
> > > > +			vqs[vq_word(vq)] |= vq_mask(vq);
> > > > +
> > > > +	if (copy_to_user((void __user *)reg->addr, vqs, sizeof(vqs)))
> > > > +		return -EFAULT;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int set_sve_vls(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
> > > > +{
> > > > +	unsigned int max_vq, vq;
> > > > +	u64 vqs[DIV_ROUND_UP(SVE_VQ_MAX - SVE_VQ_MIN + 1, 64)];
> > > 
> > > There are three of these long 'DIV_ROUND_UP(SVE_VQ_MAX - SVE_VQ_MIN + 1, 64)'.
> > > A macro is probably warranted.
> > 
> > Fair point.  These are a bit cumbersome.  How about:
> > 
> > #define SVE_VLS_WORDS DIV_ROUND_UP(SVE_VQ_MAX - SVE_VQ_MIN + 1, 64)
> > 
> > Annoyingly, this is virtually identical to a Linux bitmap: the base type
> > is u64 instead of unsigned long; otherwise there's no intentional
> > difference.
> 
> Yeah, I noticed it was a reinvented bitmap, but I liked that it was,
> because, for the userspace api, removing the bitmap abstractions
> ensures we know what is what and where exactly it is.

Agreed.  I thought of using the bitmap API inside the kernel, but even
if this is arm64-specific code, it made me uneasy: the compiler doesn't
necessarily think that u64 and unsigned long are the same type even if
they're both 64-bit, so there's the potential for weird optimisations to
happen.

> > (Aside: do you know why the KVM API insists on everything being u64?
> > This makes sense for non-native types (like guest registers), but not
> > for native things like host userspace addresses etc.)
> 
> Would that work when the kernel is 64-bit and the userspace is 32? I
> personally like the interfaces being explicit with type size, as it
> removes the brain burden of translating long and int when moving arch
> to arch and kernel to userspace.

Ah yes, the joys of compat.  And ioctl.  Ignore me.

> > > > +
> > > > +	if (kvm_arm_vcpu_sve_finalized(vcpu))
> > > > +		return -EPERM; /* too late! */
> > > > +
> > > > +	if (WARN_ON(vcpu->arch.sve_state))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (copy_from_user(vqs, (const void __user *)reg->addr, sizeof(vqs)))
> > > > +		return -EFAULT;
> > > > +
> > > > +	max_vq = 0;
> > > > +	for (vq = SVE_VQ_MIN; vq <= SVE_VQ_MAX; ++vq)
> > > > +		if (vq_present(&vqs, vq))
> > > > +			max_vq = vq;
> > > > +
> > > > +	if (max_vq > sve_vq_from_vl(kvm_sve_max_vl))
> > > > +		return -EINVAL;
> > > > +
> > > > +	for (vq = SVE_VQ_MIN; vq <= max_vq; ++vq)
> > > > +		if (vq_present(&vqs, vq) != sve_vq_available(vq))
> > > > +			return -EINVAL;
> > > 
> > > What about supporting the optional non-power-of-2 vector lengths? For
> > > example, if the maximum VL is 512, then in addition to 512, 128, and
> > > 256 there could be a 384. If we plan to start a guest on a machine
> > > that has all four and then migrate it to a machine that only has
> > > 128,256,512 we would want to set the VL set to 128,256,512, even on
> > > the machine that supports 384. If I understand the above test correctly,
> > > then that wouldn't work.
> > 
> > This is exactly what the code is trying to forbid, though it may not be
> > obvious here why.
> > 
> > The trouble is, we can't correctly emulate a vcpu supporting
> > {128,256,512} on hardware that also supports 384-bit vectors: the
> > architecture says that the vector length you get is determined by
> > rounding ZCR_EL1.LEN down to the next vector length actually
> > supported.
> > 
> > So, the guest would expect writing ZCR_EL1.LEN = 2 to give a vector
> > length of 256 bits, whereas on this hardware the actual resulting
> > vector length would be 384 bits.
> 
> Oh, I see.
> 
> > 
> > sve_probe_vqs() relies on this property to detect the set of available
> > vector lengths.  Simple, bare-metal guests might also just set
> > ZCR_ELx.LEN = 0x1ff to just get the max available.
> > 
> > 
> > The architecture says categorically that the set of vector lengths
> > supported is a fixed property of the (emulated) hardware -- we can't
> > having this changing under the guest's feet.
> > 
> > Fixing this would require an archietctural way to filter out individual
> > vector lengths from the supported set, not just a way to clamp the
> > maximum (which is all ZCR_EL2.LEN gives us).
> > 
> > The general expectation is that sanely designed cluster will be
> > "homogeneous enough" and won't trigger this check -- it's here
> > just in case.
> 
> Data centers evolve as they expand and hardware is replaced. I wouldn't
> expect them to stay homogeneous for long, not without keeping them that
> way at substantial cost. This isn't the first thing that is forcing Arm
> virt data centers to be homogeneous (we still use qemu's '-cpu host'),
> but it's disappointing to have yet another one.

This may or may not be a problem in practive, since I suspect
implementations that support non-power-of-two vector lengths may be in
the minority anyway.

The ARM Fast Model and probably QEMU can do it, but people are less
likely to try to build a high-performance cluster out of those...

> > I attempt to capture this in api.txt, leaving the door open in case the
> > architecture gives a way to support this in future:
> > 
> >   | KVM_REG_ARM64_SVE_VLS
> >   | 
> >   | [...]
> >   | 
> >   | Apart from simply removing all vector lengths from the host set
> >   | that exceed some value, support for arbitrarily chosen sets of
> >   | vector lengths is hardware-dependent and may not be available.
> >   | Attempting to configure an invalid set of vector lengths via
> >   | KVM_SET_ONE_REG will fail with EINVAL.
> > 
> > Is this enough, or do you think more explanation is needed somewhere?
> 
> I saw that before, so I guess more is needed :) What you wrote above about
> how we can only shorten the list due to how zcr_el1.len works was the
> missing piece for me. Maybe that should be integrated into the text
> somehow.

If you think the above is enough for ABI documentation purposes, I will
aim to drop the following comment into set_sve_vls():

	/*
	 * Vector lengths supported by the host can't currently be
	 * hidden from the guest individually: instead we can only set a
	 * maxmium via ZCR_EL2.LEN.  So, make sure the available vector
	 * length match the set requested exactly up to the requested
	 * maximum:
	 */
	for (vq = SVE_VQ_MIN; vq <= max_vq; ++vq)
		if (vq_present(&vqs, vq) != sve_vq_available(vq))
			return -EINVAL;

Do you think that's enough?

Cheers
---Dave
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm