On Thu, Nov 23, 2017 at 06:06:34PM +0000, Dave Martin wrote: > On Wed, Nov 22, 2017 at 08:23:27PM +0100, Christoffer Dall wrote: > > On Fri, Nov 17, 2017 at 04:38:55PM +0000, Dave Martin wrote: > > > This patch is a flattened bunch of hacks for adding SVE support to > > > KVM. It's intended as a starting point for comments: it is not > > > intended to be complete or final! > > > > > > ** This patch has suspected bugs and has undergone minimal testing: do > > > not merge ** > > > > > > Notes: > > > > > > struct kvm_vcpu_arch does not currently include space for a guest's > > > SVE state, so supporting SVE in guests requires some additional > > > space to be allocated. Statically allocating space per-vcpu is > > > wasteful, especially if this allocation is based on the theoretical > > > future arch maximum vector length SVE_VL_MAX. > > > > > > A pointer to dynamically allocated memory would require that memory > > > to be mapped into hyp. Hyp mappings cannot currently be torn down > > > dynamically, so this would result in a mess of kernel heap memory > > > getting mapped into hyp over time. > > > > > > This patch adopts a compromise: enough space is allocated at the > > > end of each kvm_vcpu to store the SVE state, sized according to the > > > maximum vector length supported by the hardware. Naturally, if the > > > hardware does not support SVE, no extra space is allocated at all. > > > > > > Context switching is implemented by adding alternative SVE code at > > > each site where FPSIMD context is handled. SVE is unconditionally > > > provided to the guest if the host supports it. This is a bit > > > crude, but good enough for a proof-of-concept implementation. > > > > > > ZCR_EL1 and ZCR_EL2 are added to the sys_regs list unconditionally, > > > which will break userspace snapshot/restore compatibility. > > > Possibly a more flexible approach is needed. 
The inclusion of > > > ZCR_EL2 here is a bit odd too: this is a feature configuration > > > control rather than a guest register -- it is used to clamp the > > > maximum vector length available to the guest. Currently it is just > > > set by default to correspond to the host's maximum. > > > > > > Questions > > > --------- > > > > > > * Should userspace be able to control the maximum SVE vector > > > length available to the guest, and what's the most natural way > > > to do it? > > > > Yes, we should do this. I think adding an attribute to the VCPU device > > is the most natural way to do this, see: > > > > Documentation/virtual/kvm/devices/vcpu.txt > > I'll take a look. > > > That way, you could read the maximum supported width and set the > > requested width. > > Is there an agreed policy on whether KVM supports heterogeneous compute > clusters? Let me put it this way: I don't think we want to make any decisions now that prevent us from supporting heterogeneous clusters in the future. This is something which is supported in part on x86, and I don't think ARM should do any less. However, I can't say that we've covered our bases with all other existing decisions, but we plan to support this in the future. > > If a vcpu is never expected to run on a node different from the one it > was created on, the ability to limit the max vector length available > to the guest may not be useful. > Even in that case, it could be useful for debug, test, and development. > > > > > > For example, it would be necessary to limit the vector length to > > > the lowest common denominator in order to support migration > > > across a cluster where the maximum hardware vector length > > > differs between nodes. > > > > > > * Combined alternatives are really awkward. Is there any way to > > > use the standard static-key-based feature tests in hyp? 
> > > > Look at has_vhe(), that's a static key that we can use in hyp, and there > > were some lengthy discussions about why that was true in the past. We > > also use the vgic_v3_cpuif_trap etc. static keys, so take a look at what > > we do for those. > > OK, I'll take a look. > > > > TODO > > > ---- > > > > > > * Allow userspace feature control, to choose whether to expose SVE > > > to a guest. > > > > > > * Attempt to port some of the KVM entry code to C, at least for the > > > __fpsimd_guest_restore stuff. The extra complexity with SVE looks > > > unsustainable. > > > > This sounds like a good idea. > > Dang. ;) > > > > > > > * Figure out ABI for exposing SVE regs via the ioctl interface. > > > > > > *Bugs* > > > ------ > > > > > > Currently there is nothing stopping KVM userspace from > > > changing the guest's ZCR_EL2 after boot via the ioctl interface: > > > this breaks architectural assumptions in the guest, and should > > > really be forbidden. Also, this is a latent trigger for > > > buffer overruns, if creation of guests with limited VL is > > > someday permitted. > > > > Userspace shouldn't be able to directly control ZCR_EL2, but ZCR_EL1. > > The length constraint for ZCR_EL2 should be exported via some more > > generic interface (see above) which prevents you from modifying it after > > a vcpu has run (we have other similar uses, see > > vcpu->arch.has_run_once), but ZCR_EL1 should be exported through the > > normal KVM_SET/GET_ONE_REG IOCTL interface, and similar to all other > > registers, if userspace tampers with it while the guest is running, it's > > not going to be pretty for the guest, but shouldn't affect the host. > > Agreed: putting ZCR_EL2 here is an ugly hack, but I think you've > furnished me with better ways of doing this now. 
> > > > > > > Signed-off-by: Dave Martin <Dave.Martin@xxxxxxx> > > > --- > > > arch/arm64/include/asm/fpsimdmacros.h | 8 +++++ > > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++++++++++++ > > > arch/arm64/include/asm/kvm_hyp.h | 4 +++ > > > arch/arm64/include/asm/sysreg.h | 1 + > > > arch/arm64/kernel/asm-offsets.c | 8 +++++ > > > arch/arm64/kernel/entry-fpsimd.S | 1 - > > > arch/arm64/kvm/handle_exit.c | 2 +- > > > arch/arm64/kvm/hyp/entry.S | 60 ++++++++++++++++++++++++++++------- > > > arch/arm64/kvm/hyp/fpsimd.S | 12 +++++++ > > > arch/arm64/kvm/hyp/hyp-entry.S | 7 ++++ > > > arch/arm64/kvm/hyp/switch.c | 46 ++++++++++++++++++++++++++- > > > arch/arm64/kvm/reset.c | 18 +++++++++++ > > > arch/arm64/kvm/sys_regs.c | 39 ++++++++++++++++------- > > > virt/kvm/arm/arm.c | 12 +++++-- > > > 14 files changed, 221 insertions(+), 27 deletions(-) > > > > > > diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h > > > index e050d76..e2124c8 100644 > > > --- a/arch/arm64/include/asm/fpsimdmacros.h > > > +++ b/arch/arm64/include/asm/fpsimdmacros.h > > > @@ -17,6 +17,12 @@ > > > * along with this program. If not, see <http://www.gnu.org/licenses/>. > > > */ > > > > > > +#ifndef __ARM64_FPSIMDMACROS_H > > > +#define __ARM64_FPSIMDMACROS_H > > > + > > > +#include <asm/assembler.h> > > > +#include <asm/sysreg.h> > > > + > > > .macro fpsimd_save state, tmpnr > > > stp q0, q1, [\state, #16 * 0] > > > stp q2, q3, [\state, #16 * 2] > > > @@ -223,3 +229,5 @@ > > > ldr w\nxtmp, [\xpfpsr, #4] > > > msr fpcr, x\nxtmp > > > .endm > > > + > > > +#endif /* ! 
__ARM64_FPSIMDMACROS_H */ > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > > index 674912d..7045682 100644 > > > --- a/arch/arm64/include/asm/kvm_host.h > > > +++ b/arch/arm64/include/asm/kvm_host.h > > > @@ -22,6 +22,7 @@ > > > #ifndef __ARM64_KVM_HOST_H__ > > > #define __ARM64_KVM_HOST_H__ > > > > > > +#include <linux/stddef.h> > > > #include <linux/types.h> > > > #include <linux/kvm_types.h> > > > #include <asm/cpufeature.h> > > > @@ -29,6 +30,7 @@ > > > #include <asm/kvm.h> > > > #include <asm/kvm_asm.h> > > > #include <asm/kvm_mmio.h> > > > +#include <asm/sigcontext.h> > > > > > > #define __KVM_HAVE_ARCH_INTC_INITIALIZED > > > > > > @@ -102,6 +104,8 @@ enum vcpu_sysreg { > > > SCTLR_EL1, /* System Control Register */ > > > ACTLR_EL1, /* Auxiliary Control Register */ > > > CPACR_EL1, /* Coprocessor Access Control */ > > > + ZCR_EL1, /* SVE Control */ > > > + ZCR_EL2, /* SVE Control (enforced by host) */ > > > > eh, this shouldn't be here (I don't suppose you plan on supporting > > nested virtualization support of SVE just yet ?), but should be > > described in the vcpu structure outside of the vcpu_sysreg array, which > > really only contains guest-visible values. > > Agreed, see above. > > > > > > TTBR0_EL1, /* Translation Table Base Register 0 */ > > > TTBR1_EL1, /* Translation Table Base Register 1 */ > > > TCR_EL1, /* Translation Control Register */ > > > @@ -202,6 +206,7 @@ struct kvm_vcpu_arch { > > > /* HYP configuration */ > > > u64 hcr_el2; > > > u32 mdcr_el2; > > > + u32 sve_ffr_offset; /* offset of stored SVE FFR from ctxt base */ > > > > > > /* Exception Information */ > > > struct kvm_vcpu_fault_info fault; > > > @@ -279,6 +284,31 @@ struct kvm_vcpu_arch { > > > bool has_run_once; > > > }; > > > > > > +/* > > > + * The size of SVE state is not known at compile time, so these helper > > > + * macros are used to address state appended to the end of struct > > > + * kvm_vcpu. 
> > > + * > > > + * There is currently no host SVE state, since a vcpu must run inside > > > + * syscall context, and any cached SVE state of a second thread is > > > > second thread? > > "an unrelated host task"? > right, ok. > > > + * explicitly flushed before vcpu entry. > > > > didn't you just change that in the previous patch to be after vcpu exit? > > Err, yes, that should say "after". > > > [...] > > > > diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S > > > index 12ee62d..d8e8d22 100644 > > > --- a/arch/arm64/kvm/hyp/entry.S > > > +++ b/arch/arm64/kvm/hyp/entry.S > > > @@ -162,37 +162,75 @@ ENTRY(__fpsimd_guest_restore) > > > stp x2, x3, [sp, #-16]! > > > stp x4, lr, [sp, #-16]! > > > > > > + mrs x0, cptr_el2 > > > + > > > +alternative_if_not ARM64_SVE > > > + mov x1, #CPTR_EL2_TFP > > > + mov x2, #CPACR_EL1_FPEN > > > +alternative_else > > > + mov x1, #CPTR_EL2_TFP | CPTR_EL2_TZ > > > + mov x2, #CPACR_EL1_FPEN | CPACR_EL1_ZEN > > > +alternative_endif > > > + > > > alternative_if_not ARM64_HAS_VIRT_HOST_EXTN > > > - mrs x2, cptr_el2 > > > - bic x2, x2, #CPTR_EL2_TFP > > > - msr cptr_el2, x2 > > > + bic x0, x0, x1 > > > alternative_else > > > - mrs x2, cpacr_el1 > > > - orr x2, x2, #CPACR_EL1_FPEN > > > - msr cpacr_el1, x2 > > > + orr x0, x0, x2 > > > alternative_endif > > > + > > > + msr cptr_el2, x0 > > > isb > > > > > > - mrs x3, tpidr_el2 > > > + mrs x4, tpidr_el2 > > > > > > - ldr x0, [x3, #VCPU_HOST_CONTEXT] > > > + ldr x0, [x4, #VCPU_HOST_CONTEXT] > > > kern_hyp_va x0 > > > add x0, x0, #CPU_GP_REG_OFFSET(CPU_FP_REGS) > > > bl __fpsimd_save_state > > > > > > - add x2, x3, #VCPU_CONTEXT > > > + add x2, x4, #VCPU_CONTEXT > > > + > > > +#ifdef CONFIG_ARM64_SVE > > > +alternative_if ARM64_SVE > > > + b 2f > > > +alternative_else_nop_endif > > > +#endif > > > + > > > add x0, x2, #CPU_GP_REG_OFFSET(CPU_FP_REGS) > > > bl __fpsimd_restore_state > > > > > > +0: > > > // Skip restoring fpexc32 for AArch64 guests > > > mrs x1, hcr_el2 > > > 
tbnz x1, #HCR_RW_SHIFT, 1f > > > - ldr x4, [x3, #VCPU_FPEXC32_EL2] > > > - msr fpexc32_el2, x4 > > > + ldr x3, [x4, #VCPU_FPEXC32_EL2] > > > + msr fpexc32_el2, x3 > > > 1: > > > ldp x4, lr, [sp], #16 > > > ldp x2, x3, [sp], #16 > > > ldp x0, x1, [sp], #16 > > > > > > eret > > > + > > > +#ifdef CONFIG_ARM64_SVE > > > +2: > > > +alternative_if_not ARM64_HAS_VIRT_HOST_EXTN > > > + ldr x0, [x4, #VCPU_ZCR_EL2] > > > + ldr x3, [x4, #VCPU_ZCR_EL1] > > > + msr_s SYS_ZCR_EL2, x0 > > > +alternative_else > > > + ldr x0, [x4, #VCPU_ZCR_EL1] > > > + ldr x3, [x4, #VCPU_ZCR_EL2] > > > + msr_s SYS_ZCR_EL12, x0 > > > +alternative_endif > > > + > > > + ldr w0, [x4, #VCPU_SVE_FFR_OFFSET] > > > + add x0, x4, x0 > > > + add x1, x2, #VCPU_FPSR - VCPU_CONTEXT > > > + mov x2, x3 > > > + bl __sve_load_state > > > + > > > + b 0b > > > +#endif /* CONFIG_ARM64_SVE */ > > > > I think if we could move all of this out to C code that would be much > > nicer. Also, we currently do the short-circuit fault-in-fpsimd state > > inside hyp and always save the fpsimd state when returning to the host > > kernel, but in my optimization series I change this to leave the state > > in hardware until we schedule or return to userspace, so perhaps we can > > go all the way back to handle_exit() then without much performance loss, > > which may further simplify the SVE support? > > This is what I would be aiming for, coupled with awareness of whether > the host has any live FPSIMD/SVE state in the first place: for SVE > specifically the answer will _usually_ be "no". > > Ideally, we would give a vcpu its own SVE/FPSIMD context on a par with > host tasks and just reuse the fpsimd.c machinery rather than inventing > separate machinery for KVM -- the flipside would be that the hyp code > may need to manipulate thread flags etc. -- I didn't want to rush to > this destination before learning to walk though. 
Fair enough, but then I think the "let's rely on VHE for SVE support" argument comes in, because you can mostly think of hyp code as being the host kernel code. > > For the first iteration of KVM SVE support I would rather not do that > refactoring, to reduce the number of things I'm breaking at the same > time... > On the other hand, if it makes the SVE support code much simpler, it may be easier in the end. It may be worth thinking of a way we can measure the two approaches with basic FPSIMD so that we know the consequences of refactoring anything. > > > > The hit would obviously be a slightly higher latency on the first fault > > in the guest on SVE/fpsimd. > > For SVE it's probably not the end of the world since that's higher > cost to save/restore in the first place. > > For FPSIMD I've less of a feel for it. > Yeah, me neither. We'd have to measure something. > [...] > > > > +static hyp_alternate_select(__fpsimd_guest_save, > > > + __fpsimd_guest_save_nsve, __fpsimd_guest_save_sve, > > > + ARM64_SVE); > > > + > > > > How likely is it that we'll have SVE implementations without VHE? If > > completely unlikely, perhaps it's worth considering just disabling SVE > > on non-VHE systems. > > I did wonder about this, but the architectural position is that such > a configuration is allowed. > > I think that most SVE implementations will have VHE in practice, so > non-VHE might be supported as a later addition if supporting it is > invasive. But I'll see what it looks like after cleanup in line > with your other suggestions. Ok, sounds good. > > [...] 
> > > > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c > > > index 3256b92..b105e54 100644 > > > --- a/arch/arm64/kvm/reset.c > > > +++ b/arch/arm64/kvm/reset.c > > > @@ -26,7 +26,9 @@ > > > > > > #include <kvm/arm_arch_timer.h> > > > > > > +#include <asm/cpufeature.h> > > > #include <asm/cputype.h> > > > +#include <asm/fpsimd.h> > > > #include <asm/ptrace.h> > > > #include <asm/kvm_arm.h> > > > #include <asm/kvm_asm.h> > > > @@ -88,6 +90,22 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext) > > > return r; > > > } > > > > > > +size_t arch_kvm_vcpu_size(void) > > > +{ > > > + unsigned int vq; > > > + struct kvm_vcpu vcpu; /* dummy struct used for size calculation */ > > > + char *p; > > > + > > > + if (!system_supports_sve()) > > > + return sizeof(struct kvm_vcpu); > > > + > > > + BUG_ON(!sve_vl_valid(sve_max_vl)); > > > + vq = sve_vq_from_vl(sve_max_vl); > > > + > > > + p = vcpu_guest_sve_state(&vcpu) + vcpu_guest_sve_size(vq); > > > + return p - (char *)&vcpu; > > > > Why are we doing things this way as opposed to simply having a pointer > > to SVE state in the VCPU structure which we allocate as a separate > > structure when setting up a VM (and we can then choose to only allocate > > this memory if SVE is actually supported for the guest) ? > > > > This all seems very counter-intuitive. > > I disliked the idea of mapping random kernel heap memory into hyp, > but I guess we already do that, since vcpu kmem cache memory may > get recycled for other purposes. > > I guess I was trying to dodge the need to understand anything > about the hyp memory mapping, but as you can see here I failed > anyway. > > > Do you consider mapping random kernel heap into hyp acceptable? Yes, that's pretty much what we do for the kvm_cpu_context_t per CPU, I think. See how we use create_hyp_mappings(). > Ideally we would refcount mappings and remove any pages that > are no longer needed, but that's extra effort and may be best > dealt with separately. 
> We have so far taken the policy that we don't bother. Yes, some extra things may be mapped from hyp code, but everything is mapped from the host anyway, so I don't see a strong security hole here. If we happen to reuse some VA space for something else later on, we simply "overwrite" the mapping as needed. > > > +} > > > + > > > /** > > > * kvm_reset_vcpu - sets core registers and sys_regs to reset value > > > * @vcpu: The VCPU pointer > > > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c > > > index 1830ebc..fb86907 100644 > > > --- a/arch/arm64/kvm/sys_regs.c > > > +++ b/arch/arm64/kvm/sys_regs.c > > > @@ -481,6 +481,29 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) > > > vcpu_sys_reg(vcpu, PMCR_EL0) = val; > > > } > > > > > > +static void reset_zcr_el2(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) > > > +{ > > > + u64 val = 0; > > > + unsigned int vq; > > > + char *p; > > > + > > > + if (system_supports_sve()) { > > > + BUG_ON(!sve_vl_valid(sve_max_vl)); > > > > This seems to be checked more than one place in this patch. Is this not > > something that the kernel can verify somewhere else at boot as opposed > > to all over the place at run time? > > Treat this as debug. It should mostly go away. > > I wanted to sanity-check that SVE is fully probed in the host kernel > before we can get here. I think this is probably true by construction, > but for development purposes I'm being more paranoid. > ok, fair enough. I'll try to be less nit-picky. > > > > > + vq = sve_vq_from_vl(sve_max_vl); > > > + val = vq - 1; > > > + > > > + /* > > > + * It's a bit gross to do this here. But sve_ffr_offset is > > > + * essentially a cached form of ZCR_EL2 for use by the hyp asm > > > + * code, and I don't want to split the logic: > > > + */ > > > > This comment will be more concerning to people reading the code than > > helpful I think. 
> > > > IIUC, the point is that we can't know where the FFRs are because it's > > all dependent on the vector length, so the place where we initialize the > > vector length is where we set the pointer, right? > > Yes-ish. > > > That makes good sense, and thus makes it non-gross, but I wonder how > > this looks once we've parameterized the SVE support. > > This code is really a proof-of-concept hack. What I really meant > here is that I don't like the way the code is currently factored -- > I expect to substantially rewrite it before a non-RFC posting. > > > Essentially, I think you're trying to initialize the per-VCPU KVM support > > for SVE from within a framework designed to initialize the VCPU's state, > > so I would expect that you have reset_zcr_el1 here, looking quite > > different, and no reset_zcr_el2 until you support nested virt with SVE > > for the guest hypervisor, as per my comment on the data structure. > > > > Also, if we can have a structure for the guest SVE state with a pointer > > from the vcpu struct, the FFR pointer can be embedded inside the struct, > > making it slightly more contained. > > Yes, that may work. Or I may find a cleaner way to deduce the pointer > at the appropriate time rather than having to cache an offset. > > The real cause of this is that the code to calculate the offset is C > but I need to use it from asm. If I port the affected parts of the > hyp entry code to C anyway this problem probably disappears. > Sounds good. Thanks, -Christoffer _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm