On Thu, Aug 2, 2012 at 3:54 PM, Christoffer Dall <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote: > On Thu, Aug 2, 2012 at 3:34 PM, Antonios Motakis > <a.motakis@xxxxxxxxxxxxxxxxxxxxxx> wrote: >> On Thu, Aug 2, 2012 at 11:05 AM, Christoffer Dall >> <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote: >>> >>> On Thu, Aug 2, 2012 at 1:14 AM, Antonios Motakis >>> <a.motakis@xxxxxxxxxxxxxxxxxxxxxx> wrote: >>> > Hello, >>> > >>> > On Thu, Aug 2, 2012 at 12:12 AM, Christoffer Dall >>> > <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote: >>> >> >>> >> On Fri, Jul 27, 2012 at 1:10 AM, Antonios Motakis >>> >> <a.motakis@xxxxxxxxxxxxxxxxxxxxxx> wrote: >>> >> > This patch applies to the v9 patch series of KVM, >>> >> > i.e. commit d5321dceeaccf756755e76b38d8b5905bd99d250 >>> >> > >>> >> > In this patch, we use the Hyp Coprocessor Trap Register >>> >> > (HPCTR) to trap VFP/NEON instructions, and switch the FPU >>> >> > state at that point. After a guest exit, the VFP state is >>> >> > returned to the host. When disabling access to floating point >>> >> > instructions, we also mask FPEXC_EN in order to avoid the guest >>> >> > receiving Undefined instruction exceptions before we have a >>> >> > chance to switch back the floating point state. >>> >> > >>> >> > Initial lazy switching implementation provided by Rusty Russell. >>> >> > >>> >> > In this version of the patch, FPEXC_EN is no longer clobbered for >>> >> > the host; we save FPEXC before we mask it. >>> >> > >>> >> > We are reusing vfp_hard_struct, so now we depend on VFPv3 being >>> >> > enabled in the host kernel. For now, if the kernel has not been >>> >> > built with VFPv3, we don't build the VFP switching code in KVM, >>> >> > however we still trap cp10 and cp11 in order to inject an undefined >>> >> > instruction exception whenever the guest tries to use VFP/NEON. >>> >> > >>> >> > Alternatives to this would be to either require VFPv3 for KVM in >>> >> > the first place, or to implement this as a feature that can be >>> >> > switched >>> >> > on and off. If we don't mind the #ifdefs we can keep the current >>> >> > solution. >>> >> > >>> >> > For now I don't reuse vfp_save_state. If we still want to do that, we >>> >> > would have to change it a bit and also add a vfp_restore_state, as >>> >> > already mentioned by reviewers. However I have noticed that >>> >> > vfp_save_state >>> >> > doesn't clear FPEXC_EX before reading the rest of the registers, as >>> >> > we do according to the ARM ARM p. AppxF-2388. Can someone comment on >>> >> > whether this is a bug in vfp_save_state? >>> >> > >>> >> > Changes since v2: >>> >> > * Fix clobbering the host's FPEXC_EN on entry >>> >> > * Reuse vfp_hard_struct from asm/fpstate.h >>> >> > * Reuse macros from asm/vfpmacros.h (warning: out of date mnemonics) >>> >> > * Disable VFP support if kernel hasn't been cofigured with VFPv3 >>> >> > * Tweak and reuse set_hcptr instead of setting the HCPTR manually >>> >> > >>> >> > Changes since v1: >>> >> > * Lazy switching by Rusty Russell >>> >> > * Lazy switching fix; mask FPEXC_EN so guest processes don't crash >>> >> > * Simpler lazy switching handler >>> >> > * Replaced instruction mnemonics with coprocessor operations so the >>> >> > code can be built on gcc configurations with no full VFP-d32 >>> >> > support >>> >> > * VFP subarchitecture handling fix; disable FPEXC_EX before >>> >> > switching >>> >> > cp11 registers >>> >> > >>> >> > Signed-off-by: Rusty Russell <rusty.russell@xxxxxxxxxx> >>> >> > Signed-off-by: Antonios Motakis <a.motakis@xxxxxxxxxxxxxxxxxxxxxx> >>> >> > --- >>> >> > arch/arm/include/asm/kvm_host.h | 6 +++ >>> >> > arch/arm/kernel/asm-offsets.c | 2 + >>> >> > arch/arm/kvm/interrupts.S | 104 >>> >> > ++++++++++++++++++++++++++++++++++++--- >>> >> > 3 files changed, 104 insertions(+), 8 deletions(-) >>> >> > >>> >> > diff --git a/arch/arm/include/asm/kvm_host.h >>> >> > b/arch/arm/include/asm/kvm_host.h >>> >> > index 0c7e782..63d5cc1 100644 >>> >> > --- a/arch/arm/include/asm/kvm_host.h >>> >> > +++ b/arch/arm/include/asm/kvm_host.h >>> >> > @@ -19,6 +19,8 @@ >>> >> > #ifndef __ARM_KVM_HOST_H__ >>> >> > #define __ARM_KVM_HOST_H__ >>> >> > >>> >> > +#include <asm/fpstate.h> >>> >> > + >>> >> > #define KVM_MAX_VCPUS 4 >>> >> > #define KVM_MEMORY_SLOTS 32 >>> >> > #define KVM_PRIVATE_MEM_SLOTS 4 >>> >> > @@ -103,6 +105,10 @@ struct kvm_vcpu_arch { >>> >> > /* System control coprocessor (cp15) */ >>> >> > u32 cp15[nr_cp15_regs]; >>> >> > >>> >> > + /* Floating point registers (VFP and Advanced SIMD/NEON) */ >>> >> > + struct vfp_hard_struct vfp_guest; >>> >> > + struct vfp_hard_struct vfp_host; >>> >> > + >>> >> > /* Exception Information */ >>> >> > u32 hsr; /* Hyp Syndrom Register */ >>> >> > u32 hdfar; /* Hyp Data Fault Address Register */ >>> >> > diff --git a/arch/arm/kernel/asm-offsets.c >>> >> > b/arch/arm/kernel/asm-offsets.c >>> >> > index 9c76b53..d9ed793 100644 >>> >> > --- a/arch/arm/kernel/asm-offsets.c >>> >> > +++ b/arch/arm/kernel/asm-offsets.c >>> >> > @@ -168,6 +168,8 @@ int main(void) >>> >> > DEFINE(VCPU_TID_URW, offsetof(struct kvm_vcpu, >>> >> > arch.cp15[c13_TID_URW])); >>> >> > DEFINE(VCPU_TID_URO, offsetof(struct kvm_vcpu, >>> >> > arch.cp15[c13_TID_URO])); >>> >> > DEFINE(VCPU_TID_PRIV, offsetof(struct kvm_vcpu, >>> >> > arch.cp15[c13_TID_PRIV])); >>> >> > + DEFINE(VCPU_VFP_GUEST, offsetof(struct kvm_vcpu, >>> >> > arch.vfp_guest)); >>> >> > + DEFINE(VCPU_VFP_HOST, offsetof(struct kvm_vcpu, >>> >> > arch.vfp_host)); >>> >> > DEFINE(VCPU_REGS, offsetof(struct kvm_vcpu, >>> >> > arch.regs)); >>> >> > DEFINE(VCPU_USR_REGS, offsetof(struct kvm_vcpu, >>> >> > arch.regs.usr_regs)); >>> >> > DEFINE(VCPU_SVC_REGS, offsetof(struct kvm_vcpu, >>> >> > arch.regs.svc_regs)); >>> >> > diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S >>> >> > index fd7331c..3159ce7 100644 >>> >> > --- a/arch/arm/kvm/interrupts.S >>> >> > +++ b/arch/arm/kvm/interrupts.S >>> >> > @@ -23,6 +23,7 @@ >>> >> > #include <asm/asm-offsets.h> >>> >> > #include <asm/kvm_asm.h> >>> >> > #include <asm/kvm_arm.h> >>> >> > +#include <asm/vfpmacros.h> >>> >> > >>> >> > #define VCPU_USR_REG(_reg_nr) (VCPU_USR_REGS + (_reg_nr * 4)) >>> >> > #define VCPU_USR_SP (VCPU_USR_REG(13)) >>> >> > @@ -89,6 +90,49 @@ ENTRY(__kvm_flush_vm_context) >>> >> > bx lr >>> >> > ENDPROC(__kvm_flush_vm_context) >>> >> > >>> >> > +/* Clobbers {r2-r7} */ >>> >> > +.macro store_vfp_state vfp_off, vcpup >>> >> > + add r6, \vcpup, \vfp_off >>> >> > + >>> >> > + @ The VFPFMRX and VFPFMXR macros are the VMRS and VMSR >>> >> > instructions >>> >> > + vfpfmrx r2, FPEXC >>> >> >>> >> nit: since they're macros and to be consistent, let's use VFPFMRX >>> >> instead >>> >> >>> > >>> > Ack. >>> > >>> >> >>> >> > + @ Make sure VFP is enabled so we can touch the registers. >>> >> > + orr r7, r2, #FPEXC_EN >>> >> > + vfpfmxr FPEXC, r7 >>> >> > + >>> >> > + vfpfmrx r3, FPSCR >>> >> > + tst r2, #FPEXC_EX @ Check for VFP >>> >> > Subarchitecture >>> >> >>> >> I think you need to first write 1 to the EX bit, which might mess with >>> >> the state, since the host could have written 0 there beforehand, in >>> >> which case it should still read-back as zero. You may want to check >>> >> all this before hand (it doesn't change once we're on certain >>> >> hardware). >>> > >>> > >>> > According to the VFP subarchitecture the EX bit needs to be cleared to >>> > 0, >>> > after saving the subarchitecture information. The version we store is in >>> > r2, >>> > which is the value stored before we meddle with it, which is the one the >>> > guest expects. In practice we can't really test this code path, because >>> > I've >>> > never seen this bit enabled in the Fast Models; the subarchitecture is >>> > maybe >>> > not even implemented. Interestingly the Linux kernel itself does not >>> > handle >>> > this at all in its own context switch code. >>> > >>> >>> eh, on page AppxF-2394 it says: >>> >>> Set FPEXC.EX=1 and FPEXC.FP2V=1 >>> Read back the FPEXC register >>> if FPEXC.EX == 0 then >>> Neither FPINST nor FPINST2 are implemented >>> ... >> >> >> Actually this is not necessary, nor the simplest way to do it. You will use >> this if you want to have two alternative context switch functions, one >> including FPINST and FPINST2, and one not including them. In that case you >> will do something like this at boot to determine which context switch >> function you need to use (see p. AppxF-2388). However the method we use is >> still valid, since if the subarchitecture is not implemented then FPEXC_EX >> will always be zero. This is exactly what the host kernel does; it never >> actually checks if the subarchitecture is implemented, only if there are >> extra registers to save. >> >> Of course we could consider the overhead of pointing to a different context >> switch function each time, vs the branch we use. However this is the same >> approach the host kernel uses, so I have no reason to think it is not >> efficient enough. >> > > ok, as we discussed offline, the point is rather that reads of FPINST > and FPINST2 are unpredictable when FPEXC.EX is not set. Thanks. > >>> >>> >>> >>> >> >>> >> >>> >> You could also make the operations execute conditionally and avoid the >>> >> potential pipeline flushes for mispredicted branches... >>> > >>> > >>> > Ack. We can do the FPINST2 operation conditional, however the first one >>> > skips a number of instructions so maybe it's cleaner to just branch >>> > >>> >> >>> >> >>> >> Additionally you can consider only performing the store of FPINST and >>> >> FPINST2 conditionally as well, not sure about the performance tradeoff >>> >> vs. code size though. >>> >> >>> > >>> > If EX is 0, then we also need to skip checking FP2V. >>> > >>> >>> true >>> >>> >> >>> >> > + beq 1f >>> >> > + vfpfmrx r4, FPINST >>> >> > + tst r2, #FPEXC_FP2V >>> >> > + beq 2f >>> >> > + vfpfmrx r5, FPINST2 >>> >> > +2: >>> >> > + bic r7, r2, #FPEXC_EX @ FPEXC_EX disable >>> >> > + vfpfmxr FPEXC, r7 >>> >> > +1: >>> >> > + vfpfstmia r6, r7 @ Save VFP registers >>> >> > + stm r6, {r2-r5} @ Save FPEXC, FPSCR, FPINST, >>> >> > FPINST2 >>> >> > +.endm >>> >> > + >>> >> > +/* Assume FPEXC_EN is on and FPEXC_EX is off, clobbers {r2-r7} */ >>> >> > +.macro restore_vfp_state vfp_off, vcpup >>> >> > + add r6, \vcpup, \vfp_off >>> >> > + >>> >> > + vfpfldmia r6, r7 @ Load VFP registers >>> >> > + ldm r6, {r2-r5} @ Load FPEXC, FPSCR, FPINST, >>> >> > FPINST2 >>> >> > + >>> >> > + vfpfmxr FPSCR, r3 >>> >> > + tst r2, #FPEXC_EX @ Check for VFP >>> >> > Subarchitecture >>> >> > + beq 1f >>> >> > + vfpfmxr FPINST, r4 >>> >> > + tst r2, #FPEXC_FP2V >>> >> > + beq 1f >>> >> >>> >> again with setting EX bit first here, so you probably want to check >>> >> another pre-stored value, as mentioned above. >>> >> >>> >> also I would like to reduce the number of branches here by using >>> >> conditionals. >>> >> >>> >> > + vfpfmxr FPINST2, r5 >>> >> > +1: >>> >> > + vfpfmxr FPEXC, r2 @ FPEXC (last, in case !EN) >>> >> > +.endm >>> >> > + >>> >> > >>> >> > @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ >>> >> > @ Hypervisor world-switch code >>> >> > >>> >> > @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ >>> >> > @@ -281,16 +325,16 @@ ENDPROC(__kvm_flush_vm_context) >>> >> > .endm >>> >> > >>> >> > /* Configures the HCPTR (Hyp Coprocessor Trap Register) on >>> >> > entry/return >>> >> > - * (hardware reset value is 0) */ >>> >> > -.macro set_hcptr entry >>> >> > + * (hardware reset value is 0). Keep previous value in r2. */ >>> >> > +.macro set_hcptr entry, mask >>> >> > mrc p15, 4, r2, c1, c1, 2 >>> >> > - ldr r3, =(HCPTR_TTA) >>> >> > + ldr r3, =\mask >>> >> > .if \entry == 1 >>> >> > - orr r2, r2, r3 @ Trap some coproc-accesses >>> >> > + orr r3, r2, r3 @ Trap some coproc-accesses >>> >> > .else >>> >> > - bic r2, r2, r3 @ Don't trap any coproc- >>> >> > accesses >>> >> > + bic r3, r2, r3 @ Don't trap any coproc- >>> >> > accesses >>> >> >>> >> the comments should be changed to something like: >>> >> >>> >> Trap coproc-accesses >>> >> >>> >> and >>> >> >>> >> Don't trap coproc accesses >>> >> >>> >> the "any" part is clearly wrong now. >>> > >>> > >>> > Ack. >>> > >>> >> >>> >> >>> >> > .endif >>> >> > - mcr p15, 4, r2, c1, c1, 2 >>> >> > + mcr p15, 4, r3, c1, c1, 2 >>> >> > .endm >>> >> > >>> >> > /* Enable/Disable: stage-2 trans., trap interrupts, trap wfi, trap >>> >> > smc >>> >> > */ >>> >> > @@ -328,6 +372,14 @@ ENTRY(__kvm_vcpu_run) >>> >> > read_cp15_state >>> >> > write_cp15_state 1, r0 >>> >> > >>> >> > +#ifdef CONFIG_VFPv3 >>> >> >>> >> maybe a comment here about the discussion for CONFIG_VFPv3, this >>> >> define seems oddly out of place if you didn't follow the e-mail >>> >> thread. >>> > >>> > >>> > Ack. >>> > >>> >> >>> >> >>> >> > + @ Mask FPEXC_EN so the guest doesn't trap floating point >>> >> > instructions >>> >> >>> >> s/Mask/Set/ ? >>> > >>> > >>> > Ack >>> > >>> >> >>> >> >>> >> >>> >> > + vfpfmrx r2, FPEXC @ VMRS >>> >> > + push {r2} >>> >> > + orr r2, r2, #FPEXC_EN >>> >> > + vfpfmxr FPEXC, r2 @ VMSR >>> >> > +#endif >>> >> > + >>> >> > push {r0} @ Push the VCPU pointer >>> >> > >>> >> > @ Configure Hyp-role >>> >> > @@ -335,7 +387,7 @@ ENTRY(__kvm_vcpu_run) >>> >> > >>> >> > @ Trap coprocessor CRx accesses >>> >> > set_hstr 1 >>> >> > - set_hcptr 1 >>> >> > + set_hcptr 1, (HCPTR_TTA | HCPTR_TCP(10) | HCPTR_TCP(11)) >>> >> > >>> >> > @ Write configured ID register into MIDR alias >>> >> > ldr r1, [r0, #VCPU_MIDR] >>> >> > @@ -395,7 +447,22 @@ __kvm_vcpu_return: >>> >> > >>> >> > @ Don't trap coprocessor accesses for host kernel >>> >> > set_hstr 0 >>> >> > - set_hcptr 0 >>> >> > + set_hcptr 0, (HCPTR_TTA | HCPTR_TCP(10) | HCPTR_TCP(11)) >>> >> > + >>> >> > +#ifdef CONFIG_VFPv3 >>> >> > + @ Save floating point registers we if let guest use them. >>> >> > + tst r2, #(HCPTR_TCP(10) | HCPTR_TCP(11)) >>> >> > + bne after_vfp_restore >>> >> >>> >> nice! >>> >> >>> >> > + >>> >> > + @ Switch VFP/NEON hardware state to the host's >>> >> > + store_vfp_state #VCPU_VFP_GUEST, r1 >>> >> > + restore_vfp_state #VCPU_VFP_HOST, r1 >>> >> > + >>> >> > +after_vfp_restore: >>> >> > + @ Restore FPEXC_EN which we clobbered on entry >>> >> > + pop {r2} >>> >> > + vfpfmxr FPEXC, r2 >>> >> > +#endif >>> >> > >>> >> > @ Reset Hyp-role >>> >> > configure_hyp_role 0, r1 >>> >> > @@ -587,6 +654,10 @@ hyp_hvc: >>> >> > @ Check syndrome register >>> >> > mrc p15, 4, r0, c5, c2, 0 @ HSR >>> >> > lsr r1, r0, #HSR_EC_SHIFT >>> >> > +#ifdef CONFIG_VFPv3 >>> >> > + cmp r1, #HSR_EC_CP_0_13 >>> >> > + beq switch_to_guest_vfp >>> >> > +#endif >>> >> >>> >> if we don't configure the host for VFPv3 and still have the guest use >>> >> VFPv3 (regardless of whether it exists in hardware or not), does the >>> >> flow happen as expected with an undefined exception into the guest >>> >> through handle_exit? If this is as it should be, perhaps a comment is >>> >> warranted? >>> > >>> > >>> > Yes; when VFPv3 is off, then we don't compile any of the new code, >>> > however >>> > we still trap CP10/11 so it will be handled by handled_exit, which leads >>> > to >>> > an Undef exception (or we could emulate it by software in the future, >>> > etc). >>> > I'll add a comment to clear this up. >>> > >>> >>> what would you emulate in software, the VFP operations? Let's not do that. >>> >> >> I was not actually planning to do that, just saying that if VFPv3 is off, >> then the cp10/c11 exits will be handled in the usual way, allowing you to do >> even something like that. But I agree, this is not something we should >> actually do. >> >>> >>> >> >>> >> >>> >> I am afraid, however, that this might no go very well, since the guest >>> >> may examine the faulting instruction and decide that it was an >>> >> undefined exception on some VFP operation and assume it can emulate >>> >> that and that may go into some weird loop? >>> > >>> > >>> > The Media and VFP Feature registers that tell to the guest whether VFP >>> > or >>> > NEON is implemented, are also part of the cp10 coprocessor registers, >>> > which >>> > we trap. So a proper guest should conclude that there is no hardware >>> > floating point support, and should expect an Undef for all VFP >>> > instructions >>> > and work around that. In any case that is how the real hardware would >>> > behave >>> > without floating point support. >>> > >>> >>> what happens if the guest tries to set the FPEXC.EN bit without VFP3 >>> support? Does it get an undefined exception from us and is that >>> equivalent to what would happen on real hardware? >>> >> >> Yep, it will cause an undefined exception. FPEXC is a coprocessor 10 >> register, so this is what we would expect if floating support is missing. >> >>> >>> > Though I don't know if the Linux kernel can fallback to *completely* >>> > emulating the full VFP or NEON instruction set. I think guests that >>> > don't >>> > outright assume hardware floating point are not very common these days. >>> > >>> >>> in any case, not our concern. >>> >> >> Agreed. >> >>> >>> >> >>> >> >>> >> > cmp r1, #HSR_EC_HVC >>> >> > bne guest_trap @ Not HVC instr. >>> >> > >>> >> > @@ -661,6 +732,23 @@ guest_trap: >>> >> > 1: mov r0, #ARM_EXCEPTION_HVC >>> >> > b __kvm_vcpu_return >>> >> > >>> >> > +#ifdef CONFIG_VFPv3 >>> >> > +switch_to_guest_vfp: >>> >> > + ldr r0, [sp, #12] @ Load VCPU pointer >>> >> > + push {r3-r7} >>> >> > + >>> >> > + @ NEON/VFP used. Turn on VFP access. >>> >> > + set_hcptr 0, (HCPTR_TCP(10) | HCPTR_TCP(11)) >>> >> > + >>> >> > + @ Switch VFP/NEON hardware state to the guest's >>> >> > + store_vfp_state #VCPU_VFP_HOST, r0 >>> >> > + restore_vfp_state #VCPU_VFP_GUEST, r0 >>> >> > + >>> >> > + pop {r3-r7} >>> >> > + pop {r0-r2} >>> >> > + eret >>> >> > +#endif >>> >> > + >>> >> > .align >>> >> > hyp_irq: >>> >> > push {r0} >>> >> >>> >> When compiling the v10-stage kernel in ARM mode with this patch >>> >> applied, I >>> >> get: >>> >> >>> >> arch/arm/kvm/interrupts.S: Assembler messages: >>> >> arch/arm/kvm/interrupts.S:691: Error: invalid constant (438) after >>> >> fixup >>> >> arch/arm/kvm/interrupts.S:697: Error: invalid constant (424) after >>> >> fixup >>> >> make[1]: *** [arch/arm/kvm/interrupts.o] Error 1 >>> > >>> > >>> > Ok, I will try to reproduce this on v10-stage. Thanks. >>> > >>> >> >>> >> >>> >> The change is pretty significant to the size and locations of the >>> >> kvm_vcpu_arch struct, so you may want to test for both ARM/Thumb2 >>> >> kernel. >>> >> >>> > >>> > I will also test a Thumb2 kernel then to make sure, thanks for the tip. >>> > >>> >> >>> >> Otherwise, besides my small comments, looks pretty good to me. >>> >> >> >> Applied (with small commenting changes) to v10-stage. Thanks for the great work! _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm