2017-11-14 13:12 GMT+08:00 Rik van Riel <riel@xxxxxxxxxx>:
> Currently, every time a VCPU is scheduled out, the host kernel will
> first save the guest FPU/xstate context, then load the qemu userspace
> FPU context, only to then immediately save the qemu userspace FPU
> context back to memory. When scheduling in a VCPU, the same extraneous
> FPU loads and saves are done.
>
> This could be avoided by moving from a model where the guest FPU is
> loaded and stored with preemption disabled, to a model where the
> qemu userspace FPU is swapped out for the guest FPU context for
> the duration of the KVM_RUN ioctl.
>
> This is done under the VCPU mutex, which is also taken when other
> tasks inspect the VCPU FPU context, so the code should already be
> safe for this change. That should come as no surprise, given that
> s390 already has this optimization.
>
> No performance changes were detected in quick ping-pong tests on
> my 4 socket system, which is expected since an FPU+xstate load is
> on the order of 0.1us, while ping-ponging between CPUs is on the
> order of 20us, and somewhat noisy.
>
> There may be other tests where performance changes are noticeable.

kvm/queue hits the following splat:

[ 650.866212] Bad FPU state detected at kvm_put_guest_fpu+0x7d/0x210 [kvm], reinitializing FPU registers.
[ 650.866232] ------------[ cut here ]------------
[ 650.866241] WARNING: CPU: 2 PID: 2583 at arch/x86/mm/extable.c:103 ex_handler_fprestore+0x5f/0x70
[ 650.866473] libahci wmi hid pinctrl_sunrisepoint video pinctrl_intel
[ 650.866496] CPU: 2 PID: 2583 Comm: qemu-system-x86 Not tainted 4.14.0+ #7
[ 650.866500] Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
[ 650.866503] task: ffff97a095a28000 task.stack: ffffa71c8585c000
[ 650.866509] RIP: 0010:ex_handler_fprestore+0x5f/0x70
[ 650.866512] RSP: 0018:ffffa71c8585fc28 EFLAGS: 00010282
[ 650.866519] RAX: 000000000000005b RBX: ffffa71c8585fc68 RCX: 0000000000000006
[ 650.866522] RDX: 0000000000000000 RSI: ffffffffb4d35333 RDI: 0000000000000282
[ 650.866526] RBP: 000000000000000d R08: 00000000fddae359 R09: 0000000000000000
[ 650.866529] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 650.866532] R13: 0000000000000000 R14: ffff97a095a30000 R15: 000055824b58e280
[ 650.866536] FS: 00007f6f8f22c700(0000) GS:ffff97a09ca00000(0000) knlGS:0000000000000000
[ 650.866540] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 650.866543] CR2: 00007f6f993f3000 CR3: 00000003d4aae005 CR4: 00000000003626e0
[ 650.866547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 650.866550] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 650.866554] Call Trace:
[ 650.866559] fixup_exception+0x32/0x40
[ 650.866567] do_general_protection+0xa0/0x1b0
[ 650.866574] general_protection+0x22/0x30
[ 650.866595] RIP: 0010:kvm_put_guest_fpu+0x7d/0x210 [kvm]
[ 650.866599] RSP: 0018:ffffa71c8585fd18 EFLAGS: 00010246
[ 650.866605] RAX: 00000000ffffffff RBX: ffff97a095a30000 RCX: 0000000000000001
[ 650.866608] RDX: 00000000ffffffff RSI: 00000000f7d5d46a RDI: ffff97a095a30b80
[ 650.866611] RBP: 0000000000000000 R08: 00000000fddae359 R09: ffff97a095a28968
[ 650.866615] R10: 0000000000000000 R11: 00000000e8d39b88 R12: ffff97a095a31bc0
[ 650.866618] R13: 0000000000000000 R14: ffff97a095a30000 R15: 000055824b58e280
[ 650.866650] ? kvm_put_guest_fpu+0x27/0x210 [kvm]
[ 650.866670] kvm_vcpu_reset+0x1be/0x250 [kvm]
[ 650.866689] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
[ 650.866707] kvm_vm_ioctl+0x31a/0x820 [kvm]
[ 650.866712] ? __lock_acquire+0x809/0x1410
[ 650.866721] ? __lock_acquire+0x809/0x1410
[ 650.866734] do_vfs_ioctl+0x9f/0x6c0
[ 650.866743] ? __fget+0x108/0x1f0
[ 650.866752] SyS_ioctl+0x74/0x80
[ 650.866757] ? do_syscall_64+0xc4/0x3d0
[ 650.866764] do_syscall_64+0x8a/0x3d0
[ 650.866769] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 650.866781] entry_SYSCALL64_slow_path+0x25/0x25
[ 650.866785] RIP: 0033:0x7f6f973a0f07
[ 650.866788] RSP: 002b:00007f6f8f22b968 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 650.866795] RAX: ffffffffffffffda RBX: 000000000000ae41 RCX: 00007f6f973a0f07
[ 650.866798] RDX: 0000000000000000 RSI: 000000000000ae41 RDI: 000000000000000b
[ 650.866802] RBP: 0000000000000000 R08: 0000558248e26a40 R09: 000055824b58e280
[ 650.866805] R10: 0000558249593f40 R11: 0000000000000246 R12: 000055824b55ec90
[ 650.866808] R13: 00007ffd274d79ff R14: 00007f6f8f22c9c0 R15: 000055824b58e280
[ 650.867014] ---[ end trace 2c5d6cfaba0ee1b3 ]---

Regards,
Wanpeng Li

>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
> Suggested-by: Christian Borntraeger <borntraeger@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h | 13 +++++++++++++
>  arch/x86/kvm/x86.c              | 29 ++++++++++++-----------------
>  2 files changed, 25 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c73e493adf07..92e66685249e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu_memory_cache mmu_page_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> +	/*
> +	 * QEMU userspace and the guest each have their own FPU state.
> +	 * In vcpu_run, we switch between the user and guest FPU contexts.
> +	 * While running a VCPU, the VCPU thread will have the guest FPU
> +	 * context.
> +	 *
> +	 * Note that while the PKRU state lives inside the fpu registers,
> +	 * it is switched out separately at VMENTER and VMEXIT time.
> +	 * The "guest_fpu" state here contains the guest FPU context, with the
> +	 * host PKRU bits.
> +	 */
> +	struct fpu user_fpu;
>  	struct fpu guest_fpu;
> +
>  	u64 xcr0;
>  	u64 guest_supported_xcr0;
>  	u32 guest_xstate_size;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03869eb7fcd6..59912b20a830 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2917,7 +2917,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>  	pagefault_enable();
>  	kvm_x86_ops->vcpu_put(vcpu);
> -	kvm_put_guest_fpu(vcpu);
>  	vcpu->arch.last_host_tsc = rdtsc();
>  }
>
> @@ -6908,7 +6907,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	preempt_disable();
>
>  	kvm_x86_ops->prepare_guest_switch(vcpu);
> -	kvm_load_guest_fpu(vcpu);
>
>  	/*
>  	 * Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
> @@ -7095,6 +7093,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>
>  	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>
> +	kvm_load_guest_fpu(vcpu);
> +
>  	for (;;) {
>  		if (kvm_vcpu_running(vcpu)) {
>  			r = vcpu_enter_guest(vcpu);
> @@ -7132,6 +7132,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  		}
>  	}
>
> +	kvm_put_guest_fpu(vcpu);
> +
>  	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>
>  	return r;
> @@ -7663,32 +7665,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
>  	vcpu->arch.cr0 |= X86_CR0_ET;
>  }
>
> +/* Swap (qemu) user FPU context for the guest FPU context. */
>  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu->guest_fpu_loaded)
> -		return;
> -
> -	/*
> -	 * Restore all possible states in the guest,
> -	 * and assume host would use all available bits.
> -	 * Guest xcr0 would be loaded later.
> -	 */
> -	vcpu->guest_fpu_loaded = 1;
> -	__kernel_fpu_begin();
> +	preempt_disable();
> +	copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
>  	/* PKRU is separately restored in kvm_x86_ops->run. */
>  	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
>  				~XFEATURE_MASK_PKRU);
> +	preempt_enable();
>  	trace_kvm_fpu(1);
>  }
>
> +/* When vcpu_run ends, restore user space FPU context. */
>  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>  {
> -	if (!vcpu->guest_fpu_loaded)
> -		return;
> -
> -	vcpu->guest_fpu_loaded = 0;
> +	preempt_disable();
>  	copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
> -	__kernel_fpu_end();
> +	copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
> +	preempt_enable();
>  	++vcpu->stat.fpu_reload;
>  	trace_kvm_fpu(0);
>  }
>
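One plausible reading of the splat (an interpretation, not a confirmed diagnosis): the user-space RSI/RBX value 0xae41 is KVM_CREATE_VCPU, and the call trace reaches kvm_put_guest_fpu() from kvm_vcpu_reset() via kvm_arch_vcpu_setup(). That means the put runs before any vcpu_run() has called kvm_load_guest_fpu(). The patch removes the guest_fpu_loaded check that used to turn such an unpaired put into a no-op, so the put path would restore a user_fpu image that was never saved. The user-space C sketch below models only that load/put pairing; toy_vcpu, toy_load_guest_fpu(), toy_put_guest_fpu() and the other names are hypothetical, and an invalid restore is reported as a false return rather than an XRSTOR fault.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a saved FPU image (not the kernel's struct fpu). */
struct toy_fpstate {
	bool valid;	/* set once something has been saved into the image */
	int value;	/* stand-in for the register contents */
};

/* Hypothetical stand-in for the FPU-related parts of struct kvm_vcpu. */
struct toy_vcpu {
	int regs;			/* live FPU registers of this CPU */
	struct toy_fpstate user_fpu;	/* qemu's context while the guest runs */
	struct toy_fpstate guest_fpu;	/* guest context while qemu runs */
	bool guest_fpu_loaded;		/* like the flag the patch removes */
	bool use_guard;			/* model with (true) or without (false) it */
};

static void toy_save(struct toy_fpstate *img, int regs)
{
	img->value = regs;
	img->valid = true;
}

/* Returns false where a real restore would fault on a bad image. */
static bool toy_restore(int *regs, const struct toy_fpstate *img)
{
	if (!img->valid)
		return false;
	*regs = img->value;
	return true;
}

/* Swap the user FPU context for the guest context (patched behaviour). */
static bool toy_load_guest_fpu(struct toy_vcpu *v)
{
	if (v->use_guard && v->guest_fpu_loaded)
		return true;		/* already loaded: nothing to do */
	v->guest_fpu_loaded = true;
	toy_save(&v->user_fpu, v->regs);
	return toy_restore(&v->regs, &v->guest_fpu);
}

/* Swap the guest FPU context back out for the user context. */
static bool toy_put_guest_fpu(struct toy_vcpu *v)
{
	if (v->use_guard && !v->guest_fpu_loaded)
		return true;		/* never loaded: unpaired put is a no-op */
	v->guest_fpu_loaded = false;
	toy_save(&v->guest_fpu, v->regs);
	return toy_restore(&v->regs, &v->user_fpu);
}

/* Balanced load/put, as vcpu_run() does after the patch. */
bool toy_run_path(bool use_guard)
{
	struct toy_vcpu v = {
		.regs = 42,					/* qemu's FPU state */
		.guest_fpu = { .valid = true, .value = 7 },	/* assumed initialized */
		.use_guard = use_guard,
	};
	bool ok = toy_load_guest_fpu(&v);	/* regs now hold guest state */
	v.regs = 8;				/* the guest changes its FPU state */
	ok = ok && toy_put_guest_fpu(&v);
	return ok && v.regs == 42 && v.guest_fpu.value == 8;
}

/* Unpaired put, like the one reached from kvm_vcpu_reset() in the trace. */
bool toy_reset_path(bool use_guard)
{
	struct toy_vcpu v = {
		.regs = 42,
		.guest_fpu = { .valid = true, .value = 7 },
		.use_guard = use_guard,
	};
	return toy_put_guest_fpu(&v);
}
```

In this model toy_run_path() succeeds with or without the guard, while toy_reset_path(false) fails: the unpaired put restores a never-saved user image and, as a side effect, overwrites guest_fpu with whatever state was live. Keeping a loaded flag or making sure every put is paired with a load would both restore the invariant; the model does not say which fix suits KVM better.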
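The save/load sequence described at the top of the patch can be made concrete with a small counting model. This is a simplified sketch under stated assumptions: the schedule-out steps follow the description literally, the schedule-in steps assume the mirror image ("the same extraneous FPU loads and saves"), and fpu_sim, old_sched_out() and the other names are hypothetical. No real FPU state is touched; the code only tallies operations and tracks whose context is live.

```c
#include <assert.h>

/* Whose FPU state is currently live in the (simulated) registers. */
enum fpu_owner { OWNER_USER, OWNER_GUEST };

/* Hypothetical counter model: it tallies operations, it moves no data. */
struct fpu_sim {
	enum fpu_owner live;
	int saves;
	int loads;
};

static void sim_save(struct fpu_sim *s)
{
	s->saves++;
}

static void sim_load(struct fpu_sim *s, enum fpu_owner who)
{
	s->loads++;
	s->live = who;
}

/* Old model: the guest FPU is only loaded inside vcpu_enter_guest(). */
void old_sched_out(struct fpu_sim *s)
{
	sim_save(s);			/* save the guest FPU/xstate context */
	sim_load(s, OWNER_USER);	/* load qemu's userspace FPU context */
	sim_save(s);			/* context switch saves it straight back */
}

void old_sched_in(struct fpu_sim *s)
{
	sim_load(s, OWNER_USER);	/* context switch restores qemu's context */
	sim_save(s);			/* saved again before the guest is loaded */
	sim_load(s, OWNER_GUEST);	/* load the guest FPU/xstate context */
}

/* New model: the guest context stays live for the whole KVM_RUN ioctl. */
void new_sched_out(struct fpu_sim *s)
{
	sim_save(s);			/* context switch saves the live guest state */
}

void new_sched_in(struct fpu_sim *s)
{
	sim_load(s, OWNER_GUEST);	/* context switch restores the guest state */
}
```

Under these assumptions one schedule-out/schedule-in pair costs three saves and three loads in the old model and one of each in the new one. With an FPU+xstate load on the order of 0.1us against roughly 20us for a cross-CPU ping-pong, that difference is small per round trip.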