On Fri, 2018-09-14 at 19:29 +0200, Paolo Bonzini wrote:
> On 22/08/2018 23:57, Sean Christopherson wrote:
> > 
> > A VMEnter that VMFails (as opposed to VMExits) does not touch host
> > state beyond registers that are explicitly noted in the VMFail path,
> > e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> > is only signaled for consistency checks that occur before the CPU
> > starts to load guest state, i.e. there is no need to restore any
> > state as nothing has been modified.  But in the case where a VMFail
> > is detected by hardware and not by KVM (due to deferring consistency
> > checks to hardware), KVM has already loaded some amount of guest
> > state.  Luckily, "loaded" only means loaded to KVM's software model,
> > i.e. vmcs01 has not been modified.  So, unwind our software model to
> > the pre-VMEntry host state.
> > 
> > Not restoring host state in this VMFail path leads to a variety of
> > failures because we end up with stale data in vcpu->arch, e.g. CR0,
> > CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> > significant delta in the stale data is all but guaranteed to crash
> > L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
> > 
> > An alternative to this "soft" reload would be to load host state from
> > vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> > wildly inconsistent with respect to the VMX architecture, e.g. an L1
> > VMM with separate VMExit and VMFail paths would explode.
> > 
> > Note that this approach does not mean KVM is 100% accurate with
> > respect to VMX hardware behavior, even at an architectural level
> > (the exact order of consistency checks is microarchitecture specific).
> > But 100% emulation accuracy isn't the goal (with this patch), rather
> > the goal is to be consistent in the information delivered to L1, e.g.
> > a VMExit should not fall-through VMENTER, and a VMFail should not
> > jump to HOST_RIP.
> > 
> > This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> > context after VMLAUNCH/VMRESUME failure)", but retains the core
> > aspects of that patch, just in an open coded form due to the need to
> > pull state from vmcs01 instead of vmcs12.  Restoring host state
> > resolves a variety of issues introduced by commit "4f350c6dbcb9
> > (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> > which remedied the incorrect behavior of treating VMFail like VMExit
> > but in doing so neglected to restore arch state that had been modified
> > prior to attempting nested VMEnter.
> > 
> > A sample failure that occurs due to stale vcpu.arch state is a fault
> > of some form while emulating an LGDT (due to emulated UMIP) from L1
> > after a failed VMEntry to L3, in this case when running the KVM unit
> > test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
> > case due to a stale arch.cr4.UMIP.
> > 
> > L1:
> >   BUG: unable to handle kernel paging request at ffffc90000663b9e
> >   PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
> >   Oops: 0009 [#1] SMP
> >   CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >   RIP: 0010:native_load_gdt+0x0/0x10
> > 
> >   ...
> > 
> >   Call Trace:
> >    load_fixmap_gdt+0x22/0x30
> >    __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
> >    vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
> >    nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
> >    vmx_handle_exit+0x246/0x15a0 [kvm_intel]
> >    kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
> >    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >    do_vfs_ioctl+0x9f/0x600
> >    ksys_ioctl+0x66/0x70
> >    __x64_sys_ioctl+0x16/0x20
> >    do_syscall_64+0x4f/0x100
> >    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > L0:
> >   WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
> >   ...
> >   CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
> >   Hardware name: Intel Corporation Kabylake Client platform/KBL S
> >   RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
> > 
> >   ...
> > 
> >   Call Trace:
> >    kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
> >    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >    do_vfs_ioctl+0x9f/0x5e0
> >    ksys_ioctl+0x66/0x70
> >    __x64_sys_ioctl+0x16/0x20
> >    do_syscall_64+0x49/0xf0
> >    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> > Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> > Cc: Jim Mattson <jmattson@xxxxxxxxxx>
> > Cc: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> > Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> 
> Queued, thanks Sean.  Since I'm drowning in the submissions, there was
> no unit test for this patch, right? (hint, hint ;)

No unit test, but it's easy to hit by running the existing unit tests
in a VM, e.g. in L1 where L0 is buggy.

I don't see this in kvm/queue, any chance it got missed?  Or am I
looking in the wrong place?