On Fri, 2018-09-14 at 19:29 +0200, Paolo Bonzini wrote:
> On 22/08/2018 23:57, Sean Christopherson wrote:
> > 
> > A VMEnter that VMFails (as opposed to VMExits) does not touch host
> > state beyond registers that are explicitly noted in the VMFail path,
> > e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> > is only signaled for consistency checks that occur before the CPU
> > starts to load guest state, i.e. there is no need to restore any
> > state as nothing has been modified.  But in the case where a VMFail
> > is detected by hardware and not by KVM (due to deferring consistency
> > checks to hardware), KVM has already loaded some amount of guest
> > state.  Luckily, "loaded" only means loaded to KVM's software model,
> > i.e. vmcs01 has not been modified.  So, unwind our software model to
> > the pre-VMEntry host state.
> > 
> > Not restoring host state in this VMFail path leads to a variety of
> > failures because we end up with stale data in vcpu->arch, e.g. CR0,
> > CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> > significant delta in the stale data is all but guaranteed to crash
> > L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
> > 
> > An alternative to this "soft" reload would be to load host state from
> > vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> > wildly inconsistent with respect to the VMX architecture, e.g. an L1
> > VMM with separate VMExit and VMFail paths would explode.
> > 
> > Note that this approach does not mean KVM is 100% accurate with
> > respect to VMX hardware behavior, even at an architectural level
> > (the exact order of consistency checks is microarchitecture specific).
> > But 100% emulation accuracy isn't the goal (with this patch), rather
> > the goal is to be consistent in the information delivered to L1, e.g.
> > a VMExit should not fall-through VMENTER, and a VMFail should not
> > jump to HOST_RIP.
> > 
> > This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> > context after VMLAUNCH/VMRESUME failure)", but retains the core
> > aspects of that patch, just in an open coded form due to the need to
> > pull state from vmcs01 instead of vmcs12.  Restoring host state
> > resolves a variety of issues introduced by commit "4f350c6dbcb9
> > (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> > which remedied the incorrect behavior of treating VMFail like VMExit
> > but in doing so neglected to restore arch state that had been modified
> > prior to attempting nested VMEnter.
> > 
> > A sample failure that occurs due to stale vcpu.arch state is a fault
> > of some form while emulating an LGDT (due to emulated UMIP) from L1
> > after a failed VMEntry to L3, in this case when running the KVM unit
> > test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
> > case due to a stale arch.cr4.UMIP.
> > 
> > L1:
> >   BUG: unable to handle kernel paging request at ffffc90000663b9e
> >   PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
> >   Oops: 0009 [#1] SMP
> >   CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >   RIP: 0010:native_load_gdt+0x0/0x10
> > 
> >   ...
> > 
> >   Call Trace:
> >    load_fixmap_gdt+0x22/0x30
> >    __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
> >    vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
> >    nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
> >    vmx_handle_exit+0x246/0x15a0 [kvm_intel]
> >    kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
> >    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >    do_vfs_ioctl+0x9f/0x600
> >    ksys_ioctl+0x66/0x70
> >    __x64_sys_ioctl+0x16/0x20
> >    do_syscall_64+0x4f/0x100
> >    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > L0:
> >   WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
> >   ...
> >   CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
> >   Hardware name: Intel Corporation Kabylake Client platform/KBL S
> >   RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
> > 
> >   ...
> > 
> >   Call Trace:
> >    kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
> >    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >    do_vfs_ioctl+0x9f/0x5e0
> >    ksys_ioctl+0x66/0x70
> >    __x64_sys_ioctl+0x16/0x20
> >    do_syscall_64+0x49/0xf0
> >    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> > Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> > Cc: Jim Mattson <jmattson@xxxxxxxxxx>
> > Cc: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> > Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> 
> Queued, thanks Sean.  Since I'm drowning in the submissions, there was
> no unit test for this patch, right? (hint, hint ;)

No unit test, but it's easy to hit by running the existing unit tests
in a VM, e.g. in L1 where L0 is buggy.

I don't see this in kvm/queue, any chance it got missed?  Or am I
looking in the wrong place?