On Thu, 2018-07-19 at 15:13 -0700, Jim Mattson wrote:
> On Thu, Jul 19, 2018 at 10:56 AM, Sean Christopherson
> <sean.j.christopherson@xxxxxxxxx> wrote:
> >
> > A VMEnter that VMFails (as opposed to VMExits) does not touch host
> > state beyond registers that are explicitly noted in the VMFail path,
> > e.g. EFLAGS. Host state does not need to be loaded because VMFail
> > is only signaled for consistency checks that occur before the CPU
> > starts to load guest state, i.e. there is no need to restore any
> > state as nothing has been modified. But in the case where a VMFail
> > is detected by hardware and not by KVM (due to deferring consistency
> > checks to hardware), KVM has already loaded some amount of guest
> > state. Luckily, "loaded" only means loaded to KVM's software model,
> > i.e. vmcs01 has not been modified. So, unwind our software model to
> > the pre-VMEntry host state.
> Thank you, thank you, thank you. I've been hoping someone would take this on.
>
> >
> > Not restoring host state in this VMFail path leads to a variety of
> > failures because we end up with stale data in vcpu->arch, e.g. CR0,
> > CR4, EFER, etc... will all be out of sync relative to vmcs01. Any
> > significant delta in the stale data is all but guaranteed to crash
> > L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
> >
> > An alternative to this "soft" reload would be to load host state from
> > vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> > wildly inconsistent with respect to the VMX architecture, e.g. an L1
> > VMM with separate VMExit and VMFail paths would explode.
> >
> > Note that this approach does not mean KVM is 100% accurate with
> > respect to VMX hardware behavior, even at an architectural level
> > (the exact order of consistency checks is microarchitecture specific)
> > But 100% emulation accuracy isn't our goal, rather our goal is to be
> > consistent in the information delivered to L1, e.g. a VMExit should
> > not fall-through VMENTER, and a VMFail should not jump to HOST_RIP.
> > Achieving perfect emulation for consistency checks is a fool's errand,
> > e.g. KVM would need to manually detect *every* VMFail before checking
> > and loading any guest state (a consistency check VMExit never takes
> > priority over a consistency check VMFail). Even if KVM managed to
> > achieve perfection (and didn't crater performance in the process),
> > the honeymoon would only last until KVM encountered new hardware with
> > new consistency checks.
> It should be possible to check for every VMFail in software, at least
> for the hardware that's emulated. When new hardware introduces new
> consistency checks, presumably the new checks involve new VMCS fields
> or new VM-execution controls that kvm doesn't yet emulate. Part of
> emulating the new hardware should include emulating the new checks.

Somehow I glossed over the fact that KVM would fail the entry for any
unknown controls.  This blurb can be shortened to:

  Note that this approach does not mean KVM is 100% accurate with
  respect to VMX hardware behavior, even at an architectural level
  (the exact order of consistency checks is microarchitecture specific).
  But 100% emulation accuracy isn't the goal (with this patch); rather,
  the goal is to be consistent in the information delivered to L1, e.g.
  a VMExit should not fall through to VMENTER, and a VMFail should not
  jump to HOST_RIP.

> Moreover, kvm has to be able to check for every VMfail in software or
> risk priority inversion. For example, when processing the vmcs12
> VM-entry MSR-load list, kvm should not emulate a VM-exit for "VM-entry
> failure due to MSR loading" until it has verified that none of the
> control-field checks that were deferred to hardware would have
> resulted in a VMFail.
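
Agreed on the ordering requirement.  Purely for illustration (this is a
hand-wavy sketch, not code from this series; check_all_vmfail_conditions()
is invented, the other helpers mirror what already exists in vmx.c), the
emulated VMEnter would need to exhaust every check that can VMFail before
doing anything that can only VMExit:

  /* Sketch only: illustrates check ordering, not a real function. */
  static int nested_vmx_emulated_vmenter_sketch(struct kvm_vcpu *vcpu,
                                                struct vmcs12 *vmcs12)
  {
          u32 msr_fail_idx;

          /* Every control/host-state check that architecturally VMFails... */
          if (check_all_vmfail_conditions(vcpu, vmcs12)) {
                  nested_vmx_failValid(vcpu,
                                       VMXERR_ENTRY_INVALID_CONTROL_FIELD);
                  return 1;
          }

          /*
           * ...must be done before emulating "VM-entry failure due to MSR
           * loading", as that VMExit never takes priority over a VMFail.
           */
          msr_fail_idx = nested_vmx_load_msr(vcpu,
                                             vmcs12->vm_entry_msr_load_addr,
                                             vmcs12->vm_entry_msr_load_count);
          if (msr_fail_idx)
                  return nested_vmx_entry_failure(vcpu, vmcs12,
                                                  EXIT_REASON_MSR_LOAD_FAIL,
                                                  msr_fail_idx);

          return 0;
  }
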
> Of course, these concerns in no way detract from this welcome change.
>
> >
> > This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> > context after VMLAUNCH/VMRESUME failure)", but retains the core
> > aspects of that patch, just in an open coded form due to the need to
> > pull state from vmcs01 instead of vmcs12. Restoring host state
> > resolves a variety of issues introduced by commit "4f350c6dbcb9
> > (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> > which remedied the incorrect behavior of treating VMFail like VMExit
> > but in doing so neglected to restore arch state that had been modified
> > prior to attempting nested VMEnter.
> >
> > A sample failure that occurs due to stale vcpu.arch state is a fault
> > of some form while emulating an LGDT (due to emulated UMIP) from L1
> > after a failed VMEntry to L3, in this case when running the KVM
> > unit test test_tpr_threshold_values in L1. L0 also hits a WARN in
> > this case due to a stale arch.cr4.UMIP.
> >
> > L1:
> > BUG: unable to handle kernel paging request at ffffc90000663b9e
> > PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
> > Oops: 0009 [#1] SMP
> > CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G W 4.18.0-rc2+ #2
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > RIP: 0010:native_load_gdt+0x0/0x10
> >
> > ...
> >
> > Call Trace:
> > load_fixmap_gdt+0x22/0x30
> > __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
> > vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
> > nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
> > vmx_handle_exit+0x246/0x15a0 [kvm_intel]
> > kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
> > kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> > do_vfs_ioctl+0x9f/0x600
> > ksys_ioctl+0x66/0x70
> > __x64_sys_ioctl+0x16/0x20
> > do_syscall_64+0x4f/0x100
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > L0:
> > WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
> > ...
> > CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
> > Hardware name: Intel Corporation Kabylake Client platform/KBL S
> > RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
> >
> > ...
> >
> > Call Trace:
> > kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
> > kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> > do_vfs_ioctl+0x9f/0x5e0
> > ksys_ioctl+0x66/0x70
> > __x64_sys_ioctl+0x16/0x20
> > do_syscall_64+0x49/0xf0
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> > Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> > Cc: Jim Mattson <jmattson@xxxxxxxxxx>
> > Cc: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> > Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 153 insertions(+), 20 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 1689f433f3a0..eb5f49d2c46d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >  	kvm_clear_interrupt_queue(vcpu);
> >  }
> >
> > -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> > -				       struct vmcs12 *vmcs12)
> > -{
> > -	u32 entry_failure_code;
> > -
> > -	nested_ept_uninit_mmu_context(vcpu);
> > -
> > -	/*
> > -	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > -	 * couldn't have changed.
> > -	 */
> > -	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > -		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > -
> > -	if (!enable_ept)
> > -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> > -}
> > -
> >  /*
> >   * A part of what we need to when the nested L2 guest exits and we want to
> >   * run its L1 parent, is to reset L1's guest state to the host state specified
> > @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >  				   struct vmcs12 *vmcs12)
> >  {
> >  	struct kvm_segment seg;
> > +	u32 entry_failure_code;
> >
> >  	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> >  		vcpu->arch.efer = vmcs12->host_ia32_efer;
> > @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >  	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> >  	vmx_set_cr4(vcpu, vmcs12->host_cr4);
> >
> > -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +	nested_ept_uninit_mmu_context(vcpu);
> > +
> > +	/*
> > +	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > +	 * couldn't have changed.
> > +	 */
> > +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > +
> > +	if (!enable_ept)
> > +		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> >
> >  	/*
> >  	 * If vmcs01 don't use VPID, CPU flushes TLB on every
> > @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >  		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> >  }
> >
> > +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
> > +{
> > +	struct shared_msr_entry *efer_msr;
> > +	unsigned int i;
> > +
> > +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
> > +		return vmcs_read64(GUEST_IA32_EFER);
> > +
> > +	if (cpu_has_load_ia32_efer)
> > +		return host_efer;
> > +
> > +	for (i = 0; i < vmx->msr_autoload.nr; ++i) {
> > +		if (vmx->msr_autoload.guest[i].index == MSR_EFER)
> > +			return vmx->msr_autoload.guest[i].value;
> > +	}
> > +
> > +	efer_msr = find_msr_entry(vmx, MSR_EFER);
> > +	if (efer_msr)
> > +		return efer_msr->data;
> > +
> > +	return host_efer;
> > +}
> > +
> > +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct vmx_msr_entry g, h;
> > +	struct msr_data msr;
> > +	gpa_t gpa;
> > +	u32 i, j;
> > +
> > +	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
> > +
> > +	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
> > +		/*
> > +		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
> > +		 * as vmcs01.GUEST_DR7 contains a userspace defined value
> > +		 * and vcpu->arch.dr7 is not squirreled away before the
> > +		 * nested VMENTER (not worth adding a variable in nested_vmx).
> > +		 */
> > +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
> > +			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
> > +		else
> > +			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
> > +	}
> > +
> > +	/*
> > +	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
> > +	 * handle a variety of side effects to KVM's software model.
> > +	 */
> > +	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
> > +
> > +	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> > +	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
> > +
> > +	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> > +	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
> > +
> > +	nested_ept_uninit_mmu_context(vcpu);
> > +	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> > +	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> > +
> > +	/*
> > +	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> > +	 * from vmcs01 (if necessary). The PDPTRs are not loaded on
> > +	 * VMFail, like everything else we just need to ensure our
> > +	 * software model is up-to-date.
> > +	 */
> > +	ept_save_pdptrs(vcpu);
> > +
> > +	kvm_mmu_reset_context(vcpu);
> > +
> > +	if (cpu_has_vmx_msr_bitmap())
> > +		vmx_update_msr_bitmap(vcpu);
> > +
> > +	/*
> > +	 * This nasty bit of open coding is a compromise between blindly
> > +	 * loading L1's MSRs using the exit load lists (incorrect emulation
> > +	 * of VMFail), leaving the nested VM's MSRs in the software model
> > +	 * (incorrect behavior) and snapshotting the modified MSRs (too
> > +	 * expensive since the lists are unbound by hardware). For each
> > +	 * MSR that was (prematurely) loaded from the nested VMEntry load
> > +	 * list, reload it from the exit load list if it exists and differs
> > +	 * from the guest value. The intent is to stuff host state as
> > +	 * silently as possible, not to fully process the exit load list.
> > +	 */
> Though the MSR lists are unbounded by hardware, the number of MSRs
> emulated by kvm is bounded and small. Snapshotting the modified MSRs
> shouldn't be that expensive, even if the same small set of MSRs occurs
> over and over again in vmcs12's VM-entry MSR-load list.
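
True, the set of MSRs KVM actually emulates is small and bounded.  If we
were to go the snapshot route instead of the open coding below, I'd expect
it to look roughly like this (illustrative only; the struct and helper
names here are invented and are not part of this patch):

  /* Invented for illustration: record an MSR's pre-VMEnter value once. */
  struct nested_msr_snapshot {
          u32 index;
          u64 data;
  };

  static void nested_vmx_snapshot_msr(struct kvm_vcpu *vcpu,
                                      struct nested_msr_snapshot *snap,
                                      u32 *nr, u32 index)
  {
          struct msr_data msr = { .index = index, .host_initiated = false };
          u32 i;

          /* Only the first occurrence matters, i.e. the pre-VMEnter value. */
          for (i = 0; i < *nr; i++)
                  if (snap[i].index == index)
                          return;

          if (kvm_get_msr(vcpu, &msr))
                  return;

          snap[*nr].index = index;
          snap[*nr].data = msr.data;
          (*nr)++;
  }

On a hardware-detected VMFail, the snapshot would simply be replayed via
kvm_set_msr() to put the software model back, which would sidestep the
exit-load-list walk below entirely.
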
> > +	msr.host_initiated = false;
> > +	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> > +		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> > +		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> > +			pr_debug_ratelimited(
> > +				"%s read MSR index failed (%u, 0x%08llx)\n",
> > +				__func__, i, gpa);
> > +			goto vmabort;
> > +		}
> > +
> > +		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> > +			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> > +			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> > +				pr_debug_ratelimited(
> > +					"%s read MSR failed (%u, 0x%08llx)\n",
> > +					__func__, j, gpa);
> > +				goto vmabort;
> > +			}
> > +			if (h.index != g.index)
> > +				continue;
> > +			if (h.value == g.value)
> > +				break;
> > +
> > +			if (nested_vmx_load_msr_check(vcpu, &h)) {
> > +				pr_debug_ratelimited(
> > +					"%s check failed (%u, 0x%x, 0x%x)\n",
> > +					__func__, j, h.index, h.reserved);
> > +				goto vmabort;
> > +			}
> > +
> > +			msr.index = h.index;
> > +			msr.data = h.value;
> > +			if (kvm_set_msr(vcpu, &msr)) {
> > +				pr_debug_ratelimited(
> > +					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> > +					__func__, j, h.index, h.value);
> > +				goto vmabort;
> > +			}
> > +		}
> > +	}
> > +
> > +	return;
> > +
> > +vmabort:
> > +	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> > +}
> > +
> >  /*
> >   * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
> >   * and modify vmcs12 to make it see what it would expect to see there if
> > @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
> >  	 */
> >  	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
> >
> > -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +	/*
> > +	 * Restore L1's host state to KVM's software model. We're here
> > +	 * because a consistency check was caught by hardware, which
> > +	 * means some amount of guest state has been propagated to KVM's
> > +	 * model and needs to be unwound to the host's state.
> > +	 */
> > +	nested_vmx_restore_host_state(vcpu);
> >
> >  	/*
> >  	 * The emulated instruction was already skipped in
> > --
> > 2.18.0
> >
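
To summarize the end state for anyone skimming the thread, the tail of
nested_vmx_vmexit() ends up behaving roughly like this (condensed pseudo-view,
not the literal diff context; the real function has more in between):

  /*
   * Condensed view of the VMExit vs. VMFail paths after this patch.
   */
  if (likely(!vmx->fail)) {
          /* Architectural VMExit: L1's host state comes from vmcs12. */
          prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
                         exit_qualification);
          load_vmcs12_host_state(vcpu, vmcs12);
  } else {
          /* Hardware-detected VMFail: report it to L1 as VMFailValid... */
          nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);

          /* ...and unwind KVM's software model back to vmcs01's state. */
          nested_vmx_restore_host_state(vcpu);
  }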