Re: [PATCH] KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2018-07-23 at 15:26 -0700, Krish Sadhukhan wrote:
> 
> On 07/19/2018 10:56 AM, Sean Christopherson wrote:
> > 
> > A VMEnter that VMFails (as opposed to VMExits) does not touch host
> > state beyond registers that are explicitly noted in the VMFail path,
> > e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> > is only signaled for consistency checks that occur before the CPU
> > starts to load guest state, i.e. there is no need to restore any
> > state as nothing has been modified.  But in the case where a VMFail
> > is detected by hardware and not by KVM (due to deferring consistency
> > checks to hardware), KVM has already loaded some amount of guest
> > state.  Luckily, "loaded" only means loaded to KVM's software model,
> > i.e. vmcs01 has not been modified.  So, unwind our software model to
> > the pre-VMEntry host state.
> > 
> > Not restoring host state in this VMFail path leads to a variety of
> > failures because we end up with stale data in vcpu->arch, e.g. CR0,
> > CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> > significant delta in the stale data is all but guaranteed to crash
> > L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
> > 
> > An alternative to this "soft" reload would be to load host state from
> > vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> > wildly inconsistent with respect to the VMX architecture, e.g. an L1
> > VMM with separate VMExit and VMFail paths would explode.
> > 
> > Note that this approach does not mean KVM is 100% accurate with
> > respect to VMX hardware behavior, even at an architectural level
> > (the exact order of consistency checks is microarchitecture specific)
> > But 100% emulation accuracy isn't our goal, rather our goal is to be
> > consistent in the information delivered to L1, e.g. a VMExit should
> > not fall-through VMENTER, and a VMFail should not jump to HOST_RIP.
> > Achieving perfect emulation for consistency checks is a fool's errand,
> > e.g. KVM would need to manually detect *every* VMFail before checking
> > and loading any guest state (a consistency check VMExit never takes
> > priority over a consistency check VMFail).  Even if KVM managed to
> > achieve perfection (and didn't crater performance in the process),
> > the honeymoon would only last until KVM encountered new hardware with
> > new consistency checks.
> > 
> > This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> > context after VMLAUNCH/VMRESUME failure)", but retains the core
> > aspects of that patch, just in an open coded form due to the need to
> > pull state from vmcs01 instead of vmcs12.  Restoring host state
> > resolves a variety of issues introduced by commit "4f350c6dbcb9
> > (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> > which remedied the incorrect behavior of treating VMFail like VMExit
> > but in doing so neglected to restore arch state that had been modified
> > prior to attempting nested VMEnter.
> > 
> > A sample failure that occurs due to stale vcpu.arch state is a fault
> > of some form while emulating an LGDT (due to emulated UMIP) from L1
> > L1 after a failed VMEntry to L3, in this case when running the KVM
> Nit:  "L1"  is repeated twice here.
> 
> > 
> > unit test test_tpr_threshold_values in L1.  L0 also hits a WARN in
> > this case due to a stale arch.cr4.UMIP.
> > 
> > L1:
> >    BUG: unable to handle kernel paging request at ffffc90000663b9e
> >    PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
> >    Oops: 0009 [#1] SMP
> >    CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
> >    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >    RIP: 0010:native_load_gdt+0x0/0x10
> > 
> >    ...
> > 
> >    Call Trace:
> >     load_fixmap_gdt+0x22/0x30
> >     __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
> >     vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
> >     nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
> >     vmx_handle_exit+0x246/0x15a0 [kvm_intel]
> >     kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
> >     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >     do_vfs_ioctl+0x9f/0x600
> >     ksys_ioctl+0x66/0x70
> >     __x64_sys_ioctl+0x16/0x20
> >     do_syscall_64+0x4f/0x100
> >     entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > L0:
> >    WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
> >    ...
> >    CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
> >    Hardware name: Intel Corporation Kabylake Client platform/KBL S
> >    RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
> > 
> >    ...
> > 
> >    Call Trace:
> >     kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
> >     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >     do_vfs_ioctl+0x9f/0x5e0
> >     ksys_ioctl+0x66/0x70
> >     __x64_sys_ioctl+0x16/0x20
> >     do_syscall_64+0x49/0xf0
> >     entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> > Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> > Cc: Jim Mattson <jmattson@xxxxxxxxxx>
> > Cc: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> > Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > ---
> >   arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
> >   1 file changed, 153 insertions(+), 20 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 1689f433f3a0..eb5f49d2c46d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >   	kvm_clear_interrupt_queue(vcpu);
> >   }
> >   
> > -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> > -			struct vmcs12 *vmcs12)
> > -{
> > -	u32 entry_failure_code;
> > -
> > -	nested_ept_uninit_mmu_context(vcpu);
> > -
> > -	/*
> > -	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > -	 * couldn't have changed.
> > -	 */
> > -	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > -		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > -
> > -	if (!enable_ept)
> > -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> > -}
> > -
> >   /*
> >    * A part of what we need to when the nested L2 guest exits and we want to
> >    * run its L1 parent, is to reset L1's guest state to the host state specified
> > @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >   				   struct vmcs12 *vmcs12)
> >   {
> >   	struct kvm_segment seg;
> > +	u32 entry_failure_code;
> >   
> >   	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> >   		vcpu->arch.efer = vmcs12->host_ia32_efer;
> > @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >   	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> >   	vmx_set_cr4(vcpu, vmcs12->host_cr4);
> >   
> > -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +	nested_ept_uninit_mmu_context(vcpu);
> > +
> > +	/*
> > +	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > +	 * couldn't have changed.
> > +	 */
> > +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > +
> > +	if (!enable_ept)
> > +		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> >   
> >   	/*
> >   	 * If vmcs01 don't use VPID, CPU flushes TLB on every
> > @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >   		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> >   }
> >   
> > +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
> > +{
> > +	struct shared_msr_entry *efer_msr;
> > +	unsigned int i;
> > +
> > +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
> > +		return vmcs_read64(GUEST_IA32_EFER);
> > +
> > +	if (cpu_has_load_ia32_efer)
> > +		return host_efer;
> > +
> > +	for (i = 0; i < vmx->msr_autoload.nr; ++i) {
> > +		if (vmx->msr_autoload.guest[i].index == MSR_EFER)
> > +			return vmx->msr_autoload.guest[i].value;
> > +	}
> > +
> > +	efer_msr = find_msr_entry(vmx, MSR_EFER);
> > +	if (efer_msr)
> > +		return efer_msr->data;
> > +
> > +	return host_efer;
> > +}
> > +
> > +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> For readability purpose, should we add some comments outlining the 
> prominent elements that we are restoring in this function ?

Hmm what about a high level comment?  I'm hesistant to list things
explicitly as that tends to result in stale comments.  And I didn't
have a method for choosing what (not) to restore beyond ensuring
everything that might have been touched in enter_vmx_non_root_mode()
was restored.

Something like:

  Unwind KVM's software model, i.e. vcpu->arch, to L1's pre-VMEnter
  state, i.e. L1 host state, pulling data from vmcs01.  The rule of
  thumb is that anything touched by enter_vmx_non_root_mode() needs
  to be restored.

> > 
> > +{
> > +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct vmx_msr_entry g, h;
> > +	struct msr_data msr;
> > +	gpa_t gpa;
> > +	u32 i, j;
> > +
> > +	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
> > +
> > +	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
> > +		/*
> > +		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
> > +		 * as vmcs01.GUEST_DR7 contains a userspace defined value
> > +		 * and vcpu->arch.dr7 is not squirreled away before the
> > +		 * nested VMENTER (not worth adding a variable in nested_vmx).
> > +		 */
> > +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
> > +			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
> > +		else
> > +			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
> > +	}
> > +
> > +	/*
> > +	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
> > +	 * handle a variety of side effects to KVM's software model.
> > +	 */
> > +	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
> > +
> > +	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> > +	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
> > +
> > +	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> > +	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
> > +
> > +	nested_ept_uninit_mmu_context(vcpu);
> > +	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> > +	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> > +
> > +	/*
> > +	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> > +	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> > +	 * VMFail, like everything else we just need to ensure our
> > +	 * software model is up-to-date.
> > +	 */
> > +	ept_save_pdptrs(vcpu);
> > +
> > +	kvm_mmu_reset_context(vcpu);
> > +
> > +	if (cpu_has_vmx_msr_bitmap())
> > +		vmx_update_msr_bitmap(vcpu);
> > +
> > +	/*
> > +	 * This nasty bit of open coding is a compromise between blindly
> > +	 * loading L1's MSRs using the exit load lists (incorrect emulation
> > +	 * of VMFail), leaving the nested VM's MSRs in the software model
> > +	 * (incorrect behavior) and snapshotting the modified MSRs (too
> > +	 * expensive since the lists are unbound by hardware).  For each
> > +	 * MSR that was (prematurely) loaded from the nested VMEntry load
> > +	 * list, reload it from the exit load list if it exists and differs
> > +	 * from the guest value.  The intent is to stuff host state as
> > +	 * silently as possible, not to fully process the exit load list.
> > +	 */
> > +	msr.host_initiated = false;
> > +	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> > +		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> > +		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> > +			pr_debug_ratelimited(
> > +				"%s read MSR index failed (%u, 0x%08llx)\n",
> > +				__func__, i, gpa);
> > +			goto vmabort;
> > +		}
> > +
> > +		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> > +			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> > +			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> > +				pr_debug_ratelimited(
> > +					"%s read MSR failed (%u, 0x%08llx)\n",
> > +					__func__, j, gpa);
> > +				goto vmabort;
> > +			}
> > +			if (h.index != g.index)
> > +				continue;
> > +			if (h.value == g.value)
> > +				break;
> > +
> > +			if (nested_vmx_load_msr_check(vcpu, &h)) {
> > +				pr_debug_ratelimited(
> > +					"%s check failed (%u, 0x%x, 0x%x)\n",
> > +					__func__, j, h.index, h.reserved);
> > +				goto vmabort;
> > +			}
> > +
> > +			msr.index = h.index;
> > +			msr.data = h.value;
> > +			if (kvm_set_msr(vcpu, &msr)) {
> > +				pr_debug_ratelimited(
> > +					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> > +					__func__, j, h.index, h.value);
> > +				goto vmabort;
> > +			}
> > +		}
> > +	}
> > +
> Is it possible to re-use nested_vmx_load_msr() instead of duplicating 
> the same loops ?

We could use nested_vmx_load_msr() as-is to directly load on-exit list,
but that has the downside of potentially overwriting MSRs that should
not have been touched, or triggering side effects that should not have
occurred.  I considered modifying nested_vmx_load_msr() to splice in
this optional behavior, but the resulting code would be hideous and I
didn't want to convolute an otherwise straightforward function for this
one-off case.

> > 
> > +	return;
> > +
> > +vmabort:
> > +	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> > +}
> > +
> >   /*
> >    * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
> >    * and modify vmcs12 to make it see what it would expect to see there if
> > @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
> >   	 */
> >   	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
> >   
> > -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +	/*
> > +	 * Restore L1's host state to KVM's software model.  We're here
> > +	 * because a consistency check was caught by hardware, which
> > +	 * means some amount of guest state has been propagated to KVM's
> > +	 * model and needs to be unwound to the host's state.
> > +	 */
> > +	nested_vmx_restore_host_state(vcpu);
> >   
> >   	/*
> >   	 * The emulated instruction was already skipped in



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux