Re: [PATCH v2] KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Fri, 14 Sep 2018 19:29:19 +0200

On 22/08/2018 23:57, Sean Christopherson wrote:
> A VMEnter that VMFails (as opposed to VMExits) does not touch host
> state beyond registers that are explicitly noted in the VMFail path,
> e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> is only signaled for consistency checks that occur before the CPU
> starts to load guest state, i.e. there is no need to restore any
> state as nothing has been modified.  But in the case where a VMFail
> is detected by hardware and not by KVM (due to deferring consistency
> checks to hardware), KVM has already loaded some amount of guest
> state.  Luckily, "loaded" only means loaded to KVM's software model,
> i.e. vmcs01 has not been modified.  So, unwind our software model to
> the pre-VMEntry host state.
> 
> Not restoring host state in this VMFail path leads to a variety of
> failures because we end up with stale data in vcpu->arch, e.g. CR0,
> CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> significant delta in the stale data is all but guaranteed to crash
> L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
> 
> An alternative to this "soft" reload would be to load host state from
> vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> wildly inconsistent with respect to the VMX architecture, e.g. an L1
> VMM with separate VMExit and VMFail paths would explode.
> 
> Note that this approach does not mean KVM is 100% accurate with
> respect to VMX hardware behavior, even at an architectural level
> (the exact order of consistency checks is microarchitecture specific).
> But 100% emulation accuracy isn't the goal (with this patch), rather
> the goal is to be consistent in the information delivered to L1, e.g.
> a VMExit should not fall-through VMENTER, and a VMFail should not jump
> to HOST_RIP.
> 
> This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> context after VMLAUNCH/VMRESUME failure)", but retains the core
> aspects of that patch, just in an open coded form due to the need to
> pull state from vmcs01 instead of vmcs12.  Restoring host state
> resolves a variety of issues introduced by commit "4f350c6dbcb9
> (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> which remedied the incorrect behavior of treating VMFail like VMExit
> but in doing so neglected to restore arch state that had been modified
> prior to attempting nested VMEnter.
> 
> A sample failure that occurs due to stale vcpu.arch state is a fault
> of some form while emulating an LGDT (due to emulated UMIP) from L1
> after a failed VMEntry to L3, in this case when running the KVM unit
> test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
> case due to a stale arch.cr4.UMIP.
> 
> L1:
>   BUG: unable to handle kernel paging request at ffffc90000663b9e
>   PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
>   Oops: 0009 [#1] SMP
>   CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:native_load_gdt+0x0/0x10
> 
>   ...
> 
>   Call Trace:
>    load_fixmap_gdt+0x22/0x30
>    __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
>    vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
>    nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
>    vmx_handle_exit+0x246/0x15a0 [kvm_intel]
>    kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
>    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>    do_vfs_ioctl+0x9f/0x600
>    ksys_ioctl+0x66/0x70
>    __x64_sys_ioctl+0x16/0x20
>    do_syscall_64+0x4f/0x100
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> L0:
>   WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
>   ...
>   CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
>   Hardware name: Intel Corporation Kabylake Client platform/KBL S
>   RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
> 
>   ...
> 
>   Call Trace:
>    kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
>    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>    do_vfs_ioctl+0x9f/0x5e0
>    ksys_ioctl+0x66/0x70
>    __x64_sys_ioctl+0x16/0x20
>    do_syscall_64+0x49/0xf0
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> Cc: Jim Mattson <jmattson@xxxxxxxxxx>
> Cc: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>

Queued, thanks Sean.  Since I'm drowning in the submissions, there was
no unit test for this patch, right? (hint, hint ;)

Paolo

> ---
> 
> v2: Rebased to latest kvm/nest, addressed a few issues in the commit
>     message.
> 
> 
>  arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 153 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 8dae47e7267a..b6314eb2c8b8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -13021,24 +13021,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	kvm_clear_interrupt_queue(vcpu);
>  }
>  
> -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> -			struct vmcs12 *vmcs12)
> -{
> -	u32 entry_failure_code;
> -
> -	nested_ept_uninit_mmu_context(vcpu);
> -
> -	/*
> -	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> -	 * couldn't have changed.
> -	 */
> -	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> -		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> -
> -	if (!enable_ept)
> -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> -}
> -
>  /*
>   * A part of what we need to when the nested L2 guest exits and we want to
>   * run its L1 parent, is to reset L1's guest state to the host state specified
> @@ -13052,6 +13034,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  				   struct vmcs12 *vmcs12)
>  {
>  	struct kvm_segment seg;
> +	u32 entry_failure_code;
>  
>  	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>  		vcpu->arch.efer = vmcs12->host_ia32_efer;
> @@ -13078,7 +13061,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
>  	vmx_set_cr4(vcpu, vmcs12->host_cr4);
>  
> -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +	nested_ept_uninit_mmu_context(vcpu);
> +
> +	/*
> +	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> +	 * couldn't have changed.
> +	 */
> +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> +
> +	if (!enable_ept)
> +		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
>  
>  	/*
>  	 * If vmcs01 don't use VPID, CPU flushes TLB on every
> @@ -13174,6 +13167,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
>  }
>  
> +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
> +{
> +	struct shared_msr_entry *efer_msr;
> +	unsigned int i;
> +
> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
> +		return vmcs_read64(GUEST_IA32_EFER);
> +
> +	if (cpu_has_load_ia32_efer)
> +		return host_efer;
> +
> +	for (i = 0; i < vmx->msr_autoload.guest.nr; ++i) {
> +		if (vmx->msr_autoload.guest.val[i].index == MSR_EFER)
> +			return vmx->msr_autoload.guest.val[i].value;
> +	}
> +
> +	efer_msr = find_msr_entry(vmx, MSR_EFER);
> +	if (efer_msr)
> +		return efer_msr->data;
> +
> +	return host_efer;
> +}
> +
> +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmx_msr_entry g, h;
> +	struct msr_data msr;
> +	gpa_t gpa;
> +	u32 i, j;
> +
> +	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
> +
> +	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
> +		/*
> +		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
> +		 * as vmcs01.GUEST_DR7 contains a userspace defined value
> +		 * and vcpu->arch.dr7 is not squirreled away before the
> +		 * nested VMENTER (not worth adding a variable in nested_vmx).
> +		 */
> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
> +			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
> +		else
> +			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
> +	}
> +
> +	/*
> +	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
> +	 * handle a variety of side effects to KVM's software model.
> +	 */
> +	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
> +
> +	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> +	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
> +
> +	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> +	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
> +
> +	nested_ept_uninit_mmu_context(vcpu);
> +	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> +	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> +
> +	/*
> +	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> +	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> +	 * VMFail, like everything else we just need to ensure our
> +	 * software model is up-to-date.
> +	 */
> +	ept_save_pdptrs(vcpu);
> +
> +	kvm_mmu_reset_context(vcpu);
> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmx_update_msr_bitmap(vcpu);
> +
> +	/*
> +	 * This nasty bit of open coding is a compromise between blindly
> +	 * loading L1's MSRs using the exit load lists (incorrect emulation
> +	 * of VMFail), leaving the nested VM's MSRs in the software model
> +	 * (incorrect behavior) and snapshotting the modified MSRs (too
> +	 * expensive since the lists are unbound by hardware).  For each
> +	 * MSR that was (prematurely) loaded from the nested VMEntry load
> +	 * list, reload it from the exit load list if it exists and differs
> +	 * from the guest value.  The intent is to stuff host state as
> +	 * silently as possible, not to fully process the exit load list.
> +	 */
> +	msr.host_initiated = false;
> +	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> +		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> +		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> +			pr_debug_ratelimited(
> +				"%s read MSR index failed (%u, 0x%08llx)\n",
> +				__func__, i, gpa);
> +			goto vmabort;
> +		}
> +
> +		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> +			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> +			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> +				pr_debug_ratelimited(
> +					"%s read MSR failed (%u, 0x%08llx)\n",
> +					__func__, j, gpa);
> +				goto vmabort;
> +			}
> +			if (h.index != g.index)
> +				continue;
> +			if (h.value == g.value)
> +				break;
> +
> +			if (nested_vmx_load_msr_check(vcpu, &h)) {
> +				pr_debug_ratelimited(
> +					"%s check failed (%u, 0x%x, 0x%x)\n",
> +					__func__, j, h.index, h.reserved);
> +				goto vmabort;
> +			}
> +
> +			msr.index = h.index;
> +			msr.data = h.value;
> +			if (kvm_set_msr(vcpu, &msr)) {
> +				pr_debug_ratelimited(
> +					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> +					__func__, j, h.index, h.value);
> +				goto vmabort;
> +			}
> +		}
> +	}
> +
> +	return;
> +
> +vmabort:
> +	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> +}
> +
>  /*
>   * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
>   * and modify vmcs12 to make it see what it would expect to see there if
> @@ -13323,7 +13450,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
>  	 */
>  	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>  
> -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +	/*
> +	 * Restore L1's host state to KVM's software model.  We're here
> +	 * because a consistency check was caught by hardware, which
> +	 * means some amount of guest state has been propagated to KVM's
> +	 * model and needs to be unwound to the host's state.
> +	 */
> +	nested_vmx_restore_host_state(vcpu);
>  
>  	/*
>  	 * The emulated instruction was already skipped in
>