On Mon, 2018-07-23 at 15:26 -0700, Krish Sadhukhan wrote: > > On 07/19/2018 10:56 AM, Sean Christopherson wrote: > > > > A VMEnter that VMFails (as opposed to VMExits) does not touch host > > state beyond registers that are explicitly noted in the VMFail path, > > e.g. EFLAGS. Host state does not need to be loaded because VMFail > > is only signaled for consistency checks that occur before the CPU > > starts to load guest state, i.e. there is no need to restore any > > state as nothing has been modified. But in the case where a VMFail > > is detected by hardware and not by KVM (due to deferring consistency > > checks to hardware), KVM has already loaded some amount of guest > > state. Luckily, "loaded" only means loaded to KVM's software model, > > i.e. vmcs01 has not been modified. So, unwind our software model to > > the pre-VMEntry host state. > > > > Not restoring host state in this VMFail path leads to a variety of > > failures because we end up with stale data in vcpu->arch, e.g. CR0, > > CR4, EFER, etc... will all be out of sync relative to vmcs01. Any > > significant delta in the stale data is all but guaranteed to crash > > L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong. > > > > An alternative to this "soft" reload would be to load host state from > > vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is > > wildly inconsistent with respect to the VMX architecture, e.g. an L1 > > VMM with separate VMExit and VMFail paths would explode. > > > > Note that this approach does not mean KVM is 100% accurate with > > respect to VMX hardware behavior, even at an architectural level > > (the exact order of consistency checks is microarchitecture specific) > > But 100% emulation accuracy isn't our goal, rather our goal is to be > > consistent in the information delivered to L1, e.g. a VMExit should > > not fall-through VMENTER, and a VMFail should not jump to HOST_RIP. > > Achieving perfect emulation for consistency checks is a fool's errand, > > e.g. KVM would need to manually detect *every* VMFail before checking > > and loading any guest state (a consistency check VMExit never takes > > priority over a consistency check VMFail). Even if KVM managed to > > achieve perfection (and didn't crater performance in the process), > > the honeymoon would only last until KVM encountered new hardware with > > new consistency checks. > > > > This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu > > context after VMLAUNCH/VMRESUME failure)", but retains the core > > aspects of that patch, just in an open coded form due to the need to > > pull state from vmcs01 instead of vmcs12. Restoring host state > > resolves a variety of issues introduced by commit "4f350c6dbcb9 > > (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)", > > which remedied the incorrect behavior of treating VMFail like VMExit > > but in doing so neglected to restore arch state that had been modified > > prior to attempting nested VMEnter. > > > > A sample failure that occurs due to stale vcpu.arch state is a fault > > of some form while emulating an LGDT (due to emulated UMIP) from L1 > > L1 after a failed VMEntry to L3, in this case when running the KVM > Nit: "L1" is repeated twice here. > > > > > unit test test_tpr_threshold_values in L1. L0 also hits a WARN in > > this case due to a stale arch.cr4.UMIP. > > > > L1: > > BUG: unable to handle kernel paging request at ffffc90000663b9e > > PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163 > > Oops: 0009 [#1] SMP > > CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G W 4.18.0-rc2+ #2 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > > RIP: 0010:native_load_gdt+0x0/0x10 > > > > ... > > > > Call Trace: > > load_fixmap_gdt+0x22/0x30 > > __vmx_load_host_state+0x10e/0x1c0 [kvm_intel] > > vmx_switch_vmcs+0x2d/0x50 [kvm_intel] > > nested_vmx_vmexit+0x222/0x9c0 [kvm_intel] > > vmx_handle_exit+0x246/0x15a0 [kvm_intel] > > kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm] > > kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm] > > do_vfs_ioctl+0x9f/0x600 > > ksys_ioctl+0x66/0x70 > > __x64_sys_ioctl+0x16/0x20 > > do_syscall_64+0x4f/0x100 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > L0: > > WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel] > > ... > > CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76 > > Hardware name: Intel Corporation Kabylake Client platform/KBL S > > RIP: 0010:handle_desc+0x28/0x30 [kvm_intel] > > > > ... > > > > Call Trace: > > kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm] > > kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm] > > do_vfs_ioctl+0x9f/0x5e0 > > ksys_ioctl+0x66/0x70 > > __x64_sys_ioctl+0x16/0x20 > > do_syscall_64+0x49/0xf0 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure) > > Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly) > > Cc: Jim Mattson <jmattson@xxxxxxxxxx> > > Cc: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx> > > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx> > > Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx> > > Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> > > --- > > arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 153 insertions(+), 20 deletions(-) > > > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > > index 1689f433f3a0..eb5f49d2c46d 100644 > > --- a/arch/x86/kvm/vmx.c > > +++ b/arch/x86/kvm/vmx.c > > @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, > > kvm_clear_interrupt_queue(vcpu); > > } > > > > -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu, > > - struct vmcs12 *vmcs12) > > -{ > > - u32 entry_failure_code; > > - > > - nested_ept_uninit_mmu_context(vcpu); > > - > > - /* > > - * Only PDPTE load can fail as the value of cr3 was checked on entry and > > - * couldn't have changed. > > - */ > > - if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code)) > > - nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL); > > - > > - if (!enable_ept) > > - vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault; > > -} > > - > > /* > > * A part of what we need to when the nested L2 guest exits and we want to > > * run its L1 parent, is to reset L1's guest state to the host state specified > > @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, > > struct vmcs12 *vmcs12) > > { > > struct kvm_segment seg; > > + u32 entry_failure_code; > > > > if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER) > > vcpu->arch.efer = vmcs12->host_ia32_efer; > > @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, > > vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK); > > vmx_set_cr4(vcpu, vmcs12->host_cr4); > > > > - load_vmcs12_mmu_host_state(vcpu, vmcs12); > > + nested_ept_uninit_mmu_context(vcpu); > > + > > + /* > > + * Only PDPTE load can fail as the value of cr3 was checked on entry and > > + * couldn't have changed. > > + */ > > + if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code)) > > + nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL); > > + > > + if (!enable_ept) > > + vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault; > > > > /* > > * If vmcs01 don't use VPID, CPU flushes TLB on every > > @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, > > nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL); > > } > > > > +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx) > > +{ > > + struct shared_msr_entry *efer_msr; > > + unsigned int i; > > + > > + if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER) > > + return vmcs_read64(GUEST_IA32_EFER); > > + > > + if (cpu_has_load_ia32_efer) > > + return host_efer; > > + > > + for (i = 0; i < vmx->msr_autoload.nr; ++i) { > > + if (vmx->msr_autoload.guest[i].index == MSR_EFER) > > + return vmx->msr_autoload.guest[i].value; > > + } > > + > > + efer_msr = find_msr_entry(vmx, MSR_EFER); > > + if (efer_msr) > > + return efer_msr->data; > > + > > + return host_efer; > > +} > > + > > +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu) > For readability purpose, should we add some comments outlining the > prominent elements that we are restoring in this function ? Hmm what about a high level comment? I'm hesistant to list things explicitly as that tends to result in stale comments. And I didn't have a method for choosing what (not) to restore beyond ensuring everything that might have been touched in enter_vmx_non_root_mode() was restored. Something like: Unwind KVM's software model, i.e. vcpu->arch, to L1's pre-VMEnter state, i.e. L1 host state, pulling data from vmcs01. The rule of thumb is that anything touched by enter_vmx_non_root_mode() needs to be restored. > > > > +{ > > + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); > > + struct vcpu_vmx *vmx = to_vmx(vcpu); > > + struct vmx_msr_entry g, h; > > + struct msr_data msr; > > + gpa_t gpa; > > + u32 i, j; > > + > > + vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT); > > + > > + if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) { > > + /* > > + * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set > > + * as vmcs01.GUEST_DR7 contains a userspace defined value > > + * and vcpu->arch.dr7 is not squirreled away before the > > + * nested VMENTER (not worth adding a variable in nested_vmx). > > + */ > > + if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) > > + kvm_set_dr(vcpu, 7, DR7_FIXED_1); > > + else > > + WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7))); > > + } > > + > > + /* > > + * Note that calling vmx_set_{efer,cr0,cr4} is important as they > > + * handle a variety of side effects to KVM's software model. > > + */ > > + vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx)); > > + > > + vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS; > > + vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW)); > > + > > + vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK); > > + vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW)); > > + > > + nested_ept_uninit_mmu_context(vcpu); > > + vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); > > + __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail); > > + > > + /* > > + * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs > > + * from vmcs01 (if necessary). The PDPTRs are not loaded on > > + * VMFail, like everything else we just need to ensure our > > + * software model is up-to-date. > > + */ > > + ept_save_pdptrs(vcpu); > > + > > + kvm_mmu_reset_context(vcpu); > > + > > + if (cpu_has_vmx_msr_bitmap()) > > + vmx_update_msr_bitmap(vcpu); > > + > > + /* > > + * This nasty bit of open coding is a compromise between blindly > > + * loading L1's MSRs using the exit load lists (incorrect emulation > > + * of VMFail), leaving the nested VM's MSRs in the software model > > + * (incorrect behavior) and snapshotting the modified MSRs (too > > + * expensive since the lists are unbound by hardware). For each > > + * MSR that was (prematurely) loaded from the nested VMEntry load > > + * list, reload it from the exit load list if it exists and differs > > + * from the guest value. The intent is to stuff host state as > > + * silently as possible, not to fully process the exit load list. > > + */ > > + msr.host_initiated = false; > > + for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) { > > + gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g)); > > + if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) { > > + pr_debug_ratelimited( > > + "%s read MSR index failed (%u, 0x%08llx)\n", > > + __func__, i, gpa); > > + goto vmabort; > > + } > > + > > + for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) { > > + gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h)); > > + if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) { > > + pr_debug_ratelimited( > > + "%s read MSR failed (%u, 0x%08llx)\n", > > + __func__, j, gpa); > > + goto vmabort; > > + } > > + if (h.index != g.index) > > + continue; > > + if (h.value == g.value) > > + break; > > + > > + if (nested_vmx_load_msr_check(vcpu, &h)) { > > + pr_debug_ratelimited( > > + "%s check failed (%u, 0x%x, 0x%x)\n", > > + __func__, j, h.index, h.reserved); > > + goto vmabort; > > + } > > + > > + msr.index = h.index; > > + msr.data = h.value; > > + if (kvm_set_msr(vcpu, &msr)) { > > + pr_debug_ratelimited( > > + "%s WRMSR failed (%u, 0x%x, 0x%llx)\n", > > + __func__, j, h.index, h.value); > > + goto vmabort; > > + } > > + } > > + } > > + > Is it possible to re-use nested_vmx_load_msr() instead of duplicating > the same loops ? We could use nested_vmx_load_msr() as-is to directly load on-exit list, but that has the downside of potentially overwriting MSRs that should not have been touched, or triggering side effects that should not have occurred. I considered modifying nested_vmx_load_msr() to splice in this optional behavior, but the resulting code would be hideous and I didn't want to convolute an otherwise straightforward function for this one-off case. > > > > + return; > > + > > +vmabort: > > + nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL); > > +} > > + > > /* > > * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1 > > * and modify vmcs12 to make it see what it would expect to see there if > > @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, > > */ > > nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); > > > > - load_vmcs12_mmu_host_state(vcpu, vmcs12); > > + /* > > + * Restore L1's host state to KVM's software model. We're here > > + * because a consistency check was caught by hardware, which > > + * means some amount of guest state has been propagated to KVM's > > + * model and needs to be unwound to the host's state. > > + */ > > + nested_vmx_restore_host_state(vcpu); > > > > /* > > * The emulated instruction was already skipped in