On 20/05/19 22:10, Sean Christopherson wrote: > KVM does not have 100% coverage of VMX consistency checks, i.e. some > checks that cause VM-Fail may only be detected by hardware during a > nested VM-Entry. In such a case, KVM must restore L1's state to the > pre-VM-Enter state as L2's state has already been loaded into KVM's > software model. > > L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*. But > when EPT is disabled, the associated fields hold KVM's shadow values, > not L1's "real" values. Fortunately, when EPT is disabled the PDPTRs > come from memory, i.e. are not cached in the VMCS. Which leaves CR3 > as the sole anomaly. Handle CR3 by overwriting vmcs01.GUEST_CR3 with > L1's CR3 during the nested VM-Entry when EPT is disabled *and* nested > early checks are disabled, so that nested_vmx_restore_host_state() will > naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3. > > Note, these shenanigans work because nested_vmx_restore_host_state() > does a full kvm_mmu_reset_context(), i.e. unloads the current MMU, > which guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow > CR3 prior to re-entering L1. Writing vmcs01.GUEST_CR3 is done if and > only if nested early checks are disabled as "late" VM-Fail should never > happen in that case (KVM WARNs), and the conditional write avoids the > need to restore the correct GUEST_CR3 when nested_vmx_check_vmentry_hw() > fails. > > Reported-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > Fixes: bd18bffca353 ("KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail") > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> > --- > arch/x86/kvm/vmx/nested.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c > index 1032f068f0b9..92117092f6e9 100644 > --- a/arch/x86/kvm/vmx/nested.c > +++ b/arch/x86/kvm/vmx/nested.c > @@ -2963,6 +2963,8 @@ int nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry) > if (kvm_mpx_supported() && > !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) > vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS); This hunk needs a comment that says basically the same things that are in the commit message. /* Temporarily overwrite vmcs01.GUEST_CR3 with L1's CR3 during * the nested VM-Entry when EPT is disabled *and* nested early * checks are disabled, because nested_vmx_restore_host_state() * wants to restore the correct vcpu->arch.cr3 from * vmcs01.GUEST_CR3. On nested vmexit, kvm_mmu_reset_context() * will overwrite GUEST_CR3 with the shadow CR3 prior to * re-entering L1. This is not needed when nested early checks * are enabled, because it doesn't reload the MMU until * after the checks have succeeded. */ And also, the restore of the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3 can be moved to this patch otherwise the comment would not be accurate. But, as I said in the review of patch 2, I'm a bit lost as to how kvm_mmu_reset_context() is enough to reload the shadow CR3 into the vmcs01. Paolo > + if (!enable_ept && !nested_early_check) > + vmcs_writel(GUEST_CR3, vcpu->arch.cr3); > > vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02); > > @@ -3794,7 +3796,8 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu) > * VMFail, like everything else we just need to ensure our > * software model is up-to-date. > */ > - ept_save_pdptrs(vcpu); > + if (enable_ept) > + ept_save_pdptrs(vcpu); > > kvm_mmu_reset_context(vcpu); > >