On Thu, Jan 06, 2022, Lai Jiangshan wrote:
> 
> On 2022/1/6 00:45, Sean Christopherson wrote:
> > On Wed, Jan 05, 2022, Lai Jiangshan wrote:
> > > On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > 
> > > > > default_pae_pdpte is needed because the cpu expect PAE pdptes are
> > > > > present when VMenter.
> > > > 
> > > > That's incorrect. Neither Intel nor AMD require PDPTEs to be present; not present
> > > > is perfectly ok, present with reserved bits is what's not allowed.
> > > > 
> > > > Intel SDM:
> > > >   A VM entry that checks the validity of the PDPTEs uses the same checks that are
> > > >   used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7]. If MOV to CR3
> > > >   would cause a general-protection exception due to the PDPTEs that would be loaded
> > > >   (e.g., because a reserved bit is set), the VM entry fails.
> > > > 
> > > >   7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
> > > >      (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
> > > 
> > > But in practice, the VM entry fails if the present bit is not set in the
> > > PDPTE for the linear address being accessed (when EPT enabled at least). The
> > > host kvm complains and dumps the vmcs state.
> > 
> > That doesn't make any sense. If EPT is enabled, KVM should never use a pae_root.
> > The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
> > shadow page tables.
> 
> Oh, I wrote the negative of what I wanted to say again, as happens when I try to
> emphasize something after writing a sentence and modifying it several times.
> 
> I wanted to mean "EPT not enabled" when vmx.

Heh, that makes a lot more sense.

> The VM entry fails when the guest is in a very early stage of booting, which
> might be still in real mode.
> 
> VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
>         reason=80000021 qualification=0000000000000002

Yep, that's the signature for an illegal PDPTE at VM-Enter. But as noted above,
a not-present PDPTE is perfectly legal; VM-Enter should fail if and only if a
PDPTE is present and has reserved bits set.

> IDTVectoring: info=00000000 errorcode=00000000
> 
> > 
> > And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
> > shadow page tables for not-present PDPTEs.
> > 
> > If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
> > debug.
> 
> If you can try this patchset, you can just set the default_pae_pdpte to 0 to test
> it.

I can't reproduce the failure with this on top of your series + kvm/queue (commit
cc0e35f9c2d4 ("KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled")).

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f6f7caf76b70..b7170a840330 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -728,22 +728,11 @@ static u64 default_pae_pdpte;
 
 static void free_default_pae_pdpte(void)
 {
-	free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
 	default_pae_pdpte = 0;
 }
 
 static int alloc_default_pae_pdpte(void)
 {
-	unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
-
-	if (!p)
-		return -ENOMEM;
-	default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
-	if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
-		    is_mmio_spte(default_pae_pdpte))) {
-		free_default_pae_pdpte();
-		return -EINVAL;
-	}
 	return 0;
 }
 

Are you using a different base and/or running with other changes?
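As an aside, the SDM rule quoted above reduces to a check along these lines
(purely an illustrative sketch, not KVM code; the reserved-bit mask is my
reading of the SDM's PAE PDPTE table, i.e. bits 2:1 and 8:5 reserved, bits
11:9 ignored, everything at or above MAXPHYADDR reserved):

/* Illustrative sketch of the architectural PDPTE legality check. */
static bool pae_pdpte_is_legal(u64 pdpte, int maxphyaddr)
{
	/*
	 * Assumed reserved bits for a PAE PDPTE: bits 2:1, bits 8:5, and
	 * all bits at or above MAXPHYADDR (bits 11:9 are ignored).
	 */
	u64 rsvd = GENMASK_ULL(2, 1) | GENMASK_ULL(8, 5) |
		   GENMASK_ULL(63, maxphyaddr);

	/* Not-present is always legal; bits 63:1 are ignored. */
	if (!(pdpte & BIT_ULL(0)))
		return true;

	/* A present PDPTE must not have reserved bits set. */
	return !(pdpte & rsvd);
}

The key point is the early return: a clear present bit short-circuits the
reserved-bit check entirely, so '0' can never be an illegal PDPTE.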
To aid debug, the below patch will dump the PDPTEs from the current MMU root on
failure (I'll also submit this as a formal patch). On failure, I would expect
that at least one of the PDPTEs will be present with reserved bits set.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe06b02994e6..c13f37ef1bbc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5773,11 +5773,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	pr_err("CR4: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n",
 	       cr4, vmcs_readl(CR4_READ_SHADOW), vmcs_readl(CR4_GUEST_HOST_MASK));
 	pr_err("CR3 = 0x%016lx\n", vmcs_readl(GUEST_CR3));
-	if (cpu_has_vmx_ept()) {
+	if (enable_ept) {
 		pr_err("PDPTR0 = 0x%016llx  PDPTR1 = 0x%016llx\n",
 		       vmcs_read64(GUEST_PDPTR0), vmcs_read64(GUEST_PDPTR1));
 		pr_err("PDPTR2 = 0x%016llx  PDPTR3 = 0x%016llx\n",
 		       vmcs_read64(GUEST_PDPTR2), vmcs_read64(GUEST_PDPTR3));
+	} else if (vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL &&
+		   VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
+		u64 *pdpte = __va(vcpu->arch.mmu->root_hpa);
+
+		pr_err("PDPTE0 = 0x%016llx  PDPTE1 = 0x%016llx\n",
+		       pdpte[0], pdpte[1]);
+		pr_err("PDPTE2 = 0x%016llx  PDPTE3 = 0x%016llx\n",
+		       pdpte[2], pdpte[3]);
 	}
 	pr_err("RSP = 0x%016lx  RIP = 0x%016lx\n",
 	       vmcs_readl(GUEST_RSP), vmcs_readl(GUEST_RIP));

> If you can't try this patchset, the mmu->pae_root can be modified to test it
> instead.
> 
> I guess the vmx fails to translate %rip when VMentry in this case.

No, the CPU doesn't translate RIP at VM-Enter; vmcs.GUEST_RIP is only checked
for legality, e.g. that it's canonical. Translating RIP through page tables is
firmly a post-VM-Enter code fetch action.
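Back to your suggestion of modifying mmu->pae_root: a debug hack along these
lines is what I'd try (hypothetical sketch only; clear_one_shadow_pdpte() is a
made-up helper, not an existing KVM function, and it assumes EPT is disabled so
the vCPU is using a 3-level shadow root):

/*
 * Hypothetical debug hack, not a real KVM patch: force one not-present
 * PDPTE into the CPU-visible shadow PAE root to check whether a clear
 * present bit alone really causes VM-Fail at the next VM-Enter.
 */
static void clear_one_shadow_pdpte(struct kvm_vcpu *vcpu, int idx)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;

	/* Only meaningful for a 3-level (PAE) shadow root. */
	if (mmu->shadow_root_level != PT32E_ROOT_LEVEL ||
	    !VALID_PAGE(mmu->root_hpa) || idx < 0 || idx > 3)
		return;

	/* pae_root holds the four PDPTEs the CPU actually walks. */
	mmu->pae_root[idx] = 0;
}

If a not-present PDPTE alone were enough to trigger VM-Fail, clearing any one
entry this way should reproduce the reason=80000021 exit above; per the SDM
excerpt, it shouldn't.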