I may be completely off track, but I am wondering whether uses of the
GUEST_PHYSICAL_ADDRESS field from the VMCS are wrong when operating
under nested virtualization.  My worry is that during L2->L0 vmexits
the field holds a nested GPA, while KVM expects an L1 GPA.  In
particular:

- if PML is active, the value written to the L0 page modification log
  is a nested GPA instead of an L1 GPA.  This would break nested dirty
  tracking when PML is active.

- handle_ept_misconfig would be completely broken if a guest exposes
  an I/O range to its nested guest.

I'm sure the Google guys have figured this out already and are just
testing how dumb this maintainer is. ;)  In any case, a lightly-tested
patch for the second issue follows my sig.

Paolo

---------------- 8< -------------

From 2ffb87523d8447b7505127e497bb20d1ec61a5f3 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Date: Fri, 11 Aug 2017 17:45:55 +0200
Subject: [PATCH] KVM: x86: fix use of L1 MMIO areas in nested guests

There is currently some confusion between nested and L1 GPAs.  The
assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
it is not enough.  What this patch does is fence off the MMIO cache
completely when using shadow nested page tables, since we have neither
a GVA nor an L1 GPA to put in the cache.  This also allows some
simplifications in kvm_mmu_page_fault and FNAME(page_fault).

The EPT misconfig handler has two changes.  First, it does not have an
L1 GPA to pass to kvm_io_bus_write, so the fast MMIO path must be
skipped in guest mode.  Second, calling handle_mmio_page_fault() has
been unnecessary since commit e9ee956e311d ("KVM: x86: MMU: Move
handle_mmio_page_fault() call to kvm_mmu_page_fault()", 2016-02-22).

Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
---
 arch/x86/kvm/mmu.c         | 10 +++++++++-
 arch/x86/kvm/paging_tmpl.h |  3 +--
 arch/x86/kvm/vmx.c         | 27 ++++++++++++---------------
 arch/x86/kvm/x86.h         |  6 +++++-
 4 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5e3c0f09eaa8..bedeeb633f80 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3598,6 +3598,14 @@ static bool is_shadow_zero_bits_set(struct kvm_mmu *mmu, u64 spte, int level)
 
 static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 {
+	/*
+	 * A nested guest cannot use the MMIO cache if it is using nested
+	 * page tables, because cr2 is a nGPA while the cache stores L1's
+	 * physical addresses.
+	 */
+	if (mmu_is_nested(vcpu))
+		return false;
+
 	if (direct)
 		return vcpu_match_mmio_gpa(vcpu, addr);
 
@@ -4835,7 +4843,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 {
 	int r, emulation_type = EMULTYPE_RETRY;
 	enum emulation_result er;
-	bool direct = vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu);
+	bool direct = vcpu->arch.mmu.direct_map;
 
 	if (unlikely(error_code & PFERR_RSVD_MASK)) {
 		r = handle_mmio_page_fault(vcpu, cr2, direct);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3bb90ceeb52d..86b68dc5a649 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -790,8 +790,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 			 &map_writable))
 		return 0;
 
-	if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
-				walker.gfn, pfn, walker.pte_access, &r))
+	if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
 		return r;
 
 	/*
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 283e94de3853..d9e9bf96edb3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6405,24 +6405,21 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 	int ret;
 	gpa_t gpa;
 
+	/*
+	 * A nested guest cannot optimize MMIO vmexits, because we have an
+	 * nGPA here instead of the required GPA.
+	 */
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
-		trace_kvm_fast_mmio(gpa);
-		return kvm_skip_emulated_instruction(vcpu);
+	if (!is_guest_mode(vcpu)) {
+		if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+			trace_kvm_fast_mmio(gpa);
+			return kvm_skip_emulated_instruction(vcpu);
+		}
 	}
 
-	ret = handle_mmio_page_fault(vcpu, gpa, true);
-	vcpu->arch.gpa_available = !is_guest_mode(vcpu);
-	vcpu->arch.gpa_val = gpa;
-	if (likely(ret == RET_MMIO_PF_EMULATE))
-		return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
-					      EMULATE_DONE;
-
-	if (unlikely(ret == RET_MMIO_PF_INVALID))
-		return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);
-
-	if (unlikely(ret == RET_MMIO_PF_RETRY))
-		return 1;
+	ret = kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
+	if (ret >= 0)
+		return ret;
 
 	/* It is the real ept misconfig */
 	WARN_ON(1);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 612067074905..2383d2ce0a84 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -90,7 +90,11 @@ static inline u32 bit(int bitno)
 static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
 					gva_t gva, gfn_t gfn, unsigned access)
 {
-	vcpu->arch.mmio_gva = gva & PAGE_MASK;
+	/*
+	 * If this is a shadow nested page table, the "GVA" is
+	 * actually a nested GPA.
+	 */
+	vcpu->arch.mmio_gva = mmu_is_nested(vcpu) ? 0 : gva & PAGE_MASK;
 	vcpu->arch.access = access;
 	vcpu->arch.mmio_gfn = gfn;
 	vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
-- 
1.8.3.1
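
P.S. For the first issue, the place that would need fixing is
vmx_flush_pml_buffer: entries that the CPU logged while L2 was running
are nGPAs, so marking them dirty as-is would dirty the wrong L1 pages.
A rough, untested sketch of what the flush loop would need follows;
translate_l2_gpa() is a made-up placeholder for the nGPA->GPA walk
through the nested EPT tables:

	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
		u64 gpa = pml_buf[pml_idx];

		WARN_ON(gpa & (PAGE_SIZE - 1));
		/* While L2 runs, PML logs nGPAs rather than L1 GPAs. */
		if (is_guest_mode(vcpu))
			gpa = translate_l2_gpa(vcpu, gpa);	/* hypothetical */
		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
	}

The alternative would be to run L2 with PML disabled in vmcs02 and fall
back to write protection for dirty tracking, at the cost of extra
vmexits; I haven't thought through which is better.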