On Mon, Feb 19, 2024 at 11:44:54AM -0800, Sean Christopherson wrote:
> +Jim
>
> On Mon, Feb 19, 2024, Yan Zhao wrote:
> > On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> > > @@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > >  	smp_rmb();
> > >
> > > +	if (!slot)
> > > +		goto faultin_pfn;
> > > +
> > > +	/*
> > > +	 * Retry the page fault if the gfn hit a memslot that is being deleted
> > > +	 * or moved.  This ensures any existing SPTEs for the old memslot will
> > > +	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> > > +	 */
> > > +	if (slot->flags & KVM_MEMSLOT_INVALID)
> > > +		return RET_PF_RETRY;
> > > +
> > > +	if (!kvm_is_visible_memslot(slot)) {
> > > +		/* Don't expose KVM's internal memslots to L2. */
> > > +		if (is_guest_mode(vcpu)) {
> > > +			fault->slot = NULL;
> > > +			fault->pfn = KVM_PFN_NOSLOT;
> > > +			fault->map_writable = false;
> > > +			return RET_PF_CONTINUE;
> > Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?
>
> Oof.  Yes.  But there is a pre-existing bug here too, though it's very
> theoretical and unlikely to ever cause problems.
>
> If KVM is using TDP, but L1 is using shadow paging for L2, then routing
> through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO,
> and create an MMIO SPTE.  Creating an MMIO SPTE is ok, but only because
> kvm_mmu_page_role.guest_mode ensures KVM uses different roots for L1 vs. L2.
> But mmio_gfn will remain valid, and could (quite theoretically) cause KVM to
> incorrectly treat an L1 access to the private TSS or identity mapped page
> tables as MMIO.

Why would KVM treat an L1 access to the private TSS or identity mapped page
tables as MMIO even though mmio_gfn is valid?  It looks like (on Intel
platforms) the EPT for L1 will only install normal SPTEs (non-MMIO SPTEs) for
the two private memslots, so there would be no EPT misconfiguration and KVM
would not incorrectly go down the emulation path.  Am I missing something?

> Furthermore, this check doesn't actually prevent exposing KVM's internal
> memslots to L2, it simply forces KVM to emulate the access.  In most cases,
> that will trigger MMIO, amusingly due to filling arch.mmio_gfn.  And
> vcpu_is_mmio_gpa() always treats APIC accesses as MMIO, so those are fine.
> But the private TSS and identity mapped page tables could go either way
> (MMIO or access the private memslot's backing memory).

Yes, this was also my question when reading that part of the code.  An
mmio_gen mismatch may lead to accessing the backing memory directly.
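
To illustrate, the MMIO info cache being discussed works roughly like the
sketch below.  This is a simplified rendition of vcpu_cache_mmio_info() and
vcpu_match_mmio_gpa() from arch/x86/kvm/x86.h as I read them, not the exact
upstream code:

	/*
	 * On an emulated MMIO access, cache the gfn together with the
	 * memslot generation at the time of the access.
	 */
	static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
						gva_t gva, gfn_t gfn,
						unsigned int access)
	{
		vcpu->arch.mmio_gva = gva & PAGE_MASK;
		vcpu->arch.mmio_access = access;
		vcpu->arch.mmio_gfn = gfn;
		vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
	}

	/*
	 * A later fault on the same gpa is treated as MMIO without
	 * re-walking the memslots, but only if the cached generation
	 * still matches; a stale mmio_gen fails this fast path and the
	 * access falls back to the slow path (and thus may hit the
	 * memslot's backing memory instead).
	 */
	static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
	{
		return vcpu->arch.mmio_gen == kvm_memslots(vcpu->kvm)->generation &&
		       vcpu->arch.mmio_gfn &&
		       vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT;
	}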
> We could "fix" the issue by forcing MMIO emulation for L2 access to all
> internal memslots, not just to the APIC.  But I think that's actually less
> correct than letting L2 access the private TSS and identity mapped page
> tables (not to mention that I can't imagine anyone cares what KVM does in
> this case).  From L1's perspective, there is R/W memory at those memslots,
> the memory just happens to be initialized with non-zero data, and I don't
> see a good argument for hiding that memory from L2.  Making the memory
> disappear is far more magical than the memory existing in the first place.
>
> The APIC access page is special because KVM _must_ emulate the access to do
> the right thing.  And despite what commit 3a2936dedd20 ("kvm: mmu: Don't
> expose private memslots to L2") said, it's not just when L1 is accelerating
> L2's virtual APIC, it's just as important (likely *more* important) for
> correctness when L1 is passing through its own APIC to L2.
>
> Unless I'm missing something, I think it makes sense to throw in the below
> before moving code around.
>
> Ouch, and looking at this further, patch 1 introduced a bug (technically) by
> caching fault->slot; in this case KVM will unnecessarily check
> mmu_notifiers.  That's obviously a very benign bug, as a false positive just
> means an unnecessary retry, but yikes.

Patches 3 & 4 removed the bug immediately :)

> --
> Subject: [PATCH] KVM: x86/mmu: Don't force emulation of L2 accesses to
>  non-APIC internal slots
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 488f522f09c6..4ce824cec5b9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4341,8 +4341,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
>  		return RET_PF_RETRY;
>
> -	if (!kvm_is_visible_memslot(slot)) {
> -		/* Don't expose private memslots to L2. */
> +	if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> +		/*
> +		 * Don't map L1's APIC access page into L2, KVM doesn't support
> +		 * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
> +		 * i.e. the access needs to be emulated.  Emulating access to
> +		 * L1's APIC is also correct if L1 is accelerating L2's own
> +		 * virtual APIC, but for some reason L1 also maps _L1's_ APIC
> +		 * into L2.  Note, vcpu_is_mmio_gpa() always treats access to
> +		 * the APIC as MMIO.  Allow an MMIO SPTE to be created, as KVM
> +		 * uses different roots for L1 vs. L2, i.e. there is no danger
> +		 * of breaking APICv/AVIC for L1.
> +		 */
>  		if (is_guest_mode(vcpu)) {
>  			fault->slot = NULL;
>  			fault->pfn = KVM_PFN_NOSLOT;

Checking fault->is_private before calling kvm_handle_noslot_fault()?
And do we need a centralized check of fault->is_private in
kvm_mmu_do_page_fault() before returning RET_PF_EMULATE?

> @@ -4355,8 +4365,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  		 * MMIO SPTE.  That way the cache doesn't need to be purged
>  		 * when the AVIC is re-enabled.
>  		 */
> -		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> -		    !kvm_apicv_activated(vcpu->kvm))
> +		if (!kvm_apicv_activated(vcpu->kvm))
>  			return RET_PF_EMULATE;

Otherwise, a check of fault->is_private is also needed here?
Maybe also where RET_PF_EMULATE is returned when page_fault_handle_page_track()
returns true (though I know it's always false for TDX).

> 	}
>
>
> base-commit: ec98c2c1a07fb341ba2230eab9a31065d12d9de6
> --
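
For reference, the fault->is_private check I have in mind is something like
the sketch below, i.e. punt private faults out to userspace before they can
reach kvm_handle_noslot_fault() or return RET_PF_EMULATE.  This is only a
sketch against this patch: it assumes kvm_mmu_prepare_memory_fault_exit()
(the mmu_internal.h helper that fills a KVM_EXIT_MEMORY_FAULT exit from a
struct kvm_page_fault) is usable at these call sites, and the exact placement
is of course up for discussion:

	if (fault->is_private) {
		/*
		 * A private fault can never be resolved by emulating the
		 * access or by installing an MMIO SPTE, so report the
		 * fault to userspace instead.
		 */
		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
		return -EFAULT;
	}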