On Sat, Apr 09, 2022, Sean Christopherson wrote:
> Account and track NX huge pages for nonpaging MMUs so that a future
> enhancement to precisely check if a shadow page cannot be replaced by an
> NX huge page doesn't get false positives.  Without correct tracking, KVM
> can get stuck in a loop if an instruction is fetching and writing data on
> the same huge page, e.g. KVM installs a small executable page on the
> fetch fault, replaces it with an NX huge page on the write fault, and
> faults again on the fetch.
>
> Alternatively, and perhaps ideally, KVM would simply not enforce the
> workaround for nonpaging MMUs.  The guest has no page tables to abuse
> and KVM is guaranteed to switch to a different MMU on CR0.PG being
> toggled, so there are no security or performance concerns.  But getting
> make_spte() to play nice now and in the future is unnecessarily complex.
> In the current code base, make_spte() can enforce the mitigation if TDP
> is enabled or the MMU is indirect, but other in-flight patches aim to
> drop the @vcpu param[*].  Without a @vcpu, KVM could either pass in the
> correct information and/or derive it from the shadow page, but the former
> is ugly and the latter subtly non-trivial due to the possibility of
> direct shadow pages in indirect MMUs.  Given that using shadow paging
> with an unpaged guest is far from top priority in terms of performance,
> _and_ has been subjected to the workaround since its inception, keep it
> simple and just fix the accounting glitch.
>
> [*] https://lore.kernel.org/all/20220321224358.1305530-5-bgardon@xxxxxxxxxx
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu.h      |  9 +++++++++
>  arch/x86/kvm/mmu/mmu.c  |  2 +-
>  arch/x86/kvm/mmu/spte.c | 11 +++++++++++
>  3 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 671cfeccf04e..89df062d5921 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -191,6 +191,15 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		.user = err & PFERR_USER_MASK,
>  		.prefetch = prefetch,
>  		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
> +
> +		/*
> +		 * Note, enforcing the NX huge page mitigation for nonpaging
> +		 * MMUs (shadow paging, CR0.PG=0 in the guest) is completely
> +		 * unnecessary.  The guest doesn't have any page tables to
> +		 * abuse and is guaranteed to switch to a different MMU when
> +		 * CR0.PG is toggled on (may not always be guaranteed when KVM
> +		 * is using TDP).  See make_spte() for details.
> +		 */
>  		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),

Hmm, I think there may be a minor issue here (even in the original code).
nx_huge_page_workaround_enabled is snapshotted here as part of the page
fault, but make_spte() calls is_nx_huge_page_enabled() again.  Since that
helper reads the module parameter directly, there is a potential race:
the workaround could be 'true' at the time of the page fault, yet by the
time we reach make_spte() the parameter may have been flipped to 'false'.

I have not figured out what the actual side effect would be, but I feel
like make_spte() should just consume the information in kvm_page_fault
instead of querying the global config directly.
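Something along these lines, completely untested and hand-waving the
plumbing (make_spte() doesn't take a kvm_page_fault today, and the
in-flight series dropping @vcpu would obviously change the details), but
the idea would be to key the check off the value captured at fault time:

	/*
	 * Untested sketch: assumes the fault (or at least the snapshotted
	 * bool) can be passed into make_spte(); the exact plumbing is
	 * hand-waved here.
	 */
	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
	    fault->nx_huge_page_workaround_enabled) {
		pte_access &= ~ACC_EXEC_MASK;
	}

That would also keep the __direct_map() accounting and the make_spte()
enforcement agreeing on the same snapshot for a given fault.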
>
>  		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d230d2d78ace..9416445afa3e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2954,7 +2954,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  					      it.level - 1, true, ACC_ALL);
>
>  			link_shadow_page(vcpu, it.sptep, sp);
> -			if (fault->is_tdp && fault->huge_page_disallowed)
> +			if (fault->huge_page_disallowed)
>  				account_nx_huge_page(vcpu->kvm, sp,
>  						     fault->req_level >= it.level);
>  		}
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 4739b53c9734..14ad821cb0c7 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -115,6 +115,17 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	if (!prefetch)
>  		spte |= spte_shadow_accessed_mask(spte);
>
> +	/*
> +	 * For simplicity, enforce the NX huge page mitigation even if not
> +	 * strictly necessary.  KVM could ignore the mitigation if paging is
> +	 * disabled in the guest, but KVM would then have to ensure a new MMU
> +	 * is loaded (or all shadow pages zapped) when CR0.PG is toggled on,
> +	 * and that's a net negative for performance when TDP is enabled.  KVM
> +	 * could ignore the mitigation if TDP is disabled and CR0.PG=0, as KVM
> +	 * will always switch to a new MMU if paging is enabled in the guest,
> +	 * but that adds complexity just to optimize a mode that is anything
> +	 * but performance critical.
> +	 */
>  	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
>  	    is_nx_huge_page_enabled()) {
>  		pte_access &= ~ACC_EXEC_MASK;
> --
> 2.35.1.1178.g4f1659d476-goog
>