On Tue, Mar 29, 2022, Sean Christopherson wrote:
> +Yosry
>
> On Wed, Mar 23, 2022, Mingwei Zhang wrote:
> > Add extra check to specify the case of nx hugepage and allow KVM to
> > reconstruct large mapping after dirty logging is disabled. Existing code
> > works only for nx hugepage but the condition is too general in that does
> > not consider other usage case (such as dirty logging). Note that
> > when dirty logging is disabled, KVM calls kvm_mmu_zap_collapsible_sptes()
> > which only zaps leaf SPTEs.
>
> This opening is very, very confusing. A big part of the confusion is due to poor
> naming in KVM, where kvm_mmu_page.lpage_disallowed really should be
> nx_huge_page_disallowed. Enabling dirty logging also disables huge pages, but that
> goes through the memslot's disallow_lpage. Reading through this paragraph, and
> looking at the code, it sounds like disallowed_hugepage_adjust() is explicitly
> checking if a page is disallowed due to dirty logging, which is not the case.
>
> I'd prefer the changelog lead with the bug it's fixing and only briefly mention
> dirty logging as a scenario where KVM can end up with shadow pages without child
> SPTEs. Such scenarios can also happen with mmu_notifier calls, etc...
>
> E.g.
>
> Explicitly check if a NX huge page is disallowed when determining if a page
> fault needs to be forced to use a smaller sized page. KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a shadow page instead of a huge page. Any scenario that causes
> KVM to zap leaf SPTEs may result in having a SP that can be made huge
> without violating the NX huge page mitigation. E.g. disabling of dirty
> logging, zapping from mmu_notifier due to page migration, guest MTRR changes
> that affect the viability of a huge page, etc...
>

Thanks for the correction. Yeah, I hit this bug when rebasing the internal
demand paging work. Other than that, the only user that will trigger this
issue is dirty logging, so that would be the motivation.

> > Moreover, existing code assumes that a present PMD or PUD indicates that
> > there exist 'smaller SPTEs' under the paging structure. This assumption may
> > not be true if KVM zaps only leaf SPTEs in the MMU.
> >
> > Missing the check causes KVM to incorrectly regard the faulting page as an
> > NX huge page and refuse to map it at the desired level. And this leads to a
> > bad performance issue in the shadow MMU and potentially in the TDP MMU as
> > well.
> >
> > Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
> > Cc: stable@xxxxxxxxxxxxxxx

I will remove the Cc to stable, since this patch alone may have a race in the
TDP MMU, and will include both patches you attached in the 2nd version.

>
> I vote to keep the Fixes tag, but drop the Cc: stable@ and NAK any MANUALSEL/AUTOSEL
> for backporting this to stable trees. Yes, it's a bug, but the NX huge page zapping
> kthread will (eventually) reclaim the lost performance. On the flip side, if there's
> an edge case we mess up (see below), then we've introduced a far worse bug.
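
To double-check the failure mode on my end, here is a tiny userspace model of
the before/after decision in disallowed_hugepage_adjust(). This is not KVM
code; the struct and function names below are made up purely for illustration,
and the level checks are omitted:

/* Toy model only: none of these names exist in KVM. */
#include <stdbool.h>
#include <stdio.h>

struct toy_spte {
	bool shadow_present;		/* a non-leaf shadow page is installed */
	bool large;			/* the SPTE itself is a huge mapping */
	bool nx_huge_page_disallowed;	/* models sp->lpage_disallowed */
};

/* Current behavior: any present, non-large SPTE forces a smaller page. */
static bool force_small_page_old(struct toy_spte s)
{
	return s.shadow_present && !s.large;
}

/* Patched behavior: only force it when the SP exists for the NX mitigation. */
static bool force_small_page_new(struct toy_spte s)
{
	return s.shadow_present && !s.large && s.nx_huge_page_disallowed;
}

int main(void)
{
	/*
	 * State after dirty logging is disabled: kvm_mmu_zap_collapsible_sptes()
	 * zapped only the leaf SPTEs, so the parent SP is still present and
	 * non-large, but it is not an NX-disallowed page.
	 */
	struct toy_spte after_disable = {
		.shadow_present = true,
		.large = false,
		.nx_huge_page_disallowed = false,
	};

	printf("old code stays at 4K: %d\n", force_small_page_old(after_disable));
	printf("new code allows a huge page: %d\n", !force_small_page_new(after_disable));
	return 0;
}

With the check added, the refault after dirty logging is disabled is free to
install the huge mapping again, which matches the suggested changelog above.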
>
> > Reviewed-by: Ben Gardon <bgardon@xxxxxxxxxx>
> > Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 5628d0ba637e..d9b2001d8217 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2919,6 +2919,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> >  	    cur_level == fault->goal_level &&
> >  	    is_shadow_present_pte(spte) &&
> >  	    !is_large_pte(spte)) {
> > +		struct kvm_mmu_page *sp;
> > +		u64 page_mask;
> > +
> > +		/*
> > +		 * When nx hugepage flag is not set, there is no reason to go
> > +		 * down to another level. This helps KVM re-generate large
> > +		 * mappings after dirty logging disabled.
> > +		 */
>
> Honestly, I'd just omit the comment. Again, IMO the blurb about dirty logging
> does more harm than good because it makes it sound like this is somehow unique to
> dirty logging, which it is not. Lack of a check was really just a simple goof.
>
> > +		sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> > +		if (!sp->lpage_disallowed)
>
> This is unsafe for the TDP MMU. If mmu_lock is held for read, tdp_mmu_link_sp()
> could be in the process of creating the shadow page for a different fault.
> Critically, tdp_mmu_link_sp() sets lpage_disallowed _after_ installing the SPTE.
> Thus this code could see the present shadow page, with the correct old_spte (and
> thus succeed its eventual tdp_mmu_set_spte_atomic()), but with the wrong
> lpage_disallowed.
>
> To fix, tdp_mmu_link_sp() needs to tag the shadow page with lpage_disallowed before
> setting the SPTE, and also needs appropriate memory barriers. It's a bit messy at
> first glance, but actually presents an opportunity to further improve TDP MMU
> performance. tdp_mmu_pages can be turned into a counter, at which point the
> link/unlink paths don't need to acquire the spinlock except for the NX huge page case.
>
> Yosry (Cc'd) is also looking into adding stats for the number of page table pages
> consumed by KVM, turning tdp_mmu_pages into a counter would trivialize that for
> the TDP MMU (the stats for the TDP MMU and old MMU are best kept separated, see
> the details in the attached patch).
>
> Unlike the legacy MMU, the TDP MMU never re-links existing shadow pages. This means
> it doesn't need to worry about the scenario where lpage_disallowed was already set.
> So, setting the flag prior to the SPTE is trivial and doesn't need to be unwound.
>
> Disclaimer: attached patches are lightly tested.

ack.

>
> On top, this patch would need to add barriers, e.g.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5cb845fae56e..0bf85bf3da64 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2896,6 +2896,12 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  	    cur_level == fault->goal_level &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte)) {
> +		/* comment goes here. */
> +		smp_rmb();
> +
> +		if (!sp->lpage_disallowed)
> +			return;
> +
>  		/*
>  		 * A small SPTE exists for this pfn, but FNAME(fetch)
>  		 * and __direct_map would like to create a large PTE
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5ca78a89d8ed..9a0bc19179b0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1207,6 +1207,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		tdp_mmu_init_child_sp(sp, &iter);
>
>  		sp->lpage_disallowed = account_nx;
> +		/* comment goes here. */
> +		smp_wmb();
>
>  		if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
>  			tdp_mmu_free_sp(sp);
>
> From 80d8bbd4a92faabc65cc5047c6b8ff1468cda1fa Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date: Tue, 29 Mar 2022 13:25:34 -0700
> Subject: [PATCH 1/2] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before
>  setting SPTE
>
> Set lpage_disallowed in TDP MMU shadow pages before making the SP visible
> to other readers, i.e. before setting its SPTE. This will allow KVM to
> query lpage_disallowed when determining if a shadow page can be replaced
> by a NX huge page without violating the rules of the mitigation.
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 14 ++++++++++----
>  arch/x86/kvm/mmu/mmu_internal.h |  2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 20 ++++++++++++--------
>  3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1361eb4599b4..5cb845fae56e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -812,14 +812,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  	kvm_mmu_gfn_disallow_lpage(slot, gfn);
>  }
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> -	if (sp->lpage_disallowed)
> -		return;
> -
>  	++kvm->stat.nx_lpage_splits;
>  	list_add_tail(&sp->lpage_disallowed_link,
>  		      &kvm->arch.lpage_disallowed_mmu_pages);
> +}
> +
> +static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	if (sp->lpage_disallowed)
> +		return;
> +
> +	__account_huge_nx_page(kvm, sp);
> +
>  	sp->lpage_disallowed = true;
>  }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..4a0087efa1e3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>
>  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b3b6426725d4..f05423545e6d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1122,16 +1122,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>   * @kvm: kvm instance
>   * @iter: a tdp_iter instance currently on the SPTE that should be set
>   * @sp: The new TDP page table to install.
> - * @account_nx: True if this page table is being installed to split a
> - *              non-executable huge page.
>   * @shared: This operation is running under the MMU lock in read mode.
>   *
>   * Returns: 0 if the new page table was installed. Non-0 if the page table
>   *          could not be installed (e.g. the atomic compare-exchange failed).
>   */
>  static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> -			   struct kvm_mmu_page *sp, bool account_nx,
> -			   bool shared)
> +			   struct kvm_mmu_page *sp, bool shared)
>  {
>  	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
>  	int ret = 0;
> @@ -1146,8 +1143,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>
>  	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>  	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> -	if (account_nx)
> -		account_huge_nx_page(kvm, sp);
>  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
>  	return 0;
> @@ -1160,6 +1155,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
> +	struct kvm *kvm = vcpu->kvm;
>  	struct tdp_iter iter;
>  	struct kvm_mmu_page *sp;
>  	int ret;
> @@ -1210,10 +1206,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		sp = tdp_mmu_alloc_sp(vcpu);
>  		tdp_mmu_init_child_sp(sp, &iter);
>
> -		if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
> +		sp->lpage_disallowed = account_nx;
> +
> +		if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
>  			tdp_mmu_free_sp(sp);
>  			break;
>  		}
> +
> +		if (account_nx) {
> +			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +			__account_huge_nx_page(kvm, sp);
> +			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +		}
>  	}
>  }
>
> @@ -1501,7 +1505,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  	 * correctness standpoint since the translation will be the same either
>  	 * way.
>  	 */
> -	ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
> +	ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
>  	if (ret)
>  		goto out;
>
>
> base-commit: 19164ad08bf668bca4f4bfbaacaa0a47c1b737a6
> --
> 2.35.1.1021.g381101b075-goog
>
> From 18eaf3b86dfd01a5b58d9755ba76fe8ff80702ab Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date: Tue, 29 Mar 2022 14:41:40 -0700
> Subject: [PATCH 2/2] KVM: x86/mmu: Track the number of TDP MMU pages, but not
>  the actual pages
>
> Track the number of TDP MMU "shadow" pages instead of tracking the pages
> themselves. With the NX huge page list manipulation moved out of the
> common linking flow, eliminating the list-based tracking means the happy
> path of adding a shadow page doesn't need to acquire a spinlock and can
> instead inc/dec an atomic.
>
> Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
> pages is very, very useful for detecting KVM bugs.
>
> Tracking the number of pages will also make it trivial to expose the
> counter to userspace as a stat in the future, which may or may not be
> desirable.
>
> Note, the TDP MMU needs to use a separate counter (and stat if that ever
> comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't
> bother supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES
> (because the TDP MMU consumes so few pages relative to shadow paging),
> and including TDP MMU pages in that counter would break both the shrinker
> and shadow MMUs, e.g. if a VM is using nested TDP.
>
> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h | 11 +++--------
>  arch/x86/kvm/mmu/tdp_mmu.c      | 16 ++++++++--------
>  2 files changed, 11 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9694dd5e6ccc..d0dd5ed2e209 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1192,6 +1192,9 @@ struct kvm_arch {
>  	 */
>  	bool tdp_mmu_enabled;
>
> +	/* The number of TDP MMU pages across all roots. */
> +	atomic64_t tdp_mmu_pages;
> +
>  	/*
>  	 * List of struct kvm_mmu_pages being used as roots.
>  	 * All struct kvm_mmu_pages in the list should have
> @@ -1212,18 +1215,10 @@ struct kvm_arch {
>  	 */
>  	struct list_head tdp_mmu_roots;
>
> -	/*
> -	 * List of struct kvmp_mmu_pages not being used as roots.
> -	 * All struct kvm_mmu_pages in the list should have
> -	 * tdp_mmu_page set and a tdp_mmu_root_count of 0.
> -	 */
> -	struct list_head tdp_mmu_pages;
> -
>  	/*
>  	 * Protects accesses to the following fields when the MMU lock
>  	 * is held in read mode:
>  	 *  - tdp_mmu_roots (above)
> -	 *  - tdp_mmu_pages (above)
>  	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
>  	 *  - lpage_disallowed_mmu_pages
>  	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f05423545e6d..5ca78a89d8ed 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -24,7 +24,6 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
>  	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
> -	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
>  	kvm->arch.tdp_mmu_zap_wq =
>  		alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
>
> @@ -51,7 +50,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
>  	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
>
> -	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
> +	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
>  	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
>
>  	/*
> @@ -381,14 +380,17 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
>  static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
>  			      bool shared)
>  {
> +	atomic64_dec(&kvm->arch.tdp_mmu_pages);
> +
> +	if (!sp->lpage_disallowed)
> +		return;
> +
>  	if (shared)
>  		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>  	else
>  		lockdep_assert_held_write(&kvm->mmu_lock);
>
> -	list_del(&sp->link);
> -	if (sp->lpage_disallowed)
> -		unaccount_huge_nx_page(kvm, sp);
> +	unaccount_huge_nx_page(kvm, sp);
>
>  	if (shared)
>  		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1141,9 +1143,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  		tdp_mmu_set_spte(kvm, iter, spte);
>  	}
>
> -	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> -	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> -	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +	atomic64_inc(&kvm->arch.tdp_mmu_pages);
>
>  	return 0;
>  }
> --
> 2.35.1.1021.g381101b075-goog
>
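
Also, just to confirm I have the ordering requirement right before sending v2:
the writer must publish lpage_disallowed before installing the SPTE, and the
reader may only look at lpage_disallowed after it has observed the present
SPTE. Below is a minimal userspace sketch of that pairing, with C11 fences
standing in for smp_wmb()/smp_rmb(); the variable and function names are made
up and are not the real KVM fields:

/* Sketch only; compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool nx_disallowed;	/* models sp->lpage_disallowed */
static atomic_bool spte_present;	/* models the SPTE becoming visible */

/* Models the map path: publish the flag before the SPTE. */
static void *link_sp(void *arg)
{
	(void)arg;
	atomic_store_explicit(&nx_disallowed, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* stands in for smp_wmb() */
	atomic_store_explicit(&spte_present, true, memory_order_relaxed);
	return NULL;
}

/* Models the fault path: read the flag only after seeing the present SPTE. */
static void *page_fault(void *arg)
{
	(void)arg;
	if (atomic_load_explicit(&spte_present, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_acquire);	/* stands in for smp_rmb() */
		printf("saw present SPTE, nx_disallowed=%d\n",
		       atomic_load_explicit(&nx_disallowed, memory_order_relaxed));
	} else {
		printf("SPTE not yet present, nothing to adjust\n");
	}
	return NULL;
}

int main(void)
{
	pthread_t writer, reader;

	pthread_create(&writer, NULL, link_sp, NULL);
	pthread_create(&reader, NULL, page_fault, NULL);
	pthread_join(writer, NULL);
	pthread_join(reader, NULL);
	return 0;
}

If that matches your intent, I will fold the barriers and the two attached
patches into the next version.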