On Thu, Aug 12, 2021 at 11:18 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> Add yet another spinlock for the TDP MMU and take it when marking indirect
> shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
> nested TDP, KVM may encounter shadow pages for the TDP entries managed by
> L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
> is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
> misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
> which runs with mmu_lock held for read, not write.
>
> Lack of a critical section manifests most visibly as an underflow of
> unsync_children in clear_unsync_child_bit() due to unsync_children being
> corrupted when multiple CPUs write it without a critical section and
> without atomic operations. But underflow is the best case scenario. The
> worst case scenario is that unsync_children prematurely hits '0' and
> leads to guest memory corruption due to KVM neglecting to properly sync
> shadow pages.
>
> Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
> would functionally be ok. Usurping the lock could degrade performance when
> building upper level page tables on different vCPUs, especially since the
> unsync flow could hold the lock for a comparatively long time depending on
> the number of indirect shadow pages and the depth of the paging tree.
>
> For simplicity, take the lock for all MMUs, even though KVM could fairly
> easily know that mmu_lock is held for write. If mmu_lock is held for
> write, there cannot be contention for the inner spinlock, and marking
> shadow pages unsync across multiple vCPUs will be slow enough that
> bouncing the kvm_arch cacheline should be in the noise.
>
> Note, even though L2 could theoretically be given access to its own EPT
> entries, a nested MMU must hold mmu_lock for write and thus cannot race
> against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
> be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
> that is running with the TDP MMU enabled. Holding mmu_lock for read also
> prevents the indirect shadow page from being freed. But as above, keep
> it simple and always take the lock.
>
> Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
> effectively disable unsync behavior for nested TDP. Write protecting leaf
> shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
> VMMs typically don't modify TDP entries, but the same may not hold true for
> non-standard use cases and/or VMMs that are migrating physical pages (from
> L1's perspective).
>
> Alternative #2, the unsync logic could be made thread safe. In theory,
> simply converting all relevant kvm_mmu_page fields to atomics and using
> atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
> would be required, (b) the code churn would be substantial, and (c) legacy
> shadow paging would incur additional atomic operations in performance
> sensitive paths for no benefit (to legacy shadow paging).
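
For anyone skimming, a rough sketch of what Alternative #2 would entail,
i.e. why the churn would be non-trivial. The demo_* names below are made
up, not the actual kvm_mmu_page layout; this is purely to illustrate the
atomic conversion of the counter and bitmap:

	struct demo_shadow_page {
		atomic_t unsync_children;		/* was a plain unsigned int */
		DECLARE_BITMAP(unsync_child_bitmap, 512);
	};

	static void demo_clear_unsync_child_bit(struct demo_shadow_page *sp, int idx)
	{
		/*
		 * Atomic RMW on both the bitmap and the counter so that
		 * concurrent clears can't lose a decrement or underflow
		 * the count.
		 */
		if (test_and_clear_bit(idx, sp->unsync_child_bitmap))
			atomic_dec(&sp->unsync_children);
	}

...and every other reader/writer of those fields would need the same audit,
with legacy shadow paging paying for the extra LOCKed ops despite never
racing.
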
>
> Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Ben Gardon <bgardon@xxxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  Documentation/virt/kvm/locking.rst |  8 ++++----
>  arch/x86/include/asm/kvm_host.h    |  7 +++++++
>  arch/x86/kvm/mmu/mmu.c             | 28 ++++++++++++++++++++++++++++
>  3 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
> index 8138201efb09..5d27da356836 100644
> --- a/Documentation/virt/kvm/locking.rst
> +++ b/Documentation/virt/kvm/locking.rst
> @@ -31,10 +31,10 @@ On x86:
>
>  - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
>
> -- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock is
> -  taken inside kvm->arch.mmu_lock, and cannot be taken without already
> -  holding kvm->arch.mmu_lock (typically with ``read_lock``, otherwise
> -  there's no need to take kvm->arch.tdp_mmu_pages_lock at all).
> +- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock and
> +  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
> +  cannot be taken without already holding kvm->arch.mmu_lock (typically with
> +  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
>
>  Everything else is a leaf: no other lock is taken inside the critical
>  sections.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 20daaf67a5bf..cf32b87b6bd3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1036,6 +1036,13 @@ struct kvm_arch {
>  	struct list_head lpage_disallowed_mmu_pages;
>  	struct kvm_page_track_notifier_node mmu_sp_tracker;
>  	struct kvm_page_track_notifier_head track_notifier_head;
> +	/*
> +	 * Protects marking pages unsync during page faults, as TDP MMU page
> +	 * faults only take mmu_lock for read.  For simplicity, the unsync
> +	 * pages lock is always taken when marking pages unsync regardless of
> +	 * whether mmu_lock is held for read or write.
> +	 */
> +	spinlock_t mmu_unsync_pages_lock;
>
>  	struct list_head assigned_dev_head;
>  	struct iommu_domain *iommu_domain;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a272ccbddfa1..cef526dac730 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2596,6 +2596,7 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
>  {
>  	struct kvm_mmu_page *sp;
> +	bool locked = false;
>
>  	/*
>  	 * Force write-protection if the page is being tracked.  Note, the page

It might also be worth adding a note about how it's safe to use
for_each_gfn_indirect_valid_sp under the MMU read lock because
mmu_page_hash is only modified with the lock held for write.

> @@ -2618,9 +2619,34 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
>  		if (sp->unsync)
>  			continue;
>
> +		/*
> +		 * TDP MMU page faults require an additional spinlock as they
> +		 * run with mmu_lock held for read, not write, and the unsync
> +		 * logic is not thread safe.  Take the spinlock regardless of
> +		 * the MMU type to avoid extra conditionals/parameters, there's
> +		 * no meaningful penalty if mmu_lock is held for write.
> +		 */
> +		if (!locked) {
> +			locked = true;
> +			spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
> +
> +			/*
> +			 * Recheck after taking the spinlock, a different vCPU
> +			 * may have since marked the page unsync.  A false
> +			 * positive on the unprotected check above is not
> +			 * possible as clearing sp->unsync _must_ hold mmu_lock
> +			 * for write, i.e. unsync cannot transition from 0->1
> +			 * while this CPU holds mmu_lock for read (or write).
> +			 */
> +			if (READ_ONCE(sp->unsync))
> +				continue;
> +		}
> +
>  		WARN_ON(sp->role.level != PG_LEVEL_4K);
>  		kvm_unsync_page(vcpu, sp);
>  	}
> +	if (locked)
> +		spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
>
>  	/*
>  	 * We need to ensure that the marking of unsync pages is visible
> @@ -5604,6 +5630,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>  {
>  	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
>
> +	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
> +
>  	if (!kvm_mmu_init_tdp_mmu(kvm))
>  		/*
>  		 * No smp_load/store wrappers needed here as we are in
> --
> 2.33.0.rc1.237.g0d66db33f3-goog
>
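
For readers following the locking.rst hunk: the nesting it documents boils
down to roughly the below on the TDP MMU page fault path.
demo_tdp_mmu_fault_path() is a made-up name, purely to illustrate the lock
ordering; the real flow is the mmu_try_to_unsync_pages() hunk above:

	static void demo_tdp_mmu_fault_path(struct kvm *kvm)
	{
		read_lock(&kvm->mmu_lock);	/* outer: rwlock, read side */

		spin_lock(&kvm->arch.mmu_unsync_pages_lock);	/* inner: leaf spinlock */
		/* ... mark indirect shadow pages unsync ... */
		spin_unlock(&kvm->arch.mmu_unsync_pages_lock);

		read_unlock(&kvm->mmu_lock);
	}

i.e. the new lock is a leaf and is never taken without mmu_lock already
held, whether for read or write.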