On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> > > and losing the batching of invalidations makes me nervous.  As Bibo points out,
> > > the behavior will vary based on the workload, VM configuration, etc.
> > >
> > > There's also a *very* subtle change, in that the notification will be sent while
> > > holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> > > I'm not exactly 100% confident on that.  And the only reason there isn't a more
> > MMU lock is a rwlock, which is a variant of spinlock.
> > So, I think taking it within PMD/PTE lock is ok.
> > Actually we have already done that during the .change_pte() notification, where
> >
> > kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write,
> > while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers
>
> .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
> to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
> doesn't use a sleepable lock.
.numa_protect() in patch 4 is also sent when it's not allowed to fail, and there's no
sleepable lock in KVM's handler :)

> As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
> because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
> events.  For NUMA balancing, canceling the protection change might be ok, but for
> everything else, failure is not an option.  So unfortunately, my idea won't work
> as-is.
>
> We might get away with deferring just the change_prot_numa() case, but I don't
> think that's worth the mess/complexity.
Yes, I also think deferring mmu_notifier_invalidate_range_start() won't work.
One possible approach is to send the successful .numa_protect() notifications in a
batch, but I admit it would introduce complexity.

> I would much rather tell userspace to disable migrate-on-fault for KVM guest
> mappings (mbind() should work?) for these types of setups, or to disable NUMA
> balancing entirely.  If the user really cares about performance of their VM(s),
> vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
> the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
> At that point, NUMA balancing is likely to do more harm than good.
>
> > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > > won't yield if there's mmu_lock contention.
> > Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine.
> >
> > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > > I think we can achieve what you want without losing batching, and without changing
> > > secondary MMUs.
> > >
> > > Rather than muck with the notification types and add a one-off for NUMA, just
> > > defer the notification until a present PMD/PTE is actually going to be modified.
> > > It's not the prettiest, but other than the locking, I don't think has any chance
> > > of regressing other workloads/configurations.
> > >
> > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > >
> > > Compile tested only.
> >
> > I don't find a matching end to each
> > mmu_notifier_invalidate_range_start_nonblock().
>
> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>
> 	if (range.start)
> 		mmu_notifier_invalidate_range_end(&range);

No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in
change_pte_range() if we only want the range to include pages successfully set
to PROT_NONE.

> --
> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date: Mon, 14 Aug 2023 08:59:12 -0700
> Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
>  mapping is changing
>
> Retry page faults without acquiring mmu_lock if the resolve hva is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the invalidation task will yield the
> lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault.  And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
>
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
>
> Reported-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 3 +++
>  include/linux/kvm_host.h | 8 +++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..f29718a16211 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +

This can effectively reduce the remote flush IPIs a lot!

One nit: maybe an rmb() or READ_ONCE() is required for
kvm->mmu_invalidate_range_start and kvm->mmu_invalidate_range_end (a rough
sketch of what I mean is at the end of this mail).
Otherwise, I'm somewhat worried about constant false positives and retries.

>  	return RET_PF_CONTINUE;
>  }
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cb86108c624d..f41d4fe61a06 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
> @@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> +
> +	/*
> +	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives and/or false positives (less concerning).
> +	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
> +	 * for an in-progress invalidation to avoiding contending mmu_lock.
> +	 */
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
>
> base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
> --
>
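To make the READ_ONCE() nit above a bit more concrete, below is a rough,
compile-untested sketch of what I had in mind, i.e. wrapping the lockless
reads of the invalidation range fields in READ_ONCE().  The field and
function names are taken from the quoted patch; this is only meant to
illustrate the idea, not to replace your change:

/*
 * Illustration only: use READ_ONCE() so the compiler cannot refetch or fuse
 * the loads when the caller does not hold mmu_lock.  The values may still be
 * stale relative to each other, but if I'm reading the fault path correctly
 * that only causes a spurious retry (or a miss that is caught later by the
 * recheck done under mmu_lock), so no stronger ordering is strictly needed.
 */
static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
					   unsigned long mmu_seq,
					   unsigned long hva)
{
	unsigned long start = READ_ONCE(kvm->mmu_invalidate_range_start);
	unsigned long end = READ_ONCE(kvm->mmu_invalidate_range_end);

	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
	    hva >= start && hva < end)
		return 1;
	if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
		return 1;
	return 0;
}

Whether an explicit rmb() is also needed I'm not sure; since the result of the
lockless pre-check is only used as a hint to retry, plain READ_ONCE() might be
enough.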