> > +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > +					    struct mm_struct *mm,
> > +					    unsigned long start,
> > +					    unsigned long end)
> > +{
> > +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > +
> > +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > +		return;
> > +
> > +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > +}

> numa balance will scan wide memory range, and there will be one time

Though the scanned memory range is wide, .invalidate_range_start() is sent
for each 2M range.

> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> it may bring out lots of flush remote tlb ipi notification.

Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
will be reduced to 0 with this series.

For VMs without assigned devices or mdev devices, I was previously also
worried that there might be more IPIs. But with current test data,
there are no more remote tlb IPIs on average.

The reason is below:

Before this series, kvm_unmap_gfn_range() is called once for a 2M range.
After this series, kvm_unmap_gfn_range() is called once if the 2M range is
mapped to a huge page in the primary MMU, and called at most 512 times if
it is mapped to 4K pages in the primary MMU.

Though kvm_unmap_gfn_range() is only called once before this series, as
the range is blockable, when there are contentions, remote tlb IPIs can be
sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched()) if
the pages are mapped in 4K in the secondary MMU.

With this series, on the other hand, .numa_protect() sets the range to be
unblockable. So there could be fewer remote tlb IPIs when a 2M range is
mapped into small PTEs in the secondary MMU. Besides, .numa_protect() is
not sent for all pages in a given 2M range.

Below is my testing data on a VM without assigned devices:
The data is an average of 10 guest boot-ups.
data               | numa balancing caused  | numa balancing caused
on average         | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series |           35           |          8625
after this series  |         10037          |          4610

For a single guest bootup,

                   | numa balancing caused  | numa balancing caused
best data          | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series |           28           |            13
after this series  |          406           |           195

                   | numa balancing caused  | numa balancing caused
worst data         | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series |           44           |          43920
after this series  |        17352           |          8668

>
> however numa balance notification, pmd table of vm maybe needs not be freed
> in kvm_unmap_gfn_range.
>