On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
> 
> 
> On 2023/8/16 10:43, bibo mao wrote:
> > 
> > 
> > On 2023/8/15 22:50, Sean Christopherson wrote:
> >> On Tue, Aug 15, 2023, Yan Zhao wrote:
> >>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> >>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> >>>>>>
> >>>>>> Compile tested only.
> >>>>>
> >>>>> I don't find a matching end to each
> >>>>> mmu_notifier_invalidate_range_start_nonblock().
> >>>>
> >>>> It pairs with the existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> >>>>
> >>>> 	if (range.start)
> >>>> 		mmu_notifier_invalidate_range_end(&range);
> >>> No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> >>> if we only want the range to include pages successfully set to PROT_NONE.
> >>
> >> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
> >> to defer invalidation until it was actually needed, but still perform only a
> >> single notification so as to batch the TLB flushes, e.g. the start() call still
> >> used the original @end.
> >>
> >> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> >> It was completely untested though, so it may not have actually done anything to reduce
> >> the number of pointless invalidations.
> > For the NUMA-balancing scenario, can the original page still be used by the application
> > even if its PTE is changed to PROT_NONE? If it can be used, maybe we can zap the shadow
> > MMU and flush the TLB
For GUPs that do not honor FOLL_HONOR_NUMA_FAULT, yes.
See https://lore.kernel.org/all/20230803143208.383663-1-david@xxxxxxxxxx/

> Since there is a kvm_mmu_notifier_change_pte notification when the NUMA page is replaced
> with a new page, what I mean is: can the original page still be used by the application
> after the PTE is changed to PROT_NONE and before it is replaced with the new page?
It's not a .change_pte() notification; that one is sent for COW.
do_numa_page()/do_huge_pmd_numa_page() will try to unmap the old page protected with
PROT_NONE, and if every check passes, a separate .invalidate_range_start()/end() with
event type MMU_NOTIFY_CLEAR will be sent.

So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT) can safely keep mapping
maybe-DMA pages until MMU_NOTIFY_CLEAR is sent.
(This approach is implemented in RFC v1:
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@xxxxxxxxx/)

> 
> And for the primary MMU, the TLB is flushed after the PTE is changed to PROT_NONE and
> after the mmu_notifier_invalidate_range_end notification for the secondary MMU.
> Regards
> Bibo Mao
> >> in notification mmu_notifier_invalidate_range_end with precised range, the range can
But I don't think flushing the secondary MMU's TLB only in .invalidate_range_end() is a
good idea.
The flush must be done before kvm->mmu_lock is unlocked; otherwise, confusion will be
caused when multiple threads are trying to update the secondary MMU.

> >> be cross-range between range mmu_gather and mmu_notifier_range.
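
To make the two points above concrete, here is a rough sketch only (not the RFC v1
patch): mmu_notifier_to_kvm(), kvm_flush_remote_tlbs() and kvm->mmu_lock are existing
KVM code, while kvm_zap_hva_range() is a made-up stand-in for the real zap path. A
real patch would also have to distinguish NUMA-balancing PROT_NONE from an ordinary
mprotect(), since both report MMU_NOTIFY_PROTECTION_VMA.

#include <linux/kvm_host.h>
#include <linux/mmu_notifier.h>

static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
			const struct mmu_notifier_range *range)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	/*
	 * NUMA balancing only turns the primary PTEs into PROT_NONE
	 * (MMU_NOTIFY_PROTECTION_VMA); the old page is not freed until
	 * do_numa_page()/do_huge_pmd_numa_page() later unmaps it with a
	 * separate MMU_NOTIFY_CLEAR, so the secondary mapping could be
	 * kept until that second notification arrives.
	 */
	if (range->event == MMU_NOTIFY_PROTECTION_VMA)
		return 0;

	write_lock(&kvm->mmu_lock);
	if (kvm_zap_hva_range(kvm, range->start, range->end))
		/*
		 * Flush while still holding mmu_lock: once the lock is
		 * dropped, other threads may install or rely on secondary
		 * MMU entries, so deferring the flush (e.g. to
		 * .invalidate_range_end()) would let them observe a
		 * half-updated secondary MMU.
		 */
		kvm_flush_remote_tlbs(kvm);
	write_unlock(&kvm->mmu_lock);

	return 0;
}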