On 2023/8/16 13:14, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>>
>>
>> On 2023/8/16 10:43, bibo mao wrote:
>>>
>>>
>>> On 2023/8/15 22:50, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>>>
>>>>>>>> Compile tested only.
>>>>>>>
>>>>>>> I don't find a matching end to each
>>>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>>>
>>>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>>>
>>>>>> 	if (range.start)
>>>>>> 		mmu_notifier_invalidate_range_end(&range);
>>>>>
>>>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>>>> if we only want the range to include pages successfully set to PROT_NONE.
>>>>
>>>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>>>> to defer invalidation until it was actually needed, but still perform only a
>>>> single notification so as to batch the TLB flushes, e.g. the start() call still
>>>> used the original @end.
>>>>
>>>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>>>> It was complete untested though, so it may not have actually done anything to reduce
>>>> the number of pointless invalidations.
>>>
>>> For the numa-balance scenario, can the original page still be used by the application even if the pte
>>> is changed to PROT_NONE? If it can be used, maybe we can zap the shadow mmu and flush the tlb
>
> For GUPs that do not honor FOLL_HONOR_NUMA_FAULT, yes.
>
> See https://lore.kernel.org/all/20230803143208.383663-1-david@xxxxxxxxxx/
>
>> Since there is a kvm_mmu_notifier_change_pte notification when a numa page is replaced with
>> a new page, my question is: can the original page still be used by the application after the pte
>> is changed to PROT_NONE and before it is replaced with the new page?
> It's not .change_pte() notification, which is sent when COW.
> The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
> protected with PROT_NONE, and if every check passes, a separate
> .invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
> sent.

Yes, you are right. The change_pte() notification is sent from migrate_vma_pages();
I confused it with numa page migration. However, invalidate_range_start()/end()
with event type MMU_NOTIFY_CLEAR is also sent when the new page is installed.

> So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
> keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
> (this approach is implemented in RFC v1
> https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@xxxxxxxxx/)
>
>>
>> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and
>> after mmu_notifier_invalidate_range_end notification for secondary mmu.
>> Regards
>> Bibo Mao
>
>>>> in notification mmu_notifier_invalidate_range_end with precised range, the range can
>
> But I don't think flush tlb only in the .invalidate_range_end() in
> secondary MMU is a good idea.

I have no good idea either, and it is beyond my ability to modify the kvm
framework now :(

> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> confusion will be caused when multiple threads trying to update the
> secondary MMU.

Since the tlb flush is already delayed until after all pte entries are cleared,
and there is currently no ranged tlb flush for the secondary mmu, I do not see
why it matters whether the flush happens before or after kvm->mmu_lock is
released.

Regards
Bibo Mao

>
>>>> be cross-range between range mmu_gather and mmu_notifier_range.
>
>
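For reference, the deferred-notification idea Sean describes earlier in the thread can be sketched as a small userspace simulation (all names and the flat "pte" model here are illustrative, not the kernel's actual code): invalidate_range_start() is only sent once the walk finds the first pte it will actually change, but the notified range still covers the whole original span, so the TLB flushes stay batched under a single start()/end() pair.

```c
#include <stdbool.h>

/*
 * Userspace sketch, NOT kernel code.  Counters let us observe how many
 * notifications were sent and what range they covered.
 */
static int start_calls, end_calls;
static unsigned long notified_start, notified_end;

static void invalidate_range_start_stub(unsigned long start, unsigned long end)
{
	start_calls++;
	notified_start = start;
	notified_end = end;
}

static void invalidate_range_end_stub(void)
{
	end_calls++;
}

/*
 * Walk [start, end); @changeable says whether a given "pte" can be made
 * PROT_NONE.  Returns the number of ptes changed.  The start() call is
 * deferred until the first real change, but still uses the original
 * @start/@end so the whole region is invalidated in one batch.
 */
static int change_pte_range_sim(unsigned long start, unsigned long end,
				bool (*changeable)(unsigned long))
{
	bool started = false;
	int changed = 0;

	for (unsigned long addr = start; addr < end; addr++) {
		if (!changeable(addr))
			continue;
		if (!started) {
			invalidate_range_start_stub(start, end);
			started = true;
		}
		changed++;	/* pretend the pte was set to PROT_NONE */
	}
	/* Mirrors "if (range.start) mmu_notifier_invalidate_range_end()". */
	if (started)
		invalidate_range_end_stub();
	return changed;
}

/* Example predicates for exercising the walk. */
static bool pte_never_changeable(unsigned long addr) { (void)addr; return false; }
static bool pte_odd_changeable(unsigned long addr) { return addr & 1; }
```

The point of the sketch is the "play nice when nothing can be migrated" case: if the predicate rejects every pte, no notification is sent at all, which is exactly what the conditional end() in change_pmd_range() pairs with.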