On 2023/8/16 13:14, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>>
>>
>> On 2023/8/16 10:43, bibo mao wrote:
>>>
>>>
>>> On 2023/8/15 22:50, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>>>
>>>>>>>> Compile tested only.
>>>>>>>
>>>>>>> I don't find a matching end to each
>>>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>>>
>>>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>>>
>>>>>> 	if (range.start)
>>>>>> 		mmu_notifier_invalidate_range_end(&range);
>>>>>
>>>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>>>> if we only want the range to include pages successfully set to PROT_NONE.
>>>>
>>>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>>>> to defer invalidation until it was actually needed, but still perform only a
>>>> single notification so as to batch the TLB flushes, e.g. the start() call still
>>>> used the original @end.
>>>>
>>>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>>>> It was complete untested though, so it may not have actually done anything to reduce
>>>> the number of pointless invalidations.
>>>
>>> For the numa-balance scenario, can the original page still be used by the application even if the pte
>>> is changed to PROT_NONE? If it can be used, maybe we can zap the shadow mmu and flush the tlb
>
> For GUPs that do not honor FOLL_HONOR_NUMA_FAULT, yes.
>
> See https://lore.kernel.org/all/20230803143208.383663-1-david@xxxxxxxxxx/
>
>> Since there is a kvm_mmu_notifier_change_pte notification when a numa page is replaced with
>> a new page, my question is: can the original page still be used by the application after the pte
>> is changed to PROT_NONE and before it is replaced with the new page?
> It's not .change_pte() notification, which is sent when COW.
> The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
> protected with PROT_NONE, and if every check passes, a separate
> .invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
> sent.

Yes, you are right. The change_pte() notification is sent from migrate_vma_pages();
I confused it with numa page migration. However, invalidate_range_start()/end()
with event type MMU_NOTIFY_CLEAR is also sent when the new page is installed.

> So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
> keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
> (this approach is implemented in RFC v1
> https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@xxxxxxxxx/)
>
>>
>> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and
>> after mmu_notifier_invalidate_range_end notification for secondary mmu.
>> Regards
>> Bibo Mao
>
>>>> in notification mmu_notifier_invalidate_range_end with precised range, the range can
>
> But I don't think flush tlb only in the .invalidate_range_end() in
> secondary MMU is a good idea.

I have no good idea either, and it is beyond my ability to modify the kvm
framework now :(

> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> confusion will be caused when multiple threads trying to update the
> secondary MMU.

Since the tlb flush is already delayed until after all pte entries are cleared,
and there is currently no ranged tlb flush for the secondary mmu, I do not see
why it matters whether the flush happens before or after kvm->mmu_lock is
released.

Regards
Bibo Mao

>
>>>> be cross-range between range mmu_gather and mmu_notifier_range.
>
>
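For reference, the deferred-notification idea Sean describes earlier in the thread can be sketched as a small userspace simulation (all names and the flat "pte" model here are illustrative, not the kernel's actual code): invalidate_range_start() is only sent once the walk finds the first pte it will actually change, but the notified range still covers the whole original span, so the TLB flushes stay batched under a single start()/end() pair.

```c
#include <stdbool.h>

/*
 * Userspace sketch, NOT kernel code.  Counters let us observe how many
 * notifications were sent and what range they covered.
 */
static int start_calls, end_calls;
static unsigned long notified_start, notified_end;

static void invalidate_range_start_stub(unsigned long start, unsigned long end)
{
	start_calls++;
	notified_start = start;
	notified_end = end;
}

static void invalidate_range_end_stub(void)
{
	end_calls++;
}

/*
 * Walk [start, end); @changeable says whether a given "pte" can be made
 * PROT_NONE.  Returns the number of ptes changed.  The start() call is
 * deferred until the first real change, but still uses the original
 * @start/@end so the whole region is invalidated in one batch.
 */
static int change_pte_range_sim(unsigned long start, unsigned long end,
				bool (*changeable)(unsigned long))
{
	bool started = false;
	int changed = 0;

	for (unsigned long addr = start; addr < end; addr++) {
		if (!changeable(addr))
			continue;
		if (!started) {
			invalidate_range_start_stub(start, end);
			started = true;
		}
		changed++;	/* pretend the pte was set to PROT_NONE */
	}
	/* Mirrors "if (range.start) mmu_notifier_invalidate_range_end()". */
	if (started)
		invalidate_range_end_stub();
	return changed;
}

/* Example predicates for exercising the walk. */
static bool pte_never_changeable(unsigned long addr) { (void)addr; return false; }
static bool pte_odd_changeable(unsigned long addr) { return addr & 1; }
```

The point of the sketch is the "play nice when nothing can be migrated" case: if the predicate rejects every pte, no notification is sent at all, which is exactly what the conditional end() in change_pmd_range() pairs with.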