On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
> > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
> > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
> > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
> > event is sent for NUMA migration purpose in specific.
> >
> > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
> > MMU to avoid NUMA protection introduced page faults and restoration of old
> > huge PMDs/PTEs in primary MMU.
> >
> > Patch 3 introduces a new mmu notifier callback .numa_protect(), which
> > will be called in patch 4 when a page is ensured to be PROT_NONE protected.
> >
> > Then in patch 5, KVM can recognize a .invalidate_range_start() notification
> > is for NUMA balancing specific and do not do the page unmap in secondary
> > MMU until .numa_protect() comes.
> >
>
> Why do we need all that, when we should simply not be applying PROT_NONE to
> pinned pages?
>
> In change_pte_range() we already have:
>
> if (is_cow_mapping(vma->vm_flags) &&
>     page_count(page) != 1)
>
> Which includes both, shared and pinned pages.

Ah, right. Currently, on my side, I don't see any pinned pages that fall
outside of this condition.
But I have a question regarding is_cow_mapping(vma->vm_flags): do we
also need to handle pinned pages in !is_cow_mapping(vma->vm_flags)
mappings? (See the first sketch at the end of this mail.)

> Staring at page #2, are we still missing something similar for THPs?

Yes.

> Why is that MMU notifier thingy and touching KVM code required?

Because the NUMA balancing code first sends .invalidate_range_start()
with event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
unconditionally, before it goes down into change_pte_range() and
change_huge_pmd() to check each page's count and apply PROT_NONE (see
the second sketch at the end of this mail).

Current KVM then unmaps all notified pages from the secondary MMU in
.invalidate_range_start(), which can include pages that in the end are
not set to PROT_NONE in the primary MMU.

For VMs with pass-through devices, even though all guest pages are
pinned, KVM still periodically unmaps pages in response to the
.invalidate_range_start() notification from auto NUMA balancing, which
is a waste.

So, if a new callback is sent only when a page is actually set to
PROT_NONE for NUMA migration, KVM can unmap only those pages.

As KVM still needs to unmap pages for other types of events in its
.invalidate_range_start() handler (i.e.
kvm_mmu_notifier_invalidate_range_start()), and MMU_NOTIFY_PROTECTION_VMA
also covers other reasons, patch 1 adds a range flag so that KVM does
not do a blind unmap in .invalidate_range_start(), but does it in the
new .numa_protect() handler instead (third sketch at the end of this
mail).

>
> --
> Cheers,
>
> David / dhildenb
>
>
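
For the !is_cow_mapping() question above, here is a rough, untested
sketch of what I have in mind. page_maybe_dma_pinned() is only a
heuristic, and the exact helpers/signatures in change_pte_range() differ
between kernel versions, so treat this as an illustration only:

	/*
	 * Sketch only, inside the prot_numa handling of the pte loop in
	 * change_pte_range() (mm/mprotect.c); not against any particular
	 * tree.
	 */
	if (prot_numa) {
		struct page *page = vm_normal_page(vma, addr, oldpte);

		if (!page || PageKsm(page))
			continue;

		/* Existing check: skips shared and pinned pages in CoW mappings. */
		if (is_cow_mapping(vma->vm_flags) && page_count(page) != 1)
			continue;

		/*
		 * Question: should we also skip (maybe) long-term pinned
		 * pages in !is_cow_mapping() VMAs, e.g. via the heuristic
		 * page_maybe_dma_pinned()?
		 */
		if (page_maybe_dma_pinned(page))
			continue;

		/* ... make the PTE PROT_NONE for NUMA balancing ... */
	}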
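
And this is roughly the flow I describe for change_pmd_range(),
simplified from mm/mprotect.c (the mmu_notifier_range_init() signature
varies across kernel versions):

	/* Simplified from change_pmd_range() in mm/mprotect.c, sketch only. */
	if (!range.start) {
		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
					vma->vm_mm, addr, end);
		/*
		 * Sent before change_huge_pmd()/change_pte_range() decide
		 * which pages actually become PROT_NONE, so KVM's
		 * .invalidate_range_start() currently unmaps the whole
		 * range from the secondary MMU.
		 */
		mmu_notifier_invalidate_range_start(&range);
	}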
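
Finally, to make the intent of patches 1 and 5 concrete, the idea is
roughly the below. MMU_NOTIFIER_RANGE_NUMA and .numa_protect() are the
names introduced by this series, not upstream; untested sketch:

	static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
					const struct mmu_notifier_range *range)
	{
		/*
		 * If this invalidation is only about NUMA-balancing
		 * PROT_NONE protection, don't blindly unmap the whole range
		 * here; wait for the new .numa_protect() callback, which
		 * reports only the pages that were actually made PROT_NONE.
		 */
		if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
		    (range->flags & MMU_NOTIFIER_RANGE_NUMA))
			return 0;

		/* ... existing path: unmap the range from the secondary MMU ... */
		return 0;
	}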