On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > >  		    !is_last_spte(iter.old_spte, iter.level))
> > > > >  			continue;
> > > > > 
> > > > > +		if (skip_pinned) {
> > > > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > > +			struct folio *folio;
> > > > > +
> > > > > +			if (!page)
> > > > > +				continue;
> > > > > +
> > > > > +			folio = page_folio(page);
> > > > > +
> > > > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > > +			    folio_maybe_dma_pinned(folio))
> > > > > +				continue;
> > > > > +		}
> > > > > +
> > > > 
> > > > I don't get it..
> > > > 
> > > > The last patch made it so that the NUMA balancing code doesn't change
> > > > page_maybe_dma_pinned() pages to PROT_NONE
> > > > 
> > > > So why doesn't KVM just check if the current and new SPTE are the same
> > > > and refrain from invalidating if nothing changed?
> > > 
> > > Because KVM doesn't have visibility into the current and new PTEs when
> > > the zapping occurs.  The contract for invalidate_range_start() requires
> > > that KVM drop all references before returning, and so the zapping occurs
> > > before change_pte_range() or change_huge_pmd() have done anything.
> > > 
> > > > Duplicating the checks here seems very frail to me.
> > > 
> > > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > > can yield different results purely based on refcounts, i.e. KVM could skip
> > > pages that the primary MMU does not, and thus violate the mmu_notifier
> > > contract.  And in general, I am steadfastly against adding any kind of
> > > heuristic to KVM's zapping logic.
> > > 
> > > This really needs to be fixed in the primary MMU and not require any
> > > direct involvement from secondary MMUs, e.g. the mmu_notifier
> > > invalidation itself needs to be skipped.
> > 
> > This likely has the same issue you just described: we don't know if it
> > can be skipped until we iterate over the PTEs, and by then it is too
> > late to invoke the notifier.  Maybe some kind of abort and restart
> > scheme could work?
> 
> Or maybe treat this as a userspace config problem?  Pinning DMA pages in
> a VM, having a fair amount of remote memory, *and* expecting NUMA
> balancing to do anything useful for that VM seems like a userspace
> problem.
> 
> Actually, does NUMA balancing even support this particular scenario?  I
> see this in do_numa_page()
> 
> 	/* TODO: handle PTE-mapped THP */
> 	if (PageCompound(page))
> 		goto out_map;

Hi Sean,
I think compound pages are handled in do_huge_pmd_numa_page(), and I did
observe NUMA migration of those kinds of pages.

> and then for PG_anon_exclusive
> 
> 	 * ... For now, we only expect it to be
> 	 * set on tail pages for PTE-mapped THP.
> 	 */
> 	PG_anon_exclusive = PG_mappedtodisk,
> 
> which IIUC means zapping these pages to do migrate-on-fault will never
> succeed.
> 
> Can we just tell userspace to mbind() the pinned region to explicitly
> exclude the VMA(s) from NUMA balancing?

For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
dynamic; I think it's hard to mbind() them in advance.

Thanks
Yan
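
[Editor's note: for context on why folio_maybe_dma_pinned() is called a
heuristic above, the mm-side check is roughly the following (from
include/linux/mm.h around the 6.5 era; comments paraphrased here, so treat
the details as approximate rather than authoritative):]

	/* "maybe" is the best GUP can do: for small folios a pin is
	 * expressed only as a refcount bias, not a dedicated flag.
	 */
	static inline bool folio_maybe_dma_pinned(struct folio *folio)
	{
		/* Large folios track pins exactly, in a dedicated counter. */
		if (folio_test_large(folio))
			return atomic_read(&folio->_pincount) > 0;

		/*
		 * Small folios only add GUP_PIN_COUNTING_BIAS (1024) to the
		 * plain refcount, so any 1024 ordinary references are
		 * indistinguishable from a single pin.
		 */
		return ((unsigned int)folio_ref_count(folio)) >=
			GUP_PIN_COUNTING_BIAS;
	}

[That refcount ambiguity is why KVM evaluating the same predicate at a
different time than the primary MMU can reach a different answer, hence the
mmu_notifier-contract objection.]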
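
[Editor's note: on the final point, a minimal userspace sketch of the
mbind() suggestion, for comparison, assuming a VMM that knows the
to-be-pinned range up front and that the node number fits in one mask word;
bind_pinned_region(), addr, length, and node are illustrative names, not
anything from the thread:]

	#include <stdio.h>
	#include <numaif.h>	/* mbind(); link with -lnuma */

	/*
	 * Hypothetical helper: bind a guest-RAM range that will be
	 * DMA-pinned to one node, so NUMA balancing has nothing to migrate
	 * (and hence no reason to zap secondary-MMU mappings of it).
	 */
	static int bind_pinned_region(void *addr, unsigned long length, int node)
	{
		unsigned long nodemask = 1UL << node;

		/* MPOL_MF_MOVE also migrates pages already faulted in elsewhere. */
		if (mbind(addr, length, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, MPOL_MF_MOVE)) {
			perror("mbind");
			return -1;
		}
		return 0;
	}

[The difficulty Yan raises is exactly that for mdev-backed VMs there is no
single up-front moment at which such a call could cover all the ranges the
device will later pin.]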