On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > >  		    !is_last_spte(iter.old_spte, iter.level))
> > > > >  			continue;
> > > > > 
> > > > > +		if (skip_pinned) {
> > > > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > > +			struct folio *folio;
> > > > > +
> > > > > +			if (!page)
> > > > > +				continue;
> > > > > +
> > > > > +			folio = page_folio(page);
> > > > > +
> > > > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > > +			    folio_maybe_dma_pinned(folio))
> > > > > +				continue;
> > > > > +		}
> > > > > +
> > > > 
> > > > I don't get it..
> > > > 
> > > > The last patch made it so that the NUMA balancing code doesn't change
> > > > page_maybe_dma_pinned() pages to PROT_NONE
> > > > 
> > > > So why doesn't KVM just check if the current and new SPTE are the same
> > > > and refrain from invalidating if nothing changed?
> > > 
> > > Because KVM doesn't have visibility into the current and new PTEs when
> > > the zapping occurs.  The contract for invalidate_range_start() requires
> > > that KVM drop all references before returning, and so the zapping occurs
> > > before change_pte_range() or change_huge_pmd() have done anything.
> > > 
> > > > Duplicating the checks here seems very frail to me.
> > > 
> > > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > > can yield different results purely based on refcounts, i.e. KVM could skip
> > > pages that the primary MMU does not, and thus violate the mmu_notifier
> > > contract.  And in general, I am steadfastly against adding any kind of
> > > heuristic to KVM's zapping logic.
> > > 
> > > This really needs to be fixed in the primary MMU and not require any
> > > direct involvement from secondary MMUs, e.g. the mmu_notifier
> > > invalidation itself needs to be skipped.
> > 
> > This likely has the same issue you just described: we don't know if it
> > can be skipped until we iterate over the PTEs, and by then it is too
> > late to invoke the notifier.  Maybe some kind of abort and restart
> > scheme could work?
> 
> Or maybe treat this as a userspace config problem?  Pinning DMA pages in
> a VM, having a fair amount of remote memory, *and* expecting NUMA
> balancing to do anything useful for that VM seems like a userspace
> problem.
> 
> Actually, does NUMA balancing even support this particular scenario?  I
> see this in do_numa_page()
> 
> 	/* TODO: handle PTE-mapped THP */
> 	if (PageCompound(page))
> 		goto out_map;

Hi Sean,
I think compound pages are handled in do_huge_pmd_numa_page(), and I did
observe NUMA migration of those kinds of pages.

> and then for PG_anon_exclusive
> 
> 	 * ... For now, we only expect it to be
> 	 * set on tail pages for PTE-mapped THP.
> 	 */
> 	PG_anon_exclusive = PG_mappedtodisk,
> 
> which IIUC means zapping these pages to do migrate-on-fault will never
> succeed.
> 
> Can we just tell userspace to mbind() the pinned region to explicitly
> exclude the VMA(s) from NUMA balancing?

For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
dynamic; I think it's hard to mbind() them in advance.

Thanks
Yan
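
[Editor's note: for context on why folio_maybe_dma_pinned() is called a
heuristic above, the mm-side check is roughly the following (from
include/linux/mm.h around the 6.5 era; comments paraphrased here, so treat
the details as approximate rather than authoritative):]

	/* "maybe" is the best GUP can do: for small folios a pin is
	 * expressed only as a refcount bias, not a dedicated flag.
	 */
	static inline bool folio_maybe_dma_pinned(struct folio *folio)
	{
		/* Large folios track pins exactly, in a dedicated counter. */
		if (folio_test_large(folio))
			return atomic_read(&folio->_pincount) > 0;

		/*
		 * Small folios only add GUP_PIN_COUNTING_BIAS (1024) to the
		 * plain refcount, so any 1024 ordinary references are
		 * indistinguishable from a single pin.
		 */
		return ((unsigned int)folio_ref_count(folio)) >=
			GUP_PIN_COUNTING_BIAS;
	}

[That refcount ambiguity is why KVM evaluating the same predicate at a
different time than the primary MMU can reach a different answer, hence the
mmu_notifier-contract objection.]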
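
[Editor's note: on the final point, a minimal userspace sketch of the
mbind() suggestion, for comparison, assuming a VMM that knows the
to-be-pinned range up front and that the node number fits in one mask word;
bind_pinned_region(), addr, length, and node are illustrative names, not
anything from the thread:]

	#include <stdio.h>
	#include <numaif.h>	/* mbind(); link with -lnuma */

	/*
	 * Hypothetical helper: bind a guest-RAM range that will be
	 * DMA-pinned to one node, so NUMA balancing has nothing to migrate
	 * (and hence no reason to zap secondary-MMU mappings of it).
	 */
	static int bind_pinned_region(void *addr, unsigned long length, int node)
	{
		unsigned long nodemask = 1UL << node;

		/* MPOL_MF_MOVE also migrates pages already faulted in elsewhere. */
		if (mbind(addr, length, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, MPOL_MF_MOVE)) {
			perror("mbind");
			return -1;
		}
		return 0;
	}

[The difficulty Yan raises is exactly that for mdev-backed VMs there is no
single up-front moment at which such a call could cover all the ranges the
device will later pin.]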