Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Wed, 9 Aug 2023 08:29:31 +0800

On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> > >  			continue;
> > >  
> > > +		if (skip_pinned) {
> > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > +			struct folio *folio;
> > > +
> > > +			if (!page)
> > > +				continue;
> > > +
> > > +			folio = page_folio(page);
> > > +
> > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > +			    folio_maybe_dma_pinned(folio))
> > > +				continue;
> > > +		}
> > > +
> > 
> > I don't get it..
> > 
> > The last patch made it so that the NUMA balancing code doesn't change
> > page_maybe_dma_pinned() pages to PROT_NONE
> > 
> > So why doesn't KVM just check if the current and new SPTE are the same
> > and refrain from invalidating if nothing changed?
> 
> Because KVM doesn't have visibility into the current and new PTEs when the zapping
> occurs.  The contract for invalidate_range_start() requires that KVM drop all
> references before returning, and so the zapping occurs before change_pte_range()
> or change_huge_pmd() have done anything.
> 
> > Duplicating the checks here seems very frail to me.
> 
> Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> can yield different results purely based on refcounts, i.e. KVM could skip pages
Do you mean that folio_maybe_dma_pinned() and page_maybe_dma_pinned() can
yield different results?

I chose folio_maybe_dma_pinned() in KVM on purpose: in KVM's
.invalidate_range_start() handler we may get tail pages of a folio, so
it's better to call the folio version of the check.

However, in mm core, i.e. in change_huge_pmd() and change_pte_range(),
the "page" it gets is always the head page of a folio, so although
page_maybe_dma_pinned() is called there, it is actually equivalent to
folio_maybe_dma_pinned(page_folio(page)).

So, I think the two sides should yield equal results.
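
To make the equivalence concrete, here is a rough userspace model of the
two checks as I understand them. It is only a sketch: the toy_* types and
functions are made up for illustration and are not kernel code. It assumes
that large folios track pins in a separate pin count and that small folios
compare the ref count against GUP_PIN_COUNTING_BIAS (1024).

```c
#include <stdbool.h>

/* Assumed to match the kernel's GUP_PIN_COUNTING_BIAS. */
#define GUP_PIN_COUNTING_BIAS 1024U

/* Toy stand-ins for struct folio / struct page (hypothetical layout). */
struct toy_folio {
	bool large;            /* models folio_test_large() */
	unsigned int refcount; /* models folio_ref_count() */
	int pincount;          /* models the pin count of a large folio */
};

struct toy_page {
	struct toy_folio *folio; /* head and tail pages share one folio */
};

/* Models page_folio(): any page of a folio resolves to that folio. */
static struct toy_folio *toy_page_folio(const struct toy_page *page)
{
	return page->folio;
}

/* Models folio_maybe_dma_pinned(): exact for large folios, a ref count
 * heuristic (false positives possible) for small ones. */
static bool toy_folio_maybe_dma_pinned(const struct toy_folio *folio)
{
	if (folio->large)
		return folio->pincount > 0;
	return folio->refcount >= GUP_PIN_COUNTING_BIAS;
}

/* Models page_maybe_dma_pinned() as the folio check on page_folio(page). */
static bool toy_page_maybe_dma_pinned(const struct toy_page *page)
{
	return toy_folio_maybe_dma_pinned(toy_page_folio(page));
}
```

In this model the head page and any tail page of one folio always get the
same answer. It also shows the false positive case: a small folio with
1024 ordinary references looks pinned although nobody pinned it.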

On the other hand, if your concern is that the ref count of a page is
dynamic, and that KVM and mm core do not check it atomically, I think
it's still fine.
The .invalidate_range_start() notification with event type
MMU_NOTIFY_PROTECTION_VMA only means that the corresponding PTE is
protected in the primary MMU; it does not mean the page is unmapped.

In series [1], we can even see that for processes other than KVM, the
PROT_NONE set in the primary MMU for NUMA migration is actually ignored
and the underlying PFNs are still accessed.

So, could KVM open a door for maybe-dma-pinned pages, and keep mapping
those pages until
(1) an invalidate notification other than MMU_NOTIFY_PROTECTION_VMA comes, or
(2) an invalidate notification with MMU_NOTIFY_PROTECTION_VMA comes again with
a reduced page ref count?
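
The policy I have in mind could be sketched roughly as below. This is a
standalone model, not a patch: the toy_* enum, struct and function are
hypothetical names, and the three flags stand for the folio_test_anon(),
PageAnonExclusive() and folio_maybe_dma_pinned() results sampled at the
time the notification arrives.

```c
#include <stdbool.h>

/* Hypothetical subset of mmu_notifier event types. */
enum toy_notify_event {
	TOY_NOTIFY_UNMAP,
	TOY_NOTIFY_CLEAR,
	TOY_NOTIFY_PROTECTION_VMA,
};

/* Hypothetical snapshot of what KVM sees for one leaf SPTE. */
struct toy_spte_state {
	bool anon;             /* models folio_test_anon() */
	bool anon_exclusive;   /* models PageAnonExclusive() */
	bool maybe_dma_pinned; /* models folio_maybe_dma_pinned() */
};

/* Returns true if the SPTE may stay mapped for this notification. */
static bool toy_may_keep_mapping(enum toy_notify_event event,
				 const struct toy_spte_state *s)
{
	/* Case (1): any event other than PROTECTION_VMA always zaps. */
	if (event != TOY_NOTIFY_PROTECTION_VMA)
		return false;

	/* Case (2): a later PROTECTION_VMA zaps once the page no longer
	 * looks pinned, i.e. after its ref count has dropped. */
	return s->anon && s->anon_exclusive && s->maybe_dma_pinned;
}
```

The pinned check is re-evaluated on every notification, so a page that
was skipped once is zapped the next time it no longer looks pinned.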

[1]: https://lore.kernel.org/all/20230803143208.383663-1-david@xxxxxxxxxx/

Thanks
Yan

> that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> in general, I am steadfastly against adding any kind of heuristic to KVM's
> zapping logic.
> 
> This really needs to be fixed in the primary MMU and not require any direct
> involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> to be skipped.
>