Re: [RFC PATCH 0/11] Rework gfn_to_pfn_cache

On Mon, Nov 15, 2021, Paolo Bonzini wrote:
> On 11/15/21 20:11, David Woodhouse wrote:
> > > Changing mn_memslots_update_rcuwait to a waitq (and renaming it to
> > > mn_invalidate_waitq) is of course also a possibility.
> > I suspect that's the answer.
> > 
> > I think the actual *invalidation* of the cache still lives in the
> > invalidate_range() callback where I have it at the moment.

Oooh!  [finally had a lightbulb moment about ->invalidate_range() after years of
befuddlement].

Two things:

  1. Using _only_ ->invalidate_range() is not correct.  ->invalidate_range() is
     required if and only if the old PFN needs to be _unmapped_.  Specifically,
     if the protections are being downgraded without changing the PFN, it doesn't
     need to be called.  E.g. from hugetlb_change_protection():

	/*
	 * No need to call mmu_notifier_invalidate_range() we are downgrading
	 * page table protection not changing it to point to a new page.
	 *
	 * See Documentation/vm/mmu_notifier.rst
	 */

     x86's kvm_arch_mmu_notifier_invalidate_range() is a special snowflake because
     the APIC access page's VMA is controlled by KVM, i.e. its protections are never
     downgraded; the only thing KVM cares about is whether the PFN changes, because
     that's the only thing that can change.

     In this case, if an HVA is downgraded from RW=>R, KVM would fail to invalidate
     the cache and end up writing to memory that is supposed to be read-only.

     I believe we could use ->invalidate_range() to handle the unmap case if KVM's
     ->invalidate_range_start() hook is enhanced to handle the RW=>R case.  The
     "struct mmu_notifier_range" provides the event type, IIUC we could have the
     _start() variant handle MMU_NOTIFY_PROTECTION_{VMA,PAGE} (and maybe
     MMU_NOTIFY_SOFT_DIRTY?), and let the more precise unmap-only variant handle
     everything else.

  2. If we do split the logic across the two hooks, we should (a) do it in a separate
     series and (b) make the logic common to the gfn_to_pfn cache and to the standard
     kvm_unmap_gfn_range().  That would in theory shave a bit of time off walking
     gfn ranges (maybe even more so with the scalable memslots implementation?), and
     if we're lucky, would resurrect the mostly-dead .change_pte() hook (see commit
     c13fda237f08 ("KVM: Assert that notifier count is elevated in .change_pte()")).

> > But making the req to the affected vCPUs can live in
> > invalidate_range_start(). And then the code which *handles* that req can
> > wait for the mmu_notifier_count to reach zero before it proceeds. Atomic
> > users of the cache (like the Xen event channel code) don't have to get
> > involved with that.
> > 
> > > Also, for the small requests: since you are at it, can you add the code
> > > in a new file under virt/kvm/?
> > 
> > Hm... only if I can make hva_to_pfn() and probably a handful of other
> > things non-static?
> 
> Yes, I think sooner or later we also want all pfn stuff in one file
> (together with MMU notifiers) and all hva stuff in another; so for now you
> can create virt/kvm/hva_to_pfn.h, or virt/kvm/mm.h, or whatever color of the
> bikeshed you prefer.

Preemptive bikeshed strike... the MMU notifiers aren't strictly "pfn stuff", as
they operate on HVAs.  I don't know exactly what Paolo has in mind, but kvm/mm.h
or kvm/kvm_mm.h seems less likely to become stale in the future.
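
FWIW, here's roughly how I picture the request+wait flow David describes above
(again just a sketch; KVM_REQ_GPC_INVALIDATE and gfn_to_pfn_cache_refresh() are
made-up names, and the open-coded wait would presumably become the waitq Paolo
suggested):

	/* In the vCPU run loop, e.g. x86's vcpu_enter_guest(): */
	if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu)) {
		/*
		 * Don't refresh the cache until all in-flight invalidations
		 * have completed; mmu_notifier_count is elevated between
		 * ->invalidate_range_start() and ->invalidate_range_end().
		 * Spinning here is purely illustrative.
		 */
		while (READ_ONCE(vcpu->kvm->mmu_notifier_count))
			cond_resched();

		gfn_to_pfn_cache_refresh(vcpu);
	}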


