Re: [PATCH 2/2] KVM: x86/mmu: Improve comment about TLB flush semantics for write-protection

David Matlack <dmatlack@xxxxxxxxxx> · Thu, 13 Jan 2022 09:10:07 -0800

On Wed, Jan 12, 2022 at 4:46 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Jan 12, 2022, David Matlack wrote:
> > Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains
> > why it is safe to flush TLBs outside of the MMU lock after
> > write-protecting SPTEs for dirty logging. The current comment is a long
> > run-on sentance that was difficult to undertsand. In addition it was
> > specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP
> > MMU has to handle this as well.
> >
> > Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 29 ++++++++++++++++++++---------
> >  1 file changed, 20 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1d275e9d76b5..33f550b3be8f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5825,15 +5825,26 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >       }
> >
> >       /*
> > -      * We can flush all the TLBs out of the mmu lock without TLB
> > -      * corruption since we just change the spte from writable to
> > -      * readonly so that we only need to care the case of changing
> > -      * spte from present to present (changing the spte from present
> > -      * to nonpresent will flush all the TLBs immediately), in other
> > -      * words, the only case we care is mmu_spte_update() where we
> > -      * have checked Host-writable | MMU-writable instead of
> > -      * PT_WRITABLE_MASK, that means it does not depend on PT_WRITABLE_MASK
> > -      * anymore.
> > +      * It is safe to flush TLBs outside of the MMU lock since SPTEs are only
> > +      * being changed from writable to read-only (i.e. the mapping to host
> > +      * PFNs is not changing).
>
> Hmm, you mostly address things in the next sentence/paragraph, but it's more than
> the SPTE being downgraded from writable => read-only, e.g. if the SPTE were being
> made read-only due to userspace removing permissions, then KVM would need to flush
> before dropping mmu_lock.  The qualifier about the PFN not changing actually does
> more harm than good because it further suggests that writable => read-only is
> somehow inherently safe.
>
> > +      * All we care about is that CPUs start using the
> > +      * read-only mappings from this point forward to ensure the dirty bitmap
> > +      * gets updated, but that does not need to run under the MMU lock.
>
> "this point forward" isn't technically true, the requirement is that the flush
> occurs before the memslot update completes.  Definitely splitting hairs, I mean,
> this basically is the end of the memslot update, but it's an opportunity to
> clarify _why_ the flush needs to happen at this point.
>
> > +      *
> > +      * Note that there are other reasons why SPTEs can be write-protected
> > +      * besides dirty logging: (1) to intercept guest page table
> > +      * modifications when doing shadow paging and (2) to protecting guest
> > +      * memory that is not host-writable.
>
> So, technically, #2 is not possible.  KVM doesn't allow a memslot to be converted
> from writable => read-only, userspace must first delete the entire memslot.  That
> means the relevant SPTEs never transition directly from writable to !writable,
> they are instead zapped entirely and "new" SPTEs are created that are read-only
> from their genesis.
>
> Making a VMA read-only also results in SPTEs being zapped and recreated, though
> this is an area for improvement.  We could cover future changes in this area by
> being a bit fuzzy in the wording, but I think it would be better to talk only
> about the shadow paging case and thus only about MMU-writable, because Host-writable
> is (currently) immutable and making it mutable (in the mmu_notifier path) will
> have additional "rule" changes.
>
> > +      * Both of these usecases require
> > +      * flushing the TLB under the MMU lock to ensure CPUs are not running
> > +      * with writable SPTEs in their TLB. The tricky part is knowing when it
> > +      * is safe to skip a TLB flush if an SPTE is already write-protected,
> > +      * since it could have been write-protected for dirty-logging which does
> > +      * not flush under the lock.
>
> It's a bit unclear that the last "skip a TLB flush" snippet is referring to a
> future TLB flush, not this TLB flush.
>
> > +      *
> > +      * To handle this each SPTE has an MMU-writable bit and a Host-writable
> > +      * bit (KVM-specific bits that are not used by hardware). These bits
> > +      * allow KVM to deduce *why* a given SPTE is currently write-protected,
> > +      * so that it knows when it needs to flush TLBs under the MMU lock.
>
> I much rather we add this type of comment over the definitions for
> DEFAULT_SPTE_{HOST,MMU}_WRITEABLE and then provide a reference to those definitions
> after a very brief "KVM uses the MMU-writable bit".
>
> So something like this?  Plus more commentry in spte.h.
>
>         /*
>          * It's safe to flush TLBs after dropping mmu_lock as making a writable
>          * SPTE read-only for dirty logging only needs to ensure KVM starts
>          * logging writes to the memslot before the memslot update completes,
>          * i.e. before the enabling of dirty logging is visible to userspace.
>          *
>          * Note, KVM also write-protects SPTEs when shadowing guest page tables,
>          * in which case a TLB flush is needed before dropping mmu_lock().  To
>          * ensure a future TLB flush isn't missed, KVM uses a software-available
>          * bit to track if a SPTE is MMU-Writable, i.e. is considered writable
>          * for shadow paging purposes.  When write-protecting for shadow paging,
>          * KVM clears both WRITABLE and MMU-Writable, and performs a TLB flush
>          * while holding mmu_lock if either bit is cleared.
>          *
>          * See DEFAULT_SPTE_{HOST,MMU}_WRITEABLE for more details.
>          */

Makes sense. I'll rework the comment per your feedback and also
document the {host,mmu}-writable bits. Although I think it'd make more
sense to put those comments on shadow_{host,mmu}_writable_mask as
those are the symbols used throughout the code and EPT uses different
bits than DEFAULT_..._WRITABLE.

>
> >        */
> >       if (flush)
> >               kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> > --
> > 2.34.1.703.g22d0c6ccf7-goog
> >