Re: [PATCH 1/2] KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU

David Matlack <dmatlack@xxxxxxxxxx> · Wed, 12 Jan 2022 15:57:53 -0800

On Wed, Jan 12, 2022 at 3:14 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Jan 12, 2022, David Matlack wrote:
> > When the TDP MMU is write-protection GFNs for page table protection (as
> > opposed to for dirty logging, or due to the HVA not being writable), it
> > checks if the SPTE is already write-protected and if so skips modifying
> > the SPTE and the TLB flush.
> >
> > This behavior is incorrect because the SPTE may be write-protected for
> > dirty logging. This implies that the SPTE could be locklessly be made
> > writable on the next write access, and that vCPUs could still be running
> > with writable SPTEs cached in their TLB.
> >
> > Fix this by unconditionally setting the SPTE and only skipping the TLB
> > flush if the SPTE was already marked !MMU-writable or !Host-writable,
> > which guarantees the SPTE cannot be locklessly be made writable and no
> > vCPUs are running the writable SPTEs cached in their TLBs.
> >
> > Technically it would be safe to skip setting the SPTE as well since:
> >
> >   (a) If MMU-writable is set then Host-writable must be cleared
> >       and the only way to set Host-writable is to fault the SPTE
> >       back in entirely (at which point any unsynced shadow pages
> >       reachable by the new SPTE will be synced and MMU-writable can
> >       be safetly be set again).
> >
> >   and
> >
> >   (b) MMU-writable is never consulted on its own.
> >
> > And in fact this is what the shadow MMU does when write-protecting guest
> > page tables. However setting the SPTE unconditionally is much easier to
> > reason about and does not require a huge comment explaining why it is safe.
>
> I disagree.  I looked at the code+comment before reading the full changelog and
> typed up a response saying the code should be:
>
>                 if (!is_writable_pte(iter.old_spte) &&
>                     !spte_can_locklessly_be_made_writable(spte))
>                         break;
>
> Then I went read the changelog and here we are :-)
>
> I find that much more easier to grok, e.g. in plain English: "if the SPTE isn't
> writable and can't be made writable, there's nothing to do".

Oh interesting. I actually find that confusing because it can easily
lead to the MMU-writable bit staying set. Here we are protecting GFNs
and we're opting to leave the MMU-writable bit set. It takes a lot of
digging to figure out that this is safe because if MMU-writable is set
and the SPTE cannot be locklessly be made writable then it implies
Host-writable is clear, and Host-writable can't be reset without
syncing the all shadow pages reachable by the MMU. Oh and the
MMU-writable bit is never consulted on its own (e.g. We never iterate
through all SPTEs to find the ones that are !MMU-writable).

Maybe my understanding is horribly off since this all seems
unnecessarily convoluted, and the cost of always clearing MMU-writable
is just an extra bitwise-OR.

The TLB flush is certainly unnecessary if the SPTE is already
!Host-writable, which is what this commit does.

>
> Versus "unconditionally clear the writable bits because ???, but only flush if
> the write was actually necessary", with a slightly opinionated translation :-)

If MMU-writable is already clear we can definitely break. I had that
in a previous version of the patch by checking if iter.old_spte ==
new_spte but it seemed unnecessary since the guts of
tdp_mmu_spte_set() already optimizes for this.

>
> And with that, you don't need to do s/spte_set/flush.  Though I would be in favor
> of a separate patch to do s/spte_set/write_protected here and in the caller, to
> match kvm_mmu_slot_gfn_write_protect().

I'm not sure write_protected would not be a good variable name because
even if we did not write-protect the SPTE (i.e. PT_WRITABLE_MASK was
already clear) we may still need a TLB flush to ensure no CPUs have a
writable SPTE in their TLB. Perhaps we have different definitions for
"write-protecting"?