On 23/06/2018 02:46, Junaid Shahid wrote:
>     CPU 1                       CPU 2
>     -----                       -----
> 1.2 SPTE.W = 1
>                         2.1 Guest writes a GPTE for GVA X.
>                             (GPTE being in the guest page table shadowed
>                             by the SP from CPU 1.)
>                             This reads SPTE.W during the page table walk.
>                             Since SPTE.W is read as 1, there is no fault.
>
>                         2.2 Guest issues TLB flush.
>                             That causes a VM Exit.
>
>                         2.3 kvm_mmu_sync_pages() reads sp->unsync.
>                             Since it is false, it just returns.
>
>                         2.4 Guest accesses GVA X.
>                             Since the mapping in the SP was not updated,
>                             the old mapping for GVA X gets used.
> 1.1 sp->unsync = true
>
> The outcome in 2.4 is obviously incorrect. So we need to have a
> write barrier that ensures that 1.1 happens before 1.2, which
> prevents this outcome.
>
> If we map this to the canonical example that you mentioned at the
> start, then x maps to the SPTE, y maps to sp->unsync, the smp_wmb()
> between WRITE_ONCE(y) and WRITE_ONCE(x) maps to an smp_wmb() between 1.1
> and 1.2, and the smp_rmb() between READ_ONCE(x) and READ_ONCE(y) maps to
> the implicit barrier in 2.2.

Aha, so my confusion here comes because the read of the GPTE is used
after kvm_mmu_sync_pages, but it's a cached value that was read *before*
sp->unsync.  If you take the cached read into account, it definitely
looks like the canonical example.  The two-CPU case is much easier to
follow.

So let's add both memory barriers... which, unintuitively, don't pair
with each other.  Something like:

1)
	/*
	 * When reading sp->unsync after a vmexit, the processor might have
	 * some SPTEs with SPTE.W=1 cached in the TLB.  The vmexit already
	 * ensures that sp->unsync is read after the TLB fill (x86 doesn't
	 * reorder reads, and vmentry/vmexit provides a compiler barrier);
	 * here we ensure that sp->unsync is written before the SPTE.
	 * Otherwise, a TLB flush might skip the synchronization of the pages
	 * we're write protecting:
	 *
	 * <insert example>
	 */
	smp_wmb();

2)
	/*
	 * Read sp->unsync before the spinlock.  sp->unsync is written in
	 * kvm_mmu->page_fault with mmu_lock held; this barrier pairs with
	 * spin_lock's acquire barrier, which guarantees that the spinlock
	 * is written before sp->unsync.
	 */
	smp_rmb();

Feel free to make it less verbose. :)

Paolo
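
For readers mapping the discussion above back to the canonical example,
here is a standalone C11 sketch of the same pairing.  This is not the
KVM code: the names unsync and spte_w stand in for sp->unsync and
SPTE.W, the release fence stands in for the smp_wmb() in 1), and the
acquire fence stands in for the ordering that the vmexit provides on
the reader side.  It only illustrates the store/store vs. load/load
pattern, under those assumptions.

/* Build with: cc -std=c11 -pthread barrier_sketch.c */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>

static atomic_bool unsync;	/* hypothetical stand-in for sp->unsync */
static atomic_int  spte_w;	/* hypothetical stand-in for SPTE.W */

static void *cpu1(void *arg)
{
	/* 1.1: mark the shadow page unsynchronized first ... */
	atomic_store_explicit(&unsync, true, memory_order_relaxed);
	/* ... release fence plays the role of smp_wmb() ... */
	atomic_thread_fence(memory_order_release);
	/* ... then 1.2: make the SPTE writable. */
	atomic_store_explicit(&spte_w, 1, memory_order_relaxed);
	return NULL;
}

static void *cpu2(void *arg)
{
	/* 2.1: the page walk observes a writable SPTE. */
	if (atomic_load_explicit(&spte_w, memory_order_relaxed)) {
		/* 2.2: acquire fence models the vmexit-side ordering. */
		atomic_thread_fence(memory_order_acquire);
		/* 2.3: with the fences, unsync cannot be seen as stale false
		 * once the new SPTE value has been observed. */
		if (atomic_load_explicit(&unsync, memory_order_relaxed))
			printf("sp is unsync, must synchronize\n");
		else
			printf("sp looks synchronized, skipping sync\n");
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_create(&t2, NULL, cpu2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

Dropping either fence re-creates the bad outcome in step 2.4: the
reader may see spte_w == 1 while still observing unsync == false.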