Re: [PATCH v2 02/17] kvm: x86: Avoid taking MMU lock in kvm_mmu_sync_roots if no sync is needed

On 06/22/2018 05:09 AM, Paolo Bonzini wrote:
> On 21/06/2018 22:17, Junaid Shahid wrote:
>> On 06/19/2018 10:58 PM, Paolo Bonzini wrote:
>>> On 20/06/2018 00:42, Junaid Shahid wrote:
>>>> Actually, now that I think about it again, I don't think that
>>>> any read barrier is needed here. I don't believe that the
>>>> scenario that I described in the comments in this patch needs a
>>>> read barrier (though please correct me if that isn't the case).
>>>> The code path that I actually had in mind when putting the
>>>> barrier here appears to already include a read barrier that I
>>>> missed seeing earlier.
>>>>
>>>> In any case, it seems that smp_read_barrier_depends() is a no-op
>>>> on x86. So we can just remove it altogether, unless you think
>>>> that it would be useful to keep it just as documentation.
>>>
>>> It's not entirely a no-op, it's (on x86) a compiler optimization
>>> barrier.
>>>
>>> In the comment above smp_wmb you should say what is the read
>>> barrier that is "paired" with the smp_wmb.  If there is none, that
>>> most likely means that the smp_read_barrier_depends _is_ actually
>>> needed.
>>
>> As best as I can tell, there isn't any explicit read barrier needed
>> corresponding to the smp_wmb() in mmu_need_write_protect(). In the
>> example scenario described there that necessitates the wmb(), the VM
>> exit (that resulted in the call to kvm_mmu_sync_roots) would act as
>> an implicit barrier that would order the sp->unsync read after
>> whatever condition triggered the sync (e.g. a guest TLB flush). Even
>> if smp_read_barrier_depends() acts as a compiler barrier, it wouldn't
>> add anything there. (It is very well possible that I am missing
>> something. Please do let me know if that is the case.)
> 
> Uhm, actually now I believe that the smp_wmb() is not needed.
> 
> Remember that, in the common case where you're "passing a message" from
> a writer to the reader, the writer writes A before B, while the reader
> reads B before A.  A is the "message", B is a readiness flag:
> 
> 	r1 = READ_ONCE(x);		WRITE_ONCE(y, 1);
> 	smp_rmb();			smp_wmb();
> 	r2 = READ_ONCE(y);		WRITE_ONCE(x, 1);
> 	BUG_ON(r1 == 1 && r2 == 0);
> 
> But this is not what is happening between sp->unsync and SPTE.  The
> important ordering on the write side cannot be "write sp->unsync before
> SPTE", because the ordering on the read side would be "read SPTE before
> sp->unsync".  This is of course not the case.
> 
> Rather, what we want here is for the reader to be conservative in taking
> the spinlock.  The unsync flag doesn't only tell you that you have some
> work to do, it also tells you if there _could_ be a concurrent update in
> progress and you really need to check the spinlock.  So the message is
> the spinlock, the flag is sp->unsync.
> 
> So the writer needs to write sp->unsync after writing the spinlock,
> which is guaranteed by spin_lock.  But the reader needs to read the
> spinlock after reading sp->unsync, and that requires an smp_rmb().
> 
> It's true that the reader needs to read the SPTE after sp->unsync.
> However that just works, through the mediation of the spinlock:
> 
> - if sp->unsync is set then the reader, thanks to smp_rmb()+spin_lock(),
> will see the writer's spin_unlock before proceeding with the sync.
> Therefore the SPTE will always be written before the reader reads it.
> 
> - if sp->unsync is clear, the reader cannot see *any* effect of the
> writer's critical section, including any SPTE write.  That's because the
> BUG_ON in the above pattern translates to "BUG_ON(sp->unsync &&
> !writer_took_the_spin_lock);".
> 
> 
> And fortunately we can use model-checking to validate the above.  Copy
> this into cppmem (http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/index.html)
> 
> int main() {
>   atomic_int unsync=0;
>   atomic_int lock=0;
>   int data=0;
>   {{{ { r11=unsync.load(memory_order_acquire);
>         r12=lock.load(memory_order_acquire).readsvalue(1);
>         r12=lock.load(memory_order_acquire).readsvalue(2);
>         r13=data;
>       }
>   ||| { r21=lock.load(memory_order_relaxed).readsvalue(0);
>         lock.store(1, memory_order_seq_cst);
>         unsync.store(1, memory_order_relaxed);
>         data = 1;
>         lock.store(2, memory_order_release);
>       }
>   }}}
>   return 0;
> }
> 
> Note that above I have a relaxed store to unsync (i.e. no barriers) and
> an acquire load from unsync (i.e. rmb).
> 
> Click "run" and you'll get "2 consistent, race free" executions.  One is
> for the case where thread 1 reads 0 for unsync, because it executes
> before the store; the second is for the case where it reads 1, and then
> in the graph below you'll see that this execution has "data=1".
> 
> There is only one snag in the reasoning.  If I change the first acquire
> to relaxed, it still works, meaning that the smp_rmb() is unnecessary
> too.  I am not sure why that is the case, but
> Documentation/memory-barriers.txt strongly suggests that we shouldn't
> expect that:
> 
>    Memory operations that occur before an ACQUIRE operation may appear
>    to happen after it completes.
> 
> So in the end I'm more confident about the smp_wmb() being unnecessary,
> than the smp_rmb().  (Not smp_read_barrier_depends(), since
> unsync->spinlock does not involve a data dependency).
> 

But in the absence of that smp_wmb(), what prevents the scenario described in the comments in mmu_need_write_protect() from happening?

Let us consider a simplified version of that scenario involving just two CPUs:

    CPU 1                   CPU 2
    -----                   -----
1.2 SPTE.W = 1
                        2.1 Guest writes a GPTE for GVA X.
                            (The GPTE is in the guest page table shadowed
                             by the SP from CPU 1.)
                            The write reads SPTE.W during the shadow page
                            table walk. Since SPTE.W is read as 1, there
                            is no fault.

                        2.2 Guest issues TLB flush.
                            That causes a VM Exit.
                                    
                        2.3 kvm_mmu_sync_roots() reads sp->unsync.
                            Since it is false, it just returns.

                        2.4 Guest accesses GVA X.
                            Since the mapping in the SP was not updated,
                            the stale mapping for GVA X gets used.
1.1 sp->unsync = true

The outcome in 2.4 is obviously incorrect. So we need a write barrier that ensures that 1.1 becomes visible before 1.2, which prevents this outcome.
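
Concretely, the writer side needs something like the following (just a sketch of the required ordering, not the actual code; set_spte_writable() is a made-up stand-in for the real SPTE update in mmu_need_write_protect()):

    /* CPU 1, sketch of the ordering only */
    sp->unsync = true;          /* 1.1: mark the shadow page unsync    */
    smp_wmb();                  /* make 1.1 visible before 1.2         */
    set_spte_writable(sptep);   /* 1.2: SPTE.W = 1, so the guest can
                                 * write the GPTE without faulting     */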

If we map this to the canonical example that you mentioned at the start: x maps to the SPTE, y maps to sp->unsync, the smp_wmb() between WRITE_ONCE(y) and WRITE_ONCE(x) maps to an smp_wmb() between 1.1 and 1.2, and the smp_rmb() between READ_ONCE(x) and READ_ONCE(y) maps to the implicit barrier in 2.2.
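
If I haven't mis-translated the scenario, it also checks out in your cppmem template (a sketch only; "sptew" stands in for SPTE.W, the release store stands in for the smp_wmb() between 1.1 and 1.2, and the acquire load stands in for the implicit barrier at the VM exit in 2.2):

    int main() {
      atomic_int unsync=0;
      atomic_int sptew=0;
      {{{ { unsync.store(1, memory_order_relaxed);
            sptew.store(1, memory_order_release); }
      ||| { r21=sptew.load(memory_order_acquire).readsvalue(1);
            r22=unsync.load(memory_order_relaxed); }
      }}}
      return 0;
    }

Every consistent execution then has r22=1, whereas weakening the release store to relaxed (i.e. dropping the smp_wmb()) should also allow r22=0, which is exactly the bad outcome in 2.4.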

Thanks,
Junaid

PS: I should replace the example in mmu_need_write_protect() with the example above.


