Re: [PATCH v2 3/5] KVM: Flush TLB in mmu notifier without holding mmu_lock

Avi Kivity <avi@xxxxxxxxxx> · Mon, 02 Jul 2012 15:05:17 +0300

Revisiting after hiatus.

On 05/21/2012 11:58 PM, Marcelo Tosatti wrote:
> On Thu, May 17, 2012 at 01:24:42PM +0300, Avi Kivity wrote:
>> Signed-off-by: Avi Kivity <avi@xxxxxxxxxx>
>> ---
>>  virt/kvm/kvm_main.c |   16 ++++++++--------
>>  1 file changed, 8 insertions(+), 8 deletions(-)
>> 
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 585ab45..9f6d15d 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -302,11 +302,11 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>>  	kvm->mmu_notifier_seq++;
>>  	if (kvm_unmap_hva(kvm, address))
>>  		kvm_mark_tlb_dirty(kvm);
>> -	/* we've to flush the tlb before the pages can be freed */
>> -	kvm_cond_flush_remote_tlbs(kvm);
>> -
>>  	spin_unlock(&kvm->mmu_lock);
>>  	srcu_read_unlock(&kvm->srcu, idx);
>> +
>> +	/* we've to flush the tlb before the pages can be freed */
>> +	kvm_cond_flush_remote_tlbs(kvm);
>>  }
> 
> There are still sites that assumed implicitly that acquiring mmu_lock
> guarantees that sptes and remote TLBs are in sync. Example:
> 
> void kvm_mmu_zap_all(struct kvm *kvm)
> {
>         struct kvm_mmu_page *sp, *node;
>         LIST_HEAD(invalid_list);
> 
>         spin_lock(&kvm->mmu_lock);
> restart:
>         list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
>                 if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>                         goto restart;
> 
>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
>         spin_unlock(&kvm->mmu_lock);
> }
> 
> kvm_mmu_commit_zap_page only flushes if the TLB was dirtied by this
> context, not before. kvm_mmu_unprotect_page is similar.
> 
> In general:
> 
> context 1                       context 2
> 
> lock(mmu_lock)
> modify spte
> mark_tlb_dirty
> unlock(mmu_lock)
>                                 lock(mmu_lock)
>                                 read spte
>                                 make a decision based on spte value
>                                 unlock(mmu_lock)
> flush remote TLBs
> 
> 
> Is scary.

It scares me too.  Could be something trivial like following an
intermediate paging structure entry or not, depending on whether it is
present.  I don't think we'll be able to audit the entire code base to
ensure nothing like that happens, or to enforce it later on.
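
To make the window concrete, here is a toy replay of the interleaving
quoted above, in Python rather than kernel C. Everything in it (ToyKvm,
the single spte, the single remote TLB entry) is a made-up stand-in and
not KVM code; it only shows that a second context can observe the spte
and the remote TLB disagreeing until the deferred flush runs.

```python
class ToyKvm:
    """Toy model (assumption): one spte plus one remote TLB entry caching it."""

    def __init__(self):
        self.spte = "present"
        self.remote_tlb = "present"   # what a remote vcpu may still be using
        self.tlb_dirty = False

    def zap_spte(self):
        # Context 1, conceptually under mmu_lock: modify spte, mark_tlb_dirty.
        self.spte = "nonpresent"
        self.tlb_dirty = True

    def flush_remote_tlbs(self):
        # Context 1, after mmu_lock has been dropped.
        self.remote_tlb = self.spte
        self.tlb_dirty = False

    def in_sync(self):
        return self.remote_tlb == self.spte


kvm = ToyKvm()
kvm.zap_spte()                       # context 1: lock, modify, mark dirty, unlock
seen_by_context2 = kvm.in_sync()     # context 2: lock, read spte, decide, unlock
kvm.flush_remote_tlbs()              # context 1: the deferred flush
seen_after_flush = kvm.in_sync()
```

Context 2 runs entirely inside the window, so whatever it decides is
based on an spte that remote TLBs do not yet reflect.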

> Perhaps have a rule that says:
> 
> 1) Conditionally flush remote TLB after acquiring mmu_lock, 
> before anything (even perhaps inside the lock macro).

I would like to avoid any flush within the lock.  We may back down on
this goal but let's try to find something that works and doesn't depend
on huge audits.

> 2) Except special cases where it is clear that this is not 
> necessary.

One option: your idea, but with the flush done before taking the lock:

  def mmu_begin():
      clean = False
      while not clean:
          cond_flush the tlb
          rtlb = kvm.remote_tlb_counter # atomically wrt flush
          spin_lock mmu_lock
          if rtlb == kvm.remote_tlb_counter:
              clean = True
          else:
              spin_unlock(mmu_lock)

Since we're spinning over the tlb counter, there's no real advantage
here except that preemption is enabled.
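
For anyone who wants to poke at the retry loop, a runnable Python model
follows. It is only a sketch: ToyKvm, its fields and the before_lock
hook are invented for illustration, I am assuming the counter is bumped
every time the TLB is marked dirty, and a threading.Lock stands in for
mmu_lock. The hook injects one racing dirtier between sampling the
counter and taking the lock, which forces exactly one retry.

```python
import threading


class ToyKvm:
    """Toy stand-in for struct kvm; every field here is an assumption."""

    def __init__(self):
        self.mmu_lock = threading.Lock()
        self.tlb_dirty = False
        self.remote_tlb_counter = 0   # assumed: bumped each time the TLB is dirtied
        self.flushes = 0

    def mark_tlb_dirty(self):
        self.tlb_dirty = True
        self.remote_tlb_counter += 1

    def cond_flush(self):
        # Flush only if something dirtied the TLB since the last flush.
        if self.tlb_dirty:
            self.tlb_dirty = False
            self.flushes += 1


def mmu_begin(kvm, before_lock=None):
    """Return with mmu_lock held and no flush pending; report the retry count."""
    retries = 0
    while True:
        kvm.cond_flush()
        rtlb = kvm.remote_tlb_counter   # real code must sample this atomically wrt the flush
        if before_lock is not None:
            before_lock()               # test hook: a "racing context" runs here
        kvm.mmu_lock.acquire()
        if rtlb == kvm.remote_tlb_counter:
            return retries              # nobody dirtied the TLB in the window
        kvm.mmu_lock.release()          # raced with a dirtier: drop the lock and retry
        retries += 1


kvm = ToyKvm()
pending = [kvm.mark_tlb_dirty]          # one racing dirtier that fires once


def racer():
    if pending:
        pending.pop()()


retries = mmu_begin(kvm, before_lock=racer)
held = kvm.mmu_lock.locked()
kvm.mmu_lock.release()
```

With a steady stream of dirtiers the loop would keep retrying, which is
the spinning mentioned above.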

A simpler option is to make mmu_end() do a cond_flush():

   def mmu_begin():
       spin_lock(mmu_lock)

   def mmu_end():
       spin_unlock(mmu_lock)
       cond_flush
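
The same pair as a runnable Python model, again only a sketch under
assumptions: ToyKvm and the mmu_section wrapper are hypothetical, and a
threading.Lock stands in for mmu_lock. The model records whether the
lock was held at the moment of each flush, to check that the flush
really happens after the unlock.

```python
import threading
from contextlib import contextmanager


class ToyKvm:
    """Toy stand-in for struct kvm; every field here is an assumption."""

    def __init__(self):
        self.mmu_lock = threading.Lock()
        self.tlb_dirty = False
        self.flushed_under_lock = []   # one entry per flush: was mmu_lock held?

    def mark_tlb_dirty(self):
        self.tlb_dirty = True

    def cond_flush(self):
        # Flush only if the TLB was dirtied; note the lock state at that moment.
        if self.tlb_dirty:
            self.flushed_under_lock.append(self.mmu_lock.locked())
            self.tlb_dirty = False


def mmu_begin(kvm):
    kvm.mmu_lock.acquire()


def mmu_end(kvm):
    kvm.mmu_lock.release()
    kvm.cond_flush()                   # flush after the lock has been dropped


@contextmanager
def mmu_section(kvm):
    # Hypothetical convenience wrapper so callers cannot forget mmu_end().
    mmu_begin(kvm)
    try:
        yield kvm
    finally:
        mmu_end(kvm)


kvm = ToyKvm()
with mmu_section(kvm):
    kvm.mark_tlb_dirty()               # e.g. an spte was zapped in here
```

Pairing begin and end in one wrapper is just one way to keep every
unlock site followed by the conditional flush.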

We need something for lockbreaking too:

   def mmu_lockbreak():
       if not (contended or need_resched):
           return False
       remember flush counter
       cond_resched_lock
       return flush counter changed

The caller would check the return value to see if it needs to redo
anything.  But this has the danger of long operations (like write
protecting a slot) never completing.
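
A Python model of the lockbreak helper, as a sketch only. ToyKvm, the
contended flag (standing in for the contended/need_resched check) and
the while_unlocked hook are all invented for illustration, and dropping
and retaking a threading.Lock stands in for cond_resched_lock. I am
assuming the flush counter is bumped by whichever other context gets in
while the lock is released.

```python
import threading


class ToyKvm:
    """Toy stand-in for struct kvm; every field here is an assumption."""

    def __init__(self):
        self.mmu_lock = threading.Lock()
        self.flush_counter = 0
        self.contended = False         # stand-in for "contended or need_resched"


def mmu_lockbreak(kvm, while_unlocked=None):
    """Maybe drop and retake mmu_lock; return True if the caller must redo work."""
    if not kvm.contended:
        return False
    seen = kvm.flush_counter           # remember flush counter
    kvm.mmu_lock.release()             # models cond_resched_lock: drop the lock...
    if while_unlocked is not None:
        while_unlocked()               # ...other contexts run here...
    kvm.mmu_lock.acquire()             # ...and take it back
    kvm.contended = False
    return kvm.flush_counter != seen   # flush counter changed?


kvm = ToyKvm()
kvm.mmu_lock.acquire()

no_break = mmu_lockbreak(kvm)          # nobody waiting: lock is never dropped


def other_context():
    # Another context takes the lock and does something that bumps the counter.
    with kvm.mmu_lock:
        kvm.flush_counter += 1


kvm.contended = True
must_redo = mmu_lockbreak(kvm, other_context)
kvm.mmu_lock.release()
```

A caller that loops until the helper returns False has no bound on the
number of redos, which is the non-completion danger noted above.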

-- 
error compiling committee.c: too many arguments to function
