[ Posting again after adjusting the format, since the mailing list refused
to deliver my previous message. ]

On Aug 8, 2013, at 11:06 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:

> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
>> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
>>> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>>>> Make sure we can see the writable spte before the dirty bitmap is
>>>> visible.
>>>>
>>>> We do this because kvm_vm_ioctl_get_dirty_log() write-protects sptes
>>>> based on the dirty bitmap, so we should ensure the writable spte can
>>>> be found in the rmap before the dirty bitmap is visible. Otherwise,
>>>> we may clear the dirty bitmap but fail to write-protect the page.
>>>>
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx>
>>>> ---
>>>>  arch/x86/kvm/mmu.c | 6 +++---
>>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> Can you explain why this is safe, with regard to the rule
>>> at edde99ce05290e50?
>>
>> BTW, this log fixed this case:
>>
>> VCPU 0                            KVM migration control
>>
>>                                   write-protects all pages
>> #PF happens, the page becomes
>> writable; set the dirty bit
>> in the bitmap
>>
>>                                   swap the bitmap; the current bitmap
>>                                   is empty
>>
>> write the page (no dirty log)
>>
>>                                   stop the guest and push the
>>                                   remaining dirty pages
>> stopped
>>                                   see that the current bitmap is empty,
>>                                   i.e. conclude that no page is dirty
>>>
>>> "The rule is that all pages are either dirty in the current bitmap,
>>> or write-protected, which is violated here."
>>
>> Actually, this rule is not completely true; there is a third case:
>> the window between writing the guest page and setting the dirty bitmap
>> is valid. In that window, the page is writable but not dirty-logged.
>>
>> This case is based on the fact that, at the final step of live
>> migration, kvm should stop the guest and push the remaining dirty
>> pages to the destination.
>>
>> There are some examples in the current code:
>>
>> example 1, in fast_pf_fix_direct_spte():
>>
>> 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>> 		/* The window is here... */
>> 		mark_page_dirty(vcpu->kvm, gfn);
>>
>> example 2, in kvm_write_guest_page():
>>
>> 	r = __copy_to_user((void __user *)addr + offset, data, len);
>> 	if (r)
>> 		return -EFAULT;
>> 	/*
>> 	 * The window is here: the page is dirty but not yet logged in
>> 	 * the bitmap.
>> 	 */
>> 	mark_page_dirty(kvm, gfn);
>> 	return 0;
>>
>>>
>>> Overall, please document what is the required order of operations for
>>> both set_spte and get_dirty_log and why this order is safe (right on
>>> top of the code).
>>
>> Okay.
>>
>> The order we require here is: we should 1) set the spte to writable
>> __before__ setting the dirty bitmap, and 2) add the spte into the rmap
>> __before__ setting the dirty bitmap.
>>
>> Point 1) is the same as fast_pf_fix_direct_spte(), which i explained
>> above. Points 1) and 2) together ensure that we can find the spte in
>> the rmap and see that it is writable when we do the lockless
>> write-protection; otherwise these cases can happen:
>>
>> VCPU 0                            kvm ioctl doing get-dirty-pages
>>
>> mark_page_dirty(gfn), which
>> sets the gfn in the dirty bitmap
>>
>>                                   mask = xchg(dirty_bitmap, 0)
>>
>>                                   walk all gfns set in "mask" and
>>                                   locklessly write-protect each gfn,
>>                                   then walk its rmap and see no spte
>>                                   there
>>
>> add the spte into the rmap
>>
>> !!!!!! Then the page can be freely written but is not recorded in the
>> dirty bitmap.
>>
>> Or:
>>
>> VCPU 0                            kvm ioctl doing get-dirty-pages
>>
>> mark_page_dirty(gfn), which
>> sets the gfn in the dirty bitmap
>>
>> add the spte into the rmap
>>
>>                                   mask = xchg(dirty_bitmap, 0)
>>
>>                                   walk all gfns set in "mask" and
>>                                   locklessly write-protect each gfn,
>>                                   then walk its rmap; the spte is in
>>                                   the rmap but it is readonly or
>>                                   nonpresent
>>
>> mark the spte writable
>>
>> !!!!!! Then the page can be freely written but is not recorded in the
>> dirty bitmap.
>>
>> Hopefully, i have clarified it; if you have any questions, please let
>> me know.
>
> Yes, partially. Please, on top of the explanation above, have something
> along the lines of the
>
> "The flush IPI assumes that a thread switch happens in this order"
> comment at arch/x86/mm/tlb.c, or the
>
> "With get_user_pages_fast, we walk down the pagetables without taking"
> comment at arch/x86/mm/gup.c.

Marcelo, thanks for your suggestion, i will improve both the changelog and
the comments in the next version. Roughly, the comment i have in mind
reads like the sketch right below.
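This is only a sketch of the required ordering, not the final patch:
make_writable() is a placeholder name for the real spte computation, and
the explicit smp_wmb() may turn out to be implied by the primitives the
final code uses:

	/* fault path: */
	spte = make_writable(spte);         /* placeholder for the spte computation */
	mmu_spte_update(sptep, spte);       /* 1) the spte becomes writable         */
	rmap_add(vcpu, sptep, gfn);         /* 2) the spte is visible in the rmap   */
	smp_wmb();                          /* order 1) and 2) before 3)            */
	mark_page_dirty(vcpu->kvm, gfn);    /* 3) the dirty bit becomes visible     */

	/*
	 * get-dirty-pages side: the xchg implies a full memory barrier,
	 * so once a gfn is seen in "mask", the writable spte is already
	 * in the rmap and the lockless write-protection cannot miss it.
	 */
	mask = xchg(dirty_bitmap, 0);
	kvm_mmu_write_protect_pt_masked(kvm, memslot, gfn_offset, mask);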
>
> What about the relation between the read-only spte and the respective
> TLB flush? That is:
>
> vcpu0                       vcpu1
>
> lockless write protect
> mark spte read-only
>                             either some mmu-lockless path or not
>                             write protect page:
>                             find read-only page
>                             assumes tlb is flushed

The tlb flush on the mmu-lockless paths does not depend on the spte;
there are two lockless paths: one is kvm_mmu_slot_remove_write_access(),
which unconditionally flushes the tlb, and the other is
kvm_vm_ioctl_get_dirty_log(), which flushes the tlb based on the dirty
bitmap (it flushes if any page is dirty).

Under the protection of mmu-lock, since we flush the tlb whenever a spte
is zapped, we only need to care about the case of updating a spte, which
is fixed in "[PATCH 06/12] KVM: MMU: flush tlb if the spte can be
locklessly modified". That patch changes the flush condition to:

-	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+	if (spte_is_locklessly_modifiable(old_spte) &&
+	    !is_writable_pte(new_spte))

That means whenever a spte which can potentially be locklessly modified
becomes readonly, we do the tlb flush (see the sketch at the end of this
mail for how this condition sits in context).

>
> kvm_flush_remote_tlbs
>
> In general, i think write protection is a good candidate for lockless
> operation. However, i would also be concerned about the larger problem
> of mmu-lock contention and how to solve it. Did you ever consider
> splitting it?

Yes, i did and actually am still working on that. :)
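P.S. The sketch referenced above: this is a paraphrase of the patch 06
change for illustration, not the exact hunk, and the body of
spte_is_locklessly_modifiable() is quoted from memory:

	static bool spte_is_locklessly_modifiable(u64 spte)
	{
		/*
		 * Only a spte with both SPTE_HOST_WRITEABLE and
		 * SPTE_MMU_WRITEABLE set can be made writable by the
		 * lockless fast page fault path.
		 */
		return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
			(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
	}

	/* in mmu_spte_update(): */
	if (spte_is_locklessly_modifiable(old_spte) &&
	    !is_writable_pte(new_spte))
		/*
		 * The spte may have been locklessly made writable, so a
		 * stale writable tlb entry can exist on another vcpu;
		 * the caller must flush remote tlbs before relying on
		 * the write protection.
		 */
		ret = true;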