Re: [PATCH 3/6] KVM: Dirty memory tracking for performant checkpointing and improved live migration

"Cao, Lei" <Lei.Cao@xxxxxxxxxxx> · Wed, 4 May 2016 13:51:57 +0000

On 5/4/2016 9:13 AM, Radim Krčmář wrote:
> 2016-05-04 19:45+1200, Huang, Kai:
>> On 5/4/2016 2:11 AM, Radim Krčmář wrote:
>>> 2016-05-03 18:06+1200, Huang, Kai:
>>>> Actually my concern is, with your new mechanism to track guest dirty pages,
>>>> there will be two logdirty mechanisms (using bitmap and your per-vcpu list),
>>>> which I think is not good as it's a little bit redundant, given both
>>>> mechanisms are used for dirty page logging.
>>>>
>>>> I think your main concern of current bitmap mechanism is scanning bitmap
>>>> takes lots of time, especially when only few pages get dirty, you still have
>>>> to scan the entire bitmap, which results in bad performance if you runs
>>>> checkpoint very frequently. My suggestion is, instead of introducing two
>>>> logdirty data structures, maybe you can try to use another more efficient
>>>> data structure instead of bitmap for both current logdirty mechanism and
>>>> your new interfaces. Maybe Xen's log-dirty tree is a good reference.
>>>
>>> A sparse structure (buffer, tree, ...) also needs a mechanism to grow
>>> (store new entries), so concurrent accesses become a problem, because
>>> there has to be synchronization.  I think that per-vcpu structure
>>> becomes mandatory when thousands VCPUs dirty memory at the same time.
>>
>> Yes synchronization will be needed. But even for per-vcpu structure, we
>> still need per-vcpu lock to access, say, gfn_list, right? For example, one
>> thread from userspace trying to get and clear dirty pages would need to loop
>> all vcpus and acquire each vcpu's lock for gfn_list. (see function
>> mt_reset_all_gfns in patch 3/6). Looks this is not scalable neither?
> 
> Coarse locking is optional.  The list can be designed allow concurrent
> addition and removal (circullar buffer with 3 atomic markers).
> 

Agreed. 

Kai, the locking in mt_reset_all_gfns is read lock on mmu_lock, so there is no 
synchronization among userspace threads when clearing dirty pages. We changed it
from spin_lock to avoid the mmu_lock bottleneck.

> If we had 'vcpu -> memslot -> structure' then we would design the
> userspace interface so it would only affect one memslot, which would
> avoid any scalability issue even if there was a vcpu+memslot lock in
> each structure.
> 
>>>>                      Maybe Xen's log-dirty tree is a good reference.
>>>
>>> Is there some top-level overview?
>>>
>>>> From a glance at the code, it looked like GPA bitmap sparsified with
>>> radix tree in a manner similar to the page table hierarchy.
>>
>> Yes it is just a radix tree. The point is the tree will be pretty small if
>> there are few dirty pages, so the scanning will be very quick, comparing to
>> bitmap.
> 
> Bitmap had slow scanning, but any dynamic structure will have problems
> with insertion ...
> 
> I think the tree might work if we pre-allotected bigger chunks to avoid
> allocation overhead and made it "lockless" (fine grained locking using
> cmpxchg) to avoid a bottleneck for concurrent writes.
> 
>>> We should have dynamic sparse dirty log, to avoid wasting memory when
>>> there are many small memslots, but a linear structure is probably still
>>> fine.
>>
>> The sparse dirty log structure can be allocated when necessary so I don't
>> think it will waste of memory. Take radix tree as example, if there's no
>> dirty page in the slot, the pointer to radix can be NULL, or just root
>> entry.
> 
> (And we want to waste some memory, because allocations are slow,
>  tradeoffs, tradeoffs ...)
> 

Right, we want to avoid dynamic memory allocation/free during checkpoint cycles,
which is performance critical.

>>> We don't care which vcpu dirtied the page, so it seems like a waste to
>>> have them in the hierarchy, but I can't think of designs where the
>>> sparse dirty log is rooted in memslot and its updates scale well.
>>>
>>> 'memslot -> sparse dirty log' usually evolve into buffering on the VCPU
>>> side before writing to the memslot or aren't efficient for sparse
>>> dataset.
>>>
>>> Where do you think is the balance between 'memslot -> bitmap' and
>>> 'vcpu -> memslot -> dirty buffer'?
>>
>> In my opinion, we can first try 'memslot -> sparse dirty log'. Cao, Lei
>> mentioned there were two bottlenecks: bitmap and bad multithread performance
>> due to mmu_lock. I think 'memslot->sparse dirty log' might help to improve
>> or solve the bitmap one.
> 
> The bimap was chosen because it scales well with concurrent writes and
> it easy to export.  Lei already hit scalability issues with mmu_lock, so
> I don't expect that we could afford to put all VCPUs onto one lock
> elsewhere.
> 
> Good designs so far seem to be:
>  memslot -> lockless radix tree
> and
>  vcpu -> memslot -> list  (memslot -> vcpu -> list)
> 
> I'd like to see the lockless radix tree, but I expect the per-vcpu list
> to be *much* easier to implment.
> 
> Do you see other designs on the pareto front?
> 
> Thanks.
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html