Re: [PATCH 3/6] KVM: Dirty memory tracking for performant checkpointing and improved live migration

"Cao, Lei" <Lei.Cao@xxxxxxxxxxx> · Wed, 4 May 2016 18:33:03 +0000

On 5/4/2016 1:15 PM, Cao, Lei wrote:
> On 5/4/2016 9:13 AM, Radim Krčmář wrote:
>> 2016-05-04 19:45+1200, Huang, Kai:
>>> On 5/4/2016 2:11 AM, Radim Krčmář wrote:
>>>> 2016-05-03 18:06+1200, Huang, Kai:
>>>>> Actually my concern is, with your new mechanism to track guest dirty pages,
>>>>> there will be two logdirty mechanisms (using bitmap and your per-vcpu list),
>>>>> which I think is not good as it's a little bit redundant, given both
>>>>> mechanisms are used for dirty page logging.
>>>>>
>>>>> I think your main concern of current bitmap mechanism is scanning bitmap
>>>>> takes lots of time, especially when only few pages get dirty, you still have
>>>>> to scan the entire bitmap, which results in bad performance if you runs
>>>>> checkpoint very frequently. My suggestion is, instead of introducing two
>>>>> logdirty data structures, maybe you can try to use another more efficient
>>>>> data structure instead of bitmap for both current logdirty mechanism and
>>>>> your new interfaces. Maybe Xen's log-dirty tree is a good reference.
>>>>
>>>> A sparse structure (buffer, tree, ...) also needs a mechanism to grow
>>>> (store new entries), so concurrent accesses become a problem, because
>>>> there has to be synchronization.  I think that per-vcpu structure
>>>> becomes mandatory when thousands VCPUs dirty memory at the same time.
>>>
>>> Yes synchronization will be needed. But even for per-vcpu structure, we
>>> still need per-vcpu lock to access, say, gfn_list, right? For example, one
>>> thread from userspace trying to get and clear dirty pages would need to loop
>>> all vcpus and acquire each vcpu's lock for gfn_list. (see function
>>> mt_reset_all_gfns in patch 3/6). Looks this is not scalable neither?
>>
>> Coarse locking is optional.  The list can be designed allow concurrent
>> addition and removal (circullar buffer with 3 atomic markers).
>>
>> If we had 'vcpu -> memslot -> structure' then we would design the
>> userspace interface so it would only affect one memslot, which would
>> avoid any scalability issue even if there was a vcpu+memslot lock in
>> each structure.
>>
>>>>>                      Maybe Xen's log-dirty tree is a good reference.
>>>>
>>>> Is there some top-level overview?
>>>>
>>>>> From a glance at the code, it looked like GPA bitmap sparsified with
>>>> radix tree in a manner similar to the page table hierarchy.
>>>
>>> Yes it is just a radix tree. The point is the tree will be pretty small if
>>> there are few dirty pages, so the scanning will be very quick, comparing to
>>> bitmap.
>>
>> Bitmap had slow scanning, but any dynamic structure will have problems
>> with insertion ...
>>
>> I think the tree might work if we pre-allotected bigger chunks to avoid
>> allocation overhead and made it "lockless" (fine grained locking using
>> cmpxchg) to avoid a bottleneck for concurrent writes.
>>
>>>> We should have dynamic sparse dirty log, to avoid wasting memory when
>>>> there are many small memslots, but a linear structure is probably still
>>>> fine.
>>>
>>> The sparse dirty log structure can be allocated when necessary so I don't
>>> think it will waste of memory. Take radix tree as example, if there's no
>>> dirty page in the slot, the pointer to radix can be NULL, or just root
>>> entry.
>>
>> (And we want to waste some memory, because allocations are slow,
>>  tradeoffs, tradeoffs ...)
>>
>>>> We don't care which vcpu dirtied the page, so it seems like a waste to
>>>> have them in the hierarchy, but I can't think of designs where the
>>>> sparse dirty log is rooted in memslot and its updates scale well.
>>>>
>>>> 'memslot -> sparse dirty log' usually evolve into buffering on the VCPU
>>>> side before writing to the memslot or aren't efficient for sparse
>>>> dataset.
>>>>
>>>> Where do you think is the balance between 'memslot -> bitmap' and
>>>> 'vcpu -> memslot -> dirty buffer'?
>>>
>>> In my opinion, we can first try 'memslot -> sparse dirty log'. Cao, Lei
>>> mentioned there were two bottlenecks: bitmap and bad multithread performance
>>> due to mmu_lock. I think 'memslot->sparse dirty log' might help to improve
>>> or solve the bitmap one.
>>
>> The bimap was chosen because it scales well with concurrent writes and
>> it easy to export.  Lei already hit scalability issues with mmu_lock, so
>> I don't expect that we could afford to put all VCPUs onto one lock
>> elsewhere.
>>
>> Good designs so far seem to be:
>>  memslot -> lockless radix tree
>> and
>>  vcpu -> memslot -> list  (memslot -> vcpu -> list)
>>
> 
> There is no need for lookup, the dirty log is fetched in sequence, so why use
> radix tree with added complexity but no benefit?
> 
> List can be designed to be lockless, so memslot -> lockless fixed list?

Never mind, lookup is needed to avoid duplicates. We can use list+bitmap, but
it's obviously not as efficient as radix tree.

>  
>> I'd like to see the lockless radix tree, but I expect the per-vcpu list
>> to be *much* easier to implment.
>>
>> Do you see other designs on the pareto front?
>>
>> Thanks.
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html