Re: [PATCH 3/6] KVM: Dirty memory tracking for performant checkpointing and improved live migration

"Huang, Kai" <kai.huang@xxxxxxxxxxxxxxx> · Wed, 4 May 2016 19:45:08 +1200

On 5/4/2016 2:11 AM, Radim Krčmář wrote:
2016-05-03 18:06+1200, Huang, Kai:
Actually my concern is, with your new mechanism to track guest dirty pages,
there will be two logdirty mechanisms (using bitmap and your per-vcpu list),
which I think is not good as it's a little bit redundant, given both
mechanisms are used for dirty page logging.

I think your main concern of current bitmap mechanism is scanning bitmap
takes lots of time, especially when only few pages get dirty, you still have
to scan the entire bitmap, which results in bad performance if you runs
checkpoint very frequently. My suggestion is, instead of introducing two
logdirty data structures, maybe you can try to use another more efficient
data structure instead of bitmap for both current logdirty mechanism and
your new interfaces. Maybe Xen's log-dirty tree is a good reference.

A sparse structure (buffer, tree, ...) also needs a mechanism to grow
(store new entries), so concurrent accesses become a problem, because
there has to be synchronization.  I think that per-vcpu structure
becomes mandatory when thousands VCPUs dirty memory at the same time.

Yes synchronization will be needed. But even for per-vcpu structure, we 
still need per-vcpu lock to access, say, gfn_list, right? For example, 
one thread from userspace trying to get and clear dirty pages would need 
to loop all vcpus and acquire each vcpu's lock for gfn_list. (see 
function mt_reset_all_gfns in patch 3/6). Looks this is not scalable 
neither?

                     Maybe Xen's log-dirty tree is a good reference.

Is there some top-level overview?

From a glance at the code, it looked like GPA bitmap sparsified with
radix tree in a manner similar to the page table hierarchy.

Yes it is just a radix tree. The point is the tree will be pretty small 
if there are few dirty pages, so the scanning will be very quick, 
comparing to bitmap.

Of course this is just my concern and I'll leave it to maintainers.

I too would prefer if both userspace interfaces used a common backend.
A possible backend for that is

  vcpu -> memslot -> sparse dirty log

This is the most reasonable proposal I think, at least for the first 
step to improve performance.

We should have dynamic sparse dirty log, to avoid wasting memory when
there are many small memslots, but a linear structure is probably still
fine.

The sparse dirty log structure can be allocated when necessary so I 
don't think it will waste of memory. Take radix tree as example, if 
there's no dirty page in the slot, the pointer to radix can be NULL, or 
just root entry.

We don't care which vcpu dirtied the page, so it seems like a waste to
have them in the hierarchy, but I can't think of designs where the
sparse dirty log is rooted in memslot and its updates scale well.

'memslot -> sparse dirty log' usually evolve into buffering on the VCPU
side before writing to the memslot or aren't efficient for sparse
dataset.

Where do you think is the balance between 'memslot -> bitmap' and
'vcpu -> memslot -> dirty buffer'?

In my opinion, we can first try 'memslot -> sparse dirty log'. Cao, Lei 
mentioned there were two bottlenecks: bitmap and bad multithread 
performance due to mmu_lock. I think 'memslot->sparse dirty log' might 
help to improve or solve the bitmap one.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html