On 5/4/2016 1:15 PM, Cao, Lei wrote: > On 5/4/2016 9:13 AM, Radim Krčmář wrote: >> 2016-05-04 19:45+1200, Huang, Kai: >>> On 5/4/2016 2:11 AM, Radim Krčmář wrote: >>>> 2016-05-03 18:06+1200, Huang, Kai: >>>>> Actually my concern is, with your new mechanism to track guest dirty pages, >>>>> there will be two logdirty mechanisms (using bitmap and your per-vcpu list), >>>>> which I think is not good as it's a little bit redundant, given both >>>>> mechanisms are used for dirty page logging. >>>>> >>>>> I think your main concern of current bitmap mechanism is scanning bitmap >>>>> takes lots of time, especially when only few pages get dirty, you still have >>>>> to scan the entire bitmap, which results in bad performance if you runs >>>>> checkpoint very frequently. My suggestion is, instead of introducing two >>>>> logdirty data structures, maybe you can try to use another more efficient >>>>> data structure instead of bitmap for both current logdirty mechanism and >>>>> your new interfaces. Maybe Xen's log-dirty tree is a good reference. >>>> >>>> A sparse structure (buffer, tree, ...) also needs a mechanism to grow >>>> (store new entries), so concurrent accesses become a problem, because >>>> there has to be synchronization. I think that per-vcpu structure >>>> becomes mandatory when thousands VCPUs dirty memory at the same time. >>> >>> Yes synchronization will be needed. But even for per-vcpu structure, we >>> still need per-vcpu lock to access, say, gfn_list, right? For example, one >>> thread from userspace trying to get and clear dirty pages would need to loop >>> all vcpus and acquire each vcpu's lock for gfn_list. (see function >>> mt_reset_all_gfns in patch 3/6). Looks this is not scalable neither? >> >> Coarse locking is optional. The list can be designed allow concurrent >> addition and removal (circullar buffer with 3 atomic markers). >> >> If we had 'vcpu -> memslot -> structure' then we would design the >> userspace interface so it would only affect one memslot, which would >> avoid any scalability issue even if there was a vcpu+memslot lock in >> each structure. >> >>>>> Maybe Xen's log-dirty tree is a good reference. >>>> >>>> Is there some top-level overview? >>>> >>>>> From a glance at the code, it looked like GPA bitmap sparsified with >>>> radix tree in a manner similar to the page table hierarchy. >>> >>> Yes it is just a radix tree. The point is the tree will be pretty small if >>> there are few dirty pages, so the scanning will be very quick, comparing to >>> bitmap. >> >> Bitmap had slow scanning, but any dynamic structure will have problems >> with insertion ... >> >> I think the tree might work if we pre-allotected bigger chunks to avoid >> allocation overhead and made it "lockless" (fine grained locking using >> cmpxchg) to avoid a bottleneck for concurrent writes. >> >>>> We should have dynamic sparse dirty log, to avoid wasting memory when >>>> there are many small memslots, but a linear structure is probably still >>>> fine. >>> >>> The sparse dirty log structure can be allocated when necessary so I don't >>> think it will waste of memory. Take radix tree as example, if there's no >>> dirty page in the slot, the pointer to radix can be NULL, or just root >>> entry. >> >> (And we want to waste some memory, because allocations are slow, >> tradeoffs, tradeoffs ...) >> >>>> We don't care which vcpu dirtied the page, so it seems like a waste to >>>> have them in the hierarchy, but I can't think of designs where the >>>> sparse dirty log is rooted in memslot and its updates scale well. >>>> >>>> 'memslot -> sparse dirty log' usually evolve into buffering on the VCPU >>>> side before writing to the memslot or aren't efficient for sparse >>>> dataset. >>>> >>>> Where do you think is the balance between 'memslot -> bitmap' and >>>> 'vcpu -> memslot -> dirty buffer'? >>> >>> In my opinion, we can first try 'memslot -> sparse dirty log'. Cao, Lei >>> mentioned there were two bottlenecks: bitmap and bad multithread performance >>> due to mmu_lock. I think 'memslot->sparse dirty log' might help to improve >>> or solve the bitmap one. >> >> The bimap was chosen because it scales well with concurrent writes and >> it easy to export. Lei already hit scalability issues with mmu_lock, so >> I don't expect that we could afford to put all VCPUs onto one lock >> elsewhere. >> >> Good designs so far seem to be: >> memslot -> lockless radix tree >> and >> vcpu -> memslot -> list (memslot -> vcpu -> list) >> > > There is no need for lookup, the dirty log is fetched in sequence, so why use > radix tree with added complexity but no benefit? > > List can be designed to be lockless, so memslot -> lockless fixed list? Never mind, lookup is needed to avoid duplicates. We can use list+bitmap, but it's obviously not as efficient as radix tree. > >> I'd like to see the lockless radix tree, but I expect the per-vcpu list >> to be *much* easier to implment. >> >> Do you see other designs on the pareto front? >> >> Thanks. >> > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html