On 5/4/2016 9:13 AM, Radim Krčmář wrote: > 2016-05-04 19:45+1200, Huang, Kai: >> On 5/4/2016 2:11 AM, Radim Krčmář wrote: >>> 2016-05-03 18:06+1200, Huang, Kai: >>>> Actually my concern is, with your new mechanism to track guest dirty pages, >>>> there will be two logdirty mechanisms (using bitmap and your per-vcpu list), >>>> which I think is not good as it's a little bit redundant, given both >>>> mechanisms are used for dirty page logging. >>>> >>>> I think your main concern of current bitmap mechanism is scanning bitmap >>>> takes lots of time, especially when only few pages get dirty, you still have >>>> to scan the entire bitmap, which results in bad performance if you runs >>>> checkpoint very frequently. My suggestion is, instead of introducing two >>>> logdirty data structures, maybe you can try to use another more efficient >>>> data structure instead of bitmap for both current logdirty mechanism and >>>> your new interfaces. Maybe Xen's log-dirty tree is a good reference. >>> >>> A sparse structure (buffer, tree, ...) also needs a mechanism to grow >>> (store new entries), so concurrent accesses become a problem, because >>> there has to be synchronization. I think that per-vcpu structure >>> becomes mandatory when thousands VCPUs dirty memory at the same time. >> >> Yes synchronization will be needed. But even for per-vcpu structure, we >> still need per-vcpu lock to access, say, gfn_list, right? For example, one >> thread from userspace trying to get and clear dirty pages would need to loop >> all vcpus and acquire each vcpu's lock for gfn_list. (see function >> mt_reset_all_gfns in patch 3/6). Looks this is not scalable neither? > > Coarse locking is optional. The list can be designed allow concurrent > addition and removal (circullar buffer with 3 atomic markers). > > If we had 'vcpu -> memslot -> structure' then we would design the > userspace interface so it would only affect one memslot, which would > avoid any scalability issue even if there was a vcpu+memslot lock in > each structure. > >>>> Maybe Xen's log-dirty tree is a good reference. >>> >>> Is there some top-level overview? >>> >>>> From a glance at the code, it looked like GPA bitmap sparsified with >>> radix tree in a manner similar to the page table hierarchy. >> >> Yes it is just a radix tree. The point is the tree will be pretty small if >> there are few dirty pages, so the scanning will be very quick, comparing to >> bitmap. > > Bitmap had slow scanning, but any dynamic structure will have problems > with insertion ... > > I think the tree might work if we pre-allotected bigger chunks to avoid > allocation overhead and made it "lockless" (fine grained locking using > cmpxchg) to avoid a bottleneck for concurrent writes. > >>> We should have dynamic sparse dirty log, to avoid wasting memory when >>> there are many small memslots, but a linear structure is probably still >>> fine. >> >> The sparse dirty log structure can be allocated when necessary so I don't >> think it will waste of memory. Take radix tree as example, if there's no >> dirty page in the slot, the pointer to radix can be NULL, or just root >> entry. > > (And we want to waste some memory, because allocations are slow, > tradeoffs, tradeoffs ...) > >>> We don't care which vcpu dirtied the page, so it seems like a waste to >>> have them in the hierarchy, but I can't think of designs where the >>> sparse dirty log is rooted in memslot and its updates scale well. >>> >>> 'memslot -> sparse dirty log' usually evolve into buffering on the VCPU >>> side before writing to the memslot or aren't efficient for sparse >>> dataset. >>> >>> Where do you think is the balance between 'memslot -> bitmap' and >>> 'vcpu -> memslot -> dirty buffer'? >> >> In my opinion, we can first try 'memslot -> sparse dirty log'. Cao, Lei >> mentioned there were two bottlenecks: bitmap and bad multithread performance >> due to mmu_lock. I think 'memslot->sparse dirty log' might help to improve >> or solve the bitmap one. > > The bimap was chosen because it scales well with concurrent writes and > it easy to export. Lei already hit scalability issues with mmu_lock, so > I don't expect that we could afford to put all VCPUs onto one lock > elsewhere. > > Good designs so far seem to be: > memslot -> lockless radix tree > and > vcpu -> memslot -> list (memslot -> vcpu -> list) > There is no need for lookup, the dirty log is fetched in sequence, so why use radix tree with added complexity but no benefit? List can be designed to be lockless, so memslot -> lockless fixed list? > I'd like to see the lockless radix tree, but I expect the per-vcpu list > to be *much* easier to implment. > > Do you see other designs on the pareto front? > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html