On 1/5/2017 5:16 AM, Paolo Bonzini wrote:
>
>
> On 04/01/2017 21:39, Cao, Lei wrote:
>> This patch series adds memory tracking support for performant
>> checkpointing solutions. It can also be used by live migration
>> to improve predictability.
>>
>> Introduction
>>
>> Brendan Cully's Remus project white paper is one of the best written on
>> the subject of fault tolerance using checkpoint/rollback techniques and
>> is the best place to start for a general background.
>> (http://www.cs.ubc.ca/~andy/papers/remus-nsdi-final.pdf)
>> It gives a great outline of the basic requirements and characteristics
>> of a checkpointed system, including a few of the performance issues.
>> But Remus did not go far enough in the area of system performance for
>> commercial production.
>>
>> This patch series addresses a known bottleneck and limitation in a
>> checkpointed system: the use of large bitmaps to track dirty memory.
>> These bitmaps are copied to userspace when userspace queries KVM for
>> its dirty page information. The use of bitmaps makes sense in the
>> live-migration method, as it is possible for all of memory to be dirtied
>> from one log-dirty pass to another. But in a checkpointed system, the
>> number of dirty pages is bounded: the VM is paused once it has
>> dirtied a pre-defined number of pages. Traversing a large, sparsely
>> populated bitmap to find the set bits is time-consuming, as is copying
>> the bitmap to userspace.
>>
>> The preferred data structure for performant checkpointing solutions is
>> a dense list of guest frame numbers (GFNs). This patch series stores
>> the dirty list in kernel memory that can be memory-mapped into
>> userspace to allow speedy harvesting. Three new ioctls are implemented
>> to allow userspace to set the dirty log size, to get the dirty count,
>> and to rearm the dirty traps. See patch 4 for details.
>>
>> These modifications, together with further modifications to qemu, have
>> allowed us to run checkpoint cycles at rates up to 2500 per second,
>> while still allowing the VM to get useful work done.
>>
>> Design Goals
>>
>> The patch series does not change or remove any existing KVM
>> functionality. It represents only additional functions (ioctls) into
>> KVM from user space, and these changes coexist with the current dirty
>> memory logging facilities. It is possible to run multiple guests such
>> that some of the guests perform live migration using the existing
>> memory logging mechanism and others migrate or run in fault tolerant
>> mode using the new memory tracking functions.
>
> Hi,
>
> before I review the patches, I have a few notes on the organization of
> the series and especially on the data structures.
>
> Regarding the formatting of the patch series, please give different
> subjects to the four patches. I see that you are using Office365 from
> the headers; if you are using IMAP too, you may want to look into "git
> imap-send".
>
> Regarding the ordering of the patches, the documentation should be
> included in the same patch that defines the new ioctls and data
> structure.
>
> Regarding the API, I think that KVM_GET_DIRTY_COUNT is not needed;
> userspace can do the same directly. And since what you have is
> basically a ring buffer, I would like to have it be *exactly* a ring
> buffer. In particular:
>
> - there is no need to always reset the indices to zero on
> KVM_RESET_DIRTY_PAGES.
>
> - dirty_index should _not_ be in the user-space data structure. This
> makes it safer - for example, you are not masking the accesses through
> dirty_index, but user-space could put an out-of-bounds value there.
> Each update to dirty_index can also update avail_index immediately,
> without waiting for KVM_GET_DIRTY_COUNT.
>
> - there are effectively two consumers (userspace and kernel) and the
> kernel always lags behind userspace.
> So there should be two
> fetch_indexes: one that is read by KVM_RESET_DIRTY_PAGES (in gfn_list_t)
> and one that is written by KVM_RESET_DIRTY_PAGES (kept elsewhere). The
> kernel's copy of fetch_index is also consulted to check for a full ring
> buffer (with some leeway).
>
> - KVM_RESET_DIRTY_PAGES need not take the mmu_lock (it's a spinlock, so
> you'd have to do things such as spin_needbreak); it only needs a mutex
> (kvm->slots_lock will do) to protect against concurrent resets. You
> only need to take kvm->mmu_lock around calls to
> kvm_arch_mmu_enable_log_dirty_pt_masked.
>
> Because a ring buffer has a power-of-2 size, there is a problem of
> where to put the indices without wasting a lot of memory for padding.
> There are two solutions that I can think of.
>
> One solution (a bit hacky, but it works) is to define
>
>     struct kvm_dirty_gfn {
>         __u32 pad;
>         __u32 slot;  /* as_id | slot_id */
>         __u64 offset;
>     };
>
>     struct kvm_dirty_log_list {
>         union {
>             struct {
>                 __u16 prod;  /* avail_index */
>                 __u16 cons;  /* user fetch_index */
>             } indices;
>             struct kvm_dirty_gfn dirty_gfns[0];
>         };
>     };
>
> In other words, the indices are placed in the first four bytes of the
> ring, which would be unused anyway (they overlap the pad field of the
> first kvm_dirty_gfn). And everything is hidden behind an unnamed union.
>
> With this change, KVM_GET_DIRTY_COUNT is not needed anymore.
> KVM_RESET_DIRTY_PAGES takes the userspace fetch_index and processes
> pages from the stored kernel fetch_index up to the userspace
> fetch_index.
>
> Please split the gfn_list_t data structure into separate files
> virt/kvm/gfnring.c and include/linux/kvm_gfnring.h, with an abstract
> data type and a proper split of the kernel and userspace data
> structures.
>
>     struct kvm_gfn_ring_t {
>         u16 prod_index;  /* dirty_index */
>         u16 cons_index;  /* kernel fetch_index */
>         u32 size;
>
>         /* This should be vmalloced. */
>         struct kvm_dirty_gfn *data;
>     };
>
>     int kvm_gfn_ring_alloc(struct kvm_gfn_ring_t *gfnring,
>                            u32 size);
>
>     /*
>      * called with kvm->slots_lock held, returns true
>      * if at least one page was processed.
>      */
>     bool kvm_gfn_ring_reset(struct kvm *kvm,
>                             struct kvm_gfn_ring_t *gfnring);
>
>     /*
>      * returns number of items in FIFO after adding the element,
>      * which is just prod_index - cons_index, or -EBUSY if it
>      * was full and the element could not be added.
>      */
>     int kvm_gfn_ring_push(struct kvm *kvm,
>                           struct kvm_gfn_ring_t *ring,
>                           struct kvm_memory_slot *slot,
>                           struct kvm_dirty_gfn *datum);
>
>     /* for use in vm_operations_struct */
>     struct page *kvm_gfn_ring_get_page(struct kvm_gfn_ring_t *ring,
>                                        u32 i);
>
>     void kvm_gfn_ring_free(struct kvm_gfn_ring_t *ring);
>
> Defining this data type can then be the first patch in the series.
>
> Paolo

Thanks a lot for all the feedback! I'll rework the patches.

Lei