On 04/01/2017 21:39, Cao, Lei wrote:
> This patch series adds memory tracking support for performant
> checkpointing solutions. It can also be used by live migration
> to improve predictability.
>
> Introduction
>
> Brendan Cully's Remus project white paper is one of the best written on
> the subject of fault tolerance using checkpoint/rollback techniques and
> is the best place to start for a general background.
> (http://www.cs.ubc.ca/~andy/papers/remus-nsdi-final.pdf)
> It gives a great outline of the basic requirements and characteristics
> of a checkpointed system, including a few of the performance issues.
> But Remus did not go far enough in the area of system performance for
> commercial production.
>
> This patch series addresses a known bottleneck and limitation in a
> checkpointed system: the use of large bitmaps to track dirty memory.
> These bitmaps are copied to userspace when userspace queries KVM for
> its dirty page information. The use of bitmaps makes sense in the
> live-migration method, as it is possible for all of memory to be dirtied
> from one log-dirty pass to another. But in a checkpointed system, the
> number of dirty pages is bounded: the VM is paused once it has
> dirtied a pre-defined number of pages. Traversing a large, sparsely
> populated bitmap to find set bits is time-consuming, as is copying the
> bitmap to userspace.
>
> The preferred data structure for performant checkpointing solutions is
> a dense list of guest frame numbers (GFNs). This patch series stores
> the dirty list in kernel memory that can be memory-mapped into
> userspace to allow speedy harvesting. Three new ioctls are implemented
> to allow userspace to set the dirty log size, to get the dirty count,
> and to rearm the dirty traps. See patch 4 for details.
>
> These modifications, and still more modifications to qemu, have
> allowed us to run checkpoint cycles at rates up to 2500 per second,
> while still allowing the VM to get useful work done.
>
> Design Goals
>
> The patch series does not change or remove any existing KVM
> functionality. It represents only additional functions (ioctls) into
> KVM from user space, and these changes coexist with the current dirty
> memory logging facilities. It is possible to run multiple guests such
> that some of the guests perform live migration using the existing
> memory logging mechanism and others migrate or run in fault-tolerant
> mode using the new memory tracking functions.

Hi, before I review the patches, I have a few notes on the organization
of the series and especially on the data structures.

Regarding the formatting of the patch series, please give different
subjects to the four patches. I see that you are using Office365 from
the headers; if you are using IMAP too, you may want to look into "git
imap-send".

Regarding the ordering of the patches, the documentation should be
included in the same patch that defines the new ioctls and data
structures.

Regarding the API, I think that KVM_GET_DIRTY_COUNT is not needed;
userspace can do the same computation directly. And since what you have
is basically a ring buffer, I would like to have it be *exactly* a ring
buffer. In particular:

- there's no need to always reset the indices to zero on
  KVM_RESET_DIRTY_PAGES.

- dirty_index should _not_ be in the user-space data structure. This
  makes it safer: for example, you are not masking the accesses through
  dirty_index, but userspace can put an out-of-bounds value there. Each
  update to dirty_index can also update avail_index immediately,
  without waiting for KVM_GET_DIRTY_COUNT.

- there are effectively two consumers (userspace and the kernel), and
  the kernel always lags behind userspace. So there should be two
  fetch_indexes: one that is read by KVM_RESET_DIRTY_PAGES (in
  gfn_list_t) and one that is written by KVM_RESET_DIRTY_PAGES
  (elsewhere). The kernel's copy of fetch_index is also consulted to
  check for a full ring buffer (with some leeway).
- KVM_RESET_DIRTY_PAGES need not take the mmu_lock (it's a spinlock, so
  you'd have to do things such as spin_needbreak); it only needs a
  mutex (kvm->slots_lock will do) to protect against concurrent resets.
  You only need to take kvm->mmu_lock around calls to
  kvm_arch_mmu_enable_log_dirty_pt_masked.

Because a ring buffer has a power-of-2 size, there is a problem of
where to put the indices without wasting a lot of memory for the
padding. There are two solutions that I can think of. One solution (a
bit hacky, but it works) is to define

    struct kvm_dirty_gfn {
            __u32 pad;
            __u32 slot;             /* as_id | slot_id */
            __u64 offset;
    };

    struct kvm_dirty_log_list {
            union {
                    struct {
                            __u16 prod;     /* avail_index */
                            __u16 cons;     /* user fetch_index */
                    } indices;
                    struct kvm_dirty_gfn dirty_gfns[0];
            };
    };

In other words, the indices are placed in the first four bytes of the
ring, which would be unused anyway, and everything is hidden behind an
unnamed union. With this change, KVM_GET_DIRTY_COUNT is not needed
anymore. KVM_RESET_DIRTY_PAGES takes the userspace fetch_index and
processes pages from the stored kernel fetch_index up to the userspace
fetch_index.

Please split the gfn_list_t data structure into separate files
virt/kvm/gfnring.c and include/linux/kvm_gfnring.h, with an abstract
data type and a proper split of the kernel and userspace data
structures.

    struct kvm_gfn_ring_t {
            u16 prod_index;         /* dirty_index */
            u16 cons_index;         /* kernel fetch_index */
            u32 size;

            /* This should be vmalloced. */
            struct kvm_dirty_log *data;
    };

    int kvm_gfn_ring_alloc(struct kvm_gfn_ring_t *gfnring, u32 size);

    /*
     * Called with kvm->slots_lock held; returns true
     * if at least one page was processed.
     */
    bool kvm_gfn_ring_reset(struct kvm *kvm,
                            struct kvm_gfn_ring_t *gfnring);

    /*
     * Returns the number of items in the FIFO after adding the element,
     * which is just prod_index - cons_index, or -EBUSY if it
     * was full and the element could not be added.
     */
    int kvm_gfn_ring_push(struct kvm *kvm,
                          struct kvm_gfn_ring_t *ring,
                          struct kvm_memory_slot *slot,
                          struct kvm_dirty_gfn *datum);

    /* for use in vm_operations_struct */
    struct page *kvm_gfn_ring_get_page(struct kvm_gfn_ring_t *ring,
                                       u32 i);

    void kvm_gfn_ring_free(struct kvm_gfn_ring_t *ring);

Defining this data type can then be the first patch in the series.

Paolo