On 1/5/2017 5:16 AM, Paolo Bonzini wrote:
>
>
> On 04/01/2017 21:39, Cao, Lei wrote:
>> This patch series adds memory tracking support for performant
>> checkpointing solutions. It can also be used by live migration
>> to improve predictability.
>>
>> Introduction
>>
>> Brendan Cully's Remus project white paper is one of the best written on
>> the subject of fault tolerance using checkpoint/rollback techniques and
>> is the best place to start for a general background.
>> (http://www.cs.ubc.ca/~andy/papers/remus-nsdi-final.pdf)
>> It gives a great outline of the basic requirements and characteristics
>> of a checkpointed system, including a few of the performance issues.
>> But Remus did not go far enough in the area of system performance for
>> commercial production.
>>
>> This patch series addresses a known bottleneck and limitation in a
>> checkpointed system: the use of large bitmaps to track dirty memory.
>> These bitmaps are copied to userspace when userspace queries KVM for
>> its dirty page information. The use of bitmaps makes sense in the
>> live-migration method, as it is possible for all of memory to be dirtied
>> from one log-dirty pass to another. But in a checkpointed system, the
>> number of dirty pages is bounded: the VM is paused once it has
>> dirtied a pre-defined number of pages. Traversing a large, sparsely
>> populated bitmap to find the set bits is time-consuming, as is copying
>> the bitmap to userspace.
>>
>> The preferred data structure for performant checkpointing solutions is
>> a dense list of guest frame numbers (GFNs). This patch series stores
>> the dirty list in kernel memory that can be memory-mapped into
>> userspace to allow speedy harvesting. Three new ioctls are implemented
>> to allow userspace to set the dirty log size, to get the dirty count,
>> and to rearm the dirty traps. See patch 4 for details.
>>
>> These modifications, together with further modifications to qemu, have
>> allowed us to run checkpoint cycles at rates up to 2500 per second,
>> while still allowing the VM to get useful work done.
>>
>> Design Goals
>>
>> The patch series does not change or remove any existing KVM
>> functionality. It represents only additional functions (ioctls) into
>> KVM from user space, and these changes coexist with the current dirty
>> memory logging facilities. It is possible to run multiple guests such
>> that some of the guests perform live migration using the existing
>> memory logging mechanism and others migrate or run in fault tolerant
>> mode using the new memory tracking functions.
>
> Hi,
>
> before I review the patches, I have a few notes on the organization of
> the series and especially on the data structures.
>
> Regarding the formatting of the patch series, please give different
> subjects to the four patches. I see that you are using Office365 from
> the headers; if you are using IMAP too, you may want to look into "git
> imap-send".
>
> Regarding the ordering of the patches, the documentation should be
> included in the same patch that defines the new ioctls and data
> structure.
>
> Regarding the API, I think that KVM_GET_DIRTY_COUNT is not needed;
> userspace can do the same directly. And since what you have is
> basically a ring buffer, I would like to have it be *exactly* a ring
> buffer. In particular:
>
> - there is no need to always reset the indices to zero on
> KVM_RESET_DIRTY_PAGES.
>
> - dirty_index should _not_ be in the user-space data structure. This
> makes it safer - for example, you are not masking the accesses through
> dirty_index, but user-space could put an out-of-bounds value there.
> Each update to dirty_index can also update avail_index immediately,
> without waiting for KVM_GET_DIRTY_COUNT.
>
> - there are effectively two consumers (userspace and kernel) and the
> kernel always lags behind userspace.
> So there should be two
> fetch_indexes: one that is read by KVM_RESET_DIRTY_PAGES (in gfn_list_t)
> and one that is written by KVM_RESET_DIRTY_PAGES (kept elsewhere). The
> kernel's copy of fetch_index is also consulted to check for a full ring
> buffer (with some leeway).
>
> - KVM_RESET_DIRTY_PAGES need not take the mmu_lock (it's a spinlock, so
> you'd have to do things such as spin_needbreak); it only needs a mutex
> (kvm->slots_lock will do) to protect against concurrent resets. You
> only need to take kvm->mmu_lock around calls to
> kvm_arch_mmu_enable_log_dirty_pt_masked.
>
> Because a ring buffer has a power-of-2 size, there is a problem of
> where to put the indices without wasting a lot of memory for padding.
> There are two solutions that I can think of.
>
> One solution (a bit hacky, but it works) is to define
>
>     struct kvm_dirty_gfn {
>         __u32 pad;
>         __u32 slot;  /* as_id | slot_id */
>         __u64 offset;
>     };
>
>     struct kvm_dirty_log_list {
>         union {
>             struct {
>                 __u16 prod;  /* avail_index */
>                 __u16 cons;  /* user fetch_index */
>             } indices;
>             struct kvm_dirty_gfn dirty_gfns[0];
>         };
>     };
>
> In other words, the indices are placed in the first four bytes of the
> ring, which would be unused anyway (they overlap the pad field of the
> first kvm_dirty_gfn). And everything is hidden behind an unnamed union.
>
> With this change, KVM_GET_DIRTY_COUNT is not needed anymore.
> KVM_RESET_DIRTY_PAGES takes the userspace fetch_index and processes
> pages from the stored kernel fetch_index up to the userspace
> fetch_index.
>
> Please split the gfn_list_t data structure into separate files
> virt/kvm/gfnring.c and include/linux/kvm_gfnring.h, with an abstract
> data type and a proper split of the kernel and userspace data
> structures.
>
>     struct kvm_gfn_ring_t {
>         u16 prod_index;  /* dirty_index */
>         u16 cons_index;  /* kernel fetch_index */
>         u32 size;
>
>         /* This should be vmalloced. */
>         struct kvm_dirty_gfn *data;
>     };
>
>     int kvm_gfn_ring_alloc(struct kvm_gfn_ring_t *gfnring,
>                            u32 size);
>
>     /*
>      * called with kvm->slots_lock held, returns true
>      * if at least one page was processed.
>      */
>     bool kvm_gfn_ring_reset(struct kvm *kvm,
>                             struct kvm_gfn_ring_t *gfnring);
>
>     /*
>      * returns number of items in FIFO after adding the element,
>      * which is just prod_index - cons_index, or -EBUSY if it
>      * was full and the element could not be added.
>      */
>     int kvm_gfn_ring_push(struct kvm *kvm,
>                           struct kvm_gfn_ring_t *ring,
>                           struct kvm_memory_slot *slot,
>                           struct kvm_dirty_gfn *datum);
>
>     /* for use in vm_operations_struct */
>     struct page *kvm_gfn_ring_get_page(struct kvm_gfn_ring_t *ring,
>                                        u32 i);
>
>     void kvm_gfn_ring_free(struct kvm_gfn_ring_t *ring);
>
> Defining this data type can then be the first patch in the series.
>
> Paolo

Thanks a lot for all the feedback! I'll rework the patches.

Lei