Re: [RFC Design Doc]Speed up live migration by skipping free pages

"Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx> · Tue, 22 Mar 2016 19:05:31 +0000

* Liang Li (liang.z.li@xxxxxxxxx) wrote:
> I have sent the RFC version patch set for live migration optimization
> by skipping processing the free pages in the ram bulk stage and
> received a lot of comments. The related threads can be found at:

Thanks!

> Obviously, the virtio-balloon mechanism has a bigger performance
> impact on the guest than the approach we are trying to implement.

Yeh, we should separately try and fix that; if it's that slow then
people will be annoyed about it when they're just using it for balloon.

> 3. Virtio interface
> There are three different ways of using the virtio interface to
> send the free page information.
> a. Extend the current virtio device
> The virtio spec has already defined some virtio devices, and we can
> extend one of these devices so as to use it to transport the free page
> information. It requires modifying the virtio spec.
> 
> b. Implement a new virtio device
> Implementing a brand new virtio device to exchange information
> between host and guest is another choice. It requires modifying the
> virtio spec too.

If the right solution is to change the spec then we should do it;
we shouldn't use a technically worse solution just to avoid the spec
change; although we have to be even more careful to get the right
solution if we want to change the spec.

> c. Make use of virtio-serial (Amit’s suggestion, my choice)
> It’s possible to make use of virtio-serial for communication between
> host and guest; the benefit of this solution is that there is no need
> to modify the virtio spec.
> 
> 4. Construct free page bitmap
> To minimize the space for saving free page information, it’s better to
> use a bitmap to describe the free pages. There are two ways to
> construct the free page bitmap.
> 
> a. Construct free page bitmap on demand (My choice)
> Guest can allocate memory for the free page bitmap only when it
> receives the request from QEMU, and set the free page bitmap by
> traversing the free page list. The advantage of this approach is that
> it’s quite simple and easy to implement. The disadvantage is that the
> traversal may take quite a long time when there are a lot of free
> pages. (About 20ms for 7GB of free pages)

I wonder how that scales; 20ms isn't too bad - but I'm more worried about
what happens when someone does it to the 1TB database VM.
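
Back of the envelope, and only a sketch: if the traversal scaled linearly
(which nothing above establishes, it's purely my assumption), and treating
your 7GB as 7GiB and the pages as 4KiB, the numbers would come out
something like this. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a bitmap holding one bit per page of RAM. */
static uint64_t bitmap_bytes(uint64_t ram_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    uint64_t pages = ram_bytes / page_size;
    return (pages + 7) / 8;   /* round up to whole bytes */
}

/*
 * Linear extrapolation (ASSUMPTION: traversal cost is proportional to
 * the amount of free memory) from one measured data point.
 */
static uint64_t scan_time_us(uint64_t free_bytes,
                             uint64_t ref_bytes, uint64_t ref_us)
{
    assert(ref_bytes != 0);
    return free_bytes * ref_us / ref_bytes;
}
```

With 20ms for 7GiB as the reference, scan_time_us() for 1TiB of free
memory gives a bit over 2.9 seconds, and bitmap_bytes() for 1TiB of 4KiB
pages is 32MiB. Real free lists probably won't behave that nicely, so
it's worth measuring.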

> b. Update free page bitmap when allocating/freeing pages
> Another choice is to allocate the memory for the free page bitmap
> when guest boots, and then update the free page bitmap when
> allocating/freeing pages. It needs more modification to the code
> related to memory management in guest. The advantage of this approach
> is that guest can respond to QEMU’s request for a free page bitmap very
> quickly, no matter how many free pages there are in the guest. Do the
> kernel guys like this?
> 
> 5. Tighten the free page bitmap
> Finally, the free page bitmap should be combined with
> ramlist.dirty_memory to filter out the free pages. We should make sure
> that bit N in the free page bitmap and bit N in
> ramlist.dirty_memory correspond to the same guest page.
> On some architectures, like x86, there are ‘holes’ in the physical
> address space, which means there are no actual physical RAM pages
> corresponding to some PFNs. So, some arch specific information is
> needed to construct a proper free page bitmap.
> 
> migration dirty page bitmap:
>     ---------------------
>     |a|b|c|d|e|f|g|h|i|j|
>     ---------------------
> loose free page bitmap:
>     -----------------------------  
>     |a|b|c|d|e|f| | | | |g|h|i|j|
>     -----------------------------
> tight free page bitmap:
>     ---------------------
>     |a|b|c|d|e|f|g|h|i|j|
>     ---------------------
> 
> There are two places for tightening the free page bitmap:
> a. In guest 
> Constructing the free page bitmap in guest requires adding the arch
> related code in guest for building a tight bitmap. The advantage of
> this way is that less memory is needed to store the free page bitmap.
> b. In QEMU (My choice)
> Constructing the free page bitmap in QEMU is more flexible: we can get
> a loose free page bitmap which contains the holes, and then filter out
> the holes in QEMU. The advantage of this approach is that we can keep
> the kernel code as simple as we can; the disadvantage is that more
> memory is needed to save the loose free page bitmap. Because this is
> mainly a QEMU feature, it is better to do all the related things in
> QEMU if possible.

Yes, maybe; although we'd have to be careful to validate what the guest
fills in makes sense.
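
By 'validate' I mean something along these lines; a minimal sketch, where
free_bitmap_in_range() and npages are names I've made up, and npages is
assumed to be the number of page frames QEMU knows the guest can have:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sanity check: reject a guest-supplied free page bitmap
 * that marks any page frame at or beyond npages as free.
 */
static bool free_bitmap_in_range(const uint64_t *bmap, size_t nwords,
                                 uint64_t npages)
{
    assert(bmap != NULL);
    for (size_t w = 0; w < nwords; w++) {
        uint64_t word = bmap[w];
        /* Walk only as far as the highest set bit in this word. */
        for (unsigned b = 0; word != 0; b++, word >>= 1) {
            if ((word & 1) && (uint64_t)w * 64 + b >= npages) {
                return false;
            }
        }
    }
    return true;
}
```

If that fails we'd just ignore the bitmap and migrate everything, rather
than trust a guest that's telling us nonsense.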

> 6. Handling page cache in the guest
> The memory used for page cache in the guest changes depending on the
> workload. If the guest runs a block IO intensive workload, lots of
> pages will be used for page cache and only a few free pages will be
> left in the guest. In order to get more free pages, we can choose to
> ask the guest to drop some page cache. Because dropping the page cache
> may lead to performance degradation, only the clean cache should be
> dropped and we should let the user decide whether to do this.
> 
> 7. APIs for live migration
> To make things work, the following APIs should be implemented.
> 
> a. Get memory info of the guest, like this:
> bool get_guest_mem_info(struct guest_mem_info *info);
>
> struct guest_mem_info is defined as below:
>
> struct guest_mem_info {
>     uint64_t free_pages_num;    // guest’s free pages count
>     uint64_t cached_pages_num;  // total cached pages count
>     uint64_t max_pfn;           // the max pfn of the guest
> };

What do you need max_pfn for?

(We'll also have to think how hotplugged memory works with this).
Also be careful of how big a page is;  some architectures
can choose between different guest page sizes (4, 16, 64k I think on ARM),
so we just need to make sure what unit we're dealing with.  That size
is also not necessarily the same as the unit size of the migration
bitmap; this is always a bit tricky.
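
To show what I mean about units, here's a rough sketch of converting a
bitmap in guest-page units into one in migration-bitmap units. All the
names are hypothetical, the shifts are log2 of the two page sizes, and
I'm assuming the caller hands in a zeroed, big enough target bitmap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool test_bit64(const uint64_t *bmap, uint64_t n)
{
    return (bmap[n / 64] >> (n % 64)) & 1;
}

static void set_bit64(uint64_t *bmap, uint64_t n)
{
    bmap[n / 64] |= UINT64_C(1) << (n % 64);
}

/*
 * Translate a free bitmap in guest pages (1 << guest_shift bytes each)
 * into target pages (1 << target_shift bytes each). A target page is
 * marked free only if every guest page overlapping it is free, so a
 * mismatch in sizes can only make us send more, never less.
 */
static void rescale_free_bitmap(const uint64_t *guest, uint64_t guest_pages,
                                unsigned guest_shift,
                                uint64_t *target, unsigned target_shift)
{
    assert(guest_shift < 64 && target_shift < 64);
    uint64_t bytes = guest_pages << guest_shift;
    uint64_t target_pages = bytes >> target_shift;

    for (uint64_t t = 0; t < target_pages; t++) {
        uint64_t first = (t << target_shift) >> guest_shift;
        uint64_t last = (((t + 1) << target_shift) - 1) >> guest_shift;
        bool all_free = true;

        for (uint64_t g = first; g <= last; g++) {
            if (!test_bit64(guest, g)) {
                all_free = false;
                break;
            }
        }
        if (all_free) {
            set_bit64(target, t);
        }
    }
}
```

So a 64k guest on a 4k migration bitmap turns one free bit into 16, and
a 4k guest on an 8k bitmap needs both halves free before we skip the
page.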

> Return value:
> false, when QEMU or guest can’t support this operation.
> true, on success.
> 
> b. Request guest’s current free pages information.
> int get_free_page_bmap(unsigned long *bitmap,  bool drop_cache);
> 
> Return value:
> -1, when QEMU or guest can’t support this operation.
> 0, when the free page bitmap is still being constructed.
> 1, when a valid free page bitmap is ready.

I suggest not using 'long' - I know we do it a lot in QEMU but it's a pain;
let's nail this down to a uint64_t and then we don't have to worry about
what the guest is running.

> c. Tighten the free page bitmap
> unsigned long * tighten_free_page_bmap(unsigned long *bitmap);
> 
> This function is an arch specific function that rebuilds the loose free
> page bitmap into a tight bitmap which can be combined easily
> with ramlist.dirty_memory.

I'm not sure you actually need this; as long as what you expect is just
a (small) series of chunks of bitmap; then you'd just have
something like (start at 0... ) (start at 1MB...) (start at 1GB....)
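
A minimal sketch of that chunk idea, just to make it concrete; struct
free_chunk and pfn_is_free() are names I've invented here, not anything
that exists in QEMU or the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of page frames with its own piece of bitmap. */
struct free_chunk {
    uint64_t start_pfn;     /* first page frame covered by this chunk */
    uint64_t npages;        /* number of page frames covered */
    const uint64_t *bits;   /* bit i set => page start_pfn + i is free */
};

/* Look up one page frame; anything in a hole is treated as in use. */
static bool pfn_is_free(const struct free_chunk *chunks, size_t nchunks,
                        uint64_t pfn)
{
    assert(chunks != NULL || nchunks == 0);
    for (size_t i = 0; i < nchunks; i++) {
        const struct free_chunk *c = &chunks[i];

        if (pfn >= c->start_pfn && pfn - c->start_pfn < c->npages) {
            uint64_t n = pfn - c->start_pfn;
            return (c->bits[n / 64] >> (n % 64)) & 1;
        }
    }
    return false;
}
```

With that the holes never appear in any bitmap, so there's nothing arch
specific to squeeze out afterwards.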

> 8. Pseudo code 
> Dirty page logging should be enabled before getting the free page
> information from guest, this is important because during the process
> of getting free pages, some free pages may be used and written by the
> guest, dirty page logging can trace these pages. The pseudo code is
> like below:
> 
>     -----------------------------------------------
>     MigrationState *s = migrate_get_current();
>     ...
> 
>     memory_global_dirty_log_start();
> 
>     if (get_guest_mem_info(&info)) {
>         while (!get_free_page_bmap(free_page_bitmap,  drop_page_cache) &&
>                s->state != MIGRATION_STATUS_CANCELLING) {
>             usleep(1000); // sleep for 1 ms
>         }
> 
>         tight_bitmap = tighten_free_page_bmap(free_page_bitmap);
>         filter_out_guest_free_pages(tight_bitmap);
>     }

Given the typical speed of networks; it wouldn't do too much
harm to start sending assuming all pages are dirty and then
when the guest finally gets around to finishing the bitmap
then update, so it's asynchronous - and then if the guest
never responds we don't really care.
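
The 'then update' step could be as simple as the sketch below, assuming
the two bitmaps already use the same unit and the same indexing, and
with drop_free_pages() being a made-up name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Called whenever the guest's free page bitmap finally turns up:
 * clear every still-to-send page that the guest reported as free and
 * return how many pages that saved us.
 */
static uint64_t drop_free_pages(uint64_t *to_send, const uint64_t *free_bmap,
                                size_t nwords)
{
    uint64_t skipped = 0;

    assert(to_send != NULL && free_bmap != NULL);
    for (size_t i = 0; i < nwords; i++) {
        uint64_t hit = to_send[i] & free_bmap[i];

        to_send[i] &= ~hit;
        /* Count the bits we just cleared. */
        for (; hit != 0; hit &= hit - 1) {
            skipped++;
        }
    }
    return skipped;
}
```

Pages we've already sent have their bit clear anyway, so a late bitmap
just saves less; and anything the guest wrote after dirty logging
started gets marked again at the next sync, as you describe above.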

Dave

> 
>     migration_bitmap_sync();
>     ...
> 
>     -----------------------------------------------
> 
> 
> -- 
> 1.9.1
> 
--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK