Re: [RFC Design Doc]Speed up live migration by skipping free pages

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Tue, 22 Mar 2016 12:11:16 +0200

On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
> I have sent the RFC version patch set for live migration optimization
> by skipping processing the free pages in the ram bulk stage and
> received a lot of comments. The related threads can be found at:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
> 
> To make things easier, I wrote this doc about the possible designs
> and my choices. Comments are welcome! 

Thanks for putting this together, and especially for taking the trouble
to benchmark existing code paths!

I think these numbers do show that there are gains to be had from merging your code
with the existing balloon device. It will probably be a bit more work,
but I think it'll be worth it.

More comments below.

> Content
> =======
> 1. Background
> 2. Why not use virtio-balloon
> 3. Virtio interface
> 4. Constructing free page bitmap
> 5. Tighten free page bitmap
> 6. Handling page cache in the guest
> 7. APIs for live migration
> 8. Pseudo code 
> 
> Details
> =======
> 1. Background
> As we know, in the ram bulk stage of live migration, the current QEMU
> implementation marks all of the guest's RAM pages as dirty. Each page
> is first checked for being a zero page, and depending on the result
> its content is sent to the destination. This process consumes quite a
> lot of CPU cycles and network bandwidth.
> 
> From the guest's point of view, some pages are currently unused, and
> the guest doesn't care about their content. Free pages are pages of
> this kind. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, which saves a lot of CPU cycles, reduces
> network traffic, and noticeably speeds up the live migration process.
> 
> Usually, only the guest has the information about its free pages. But
> it’s possible to let the guest tell QEMU its free page information by
> some mechanism, e.g. through the virtio interface. Once QEMU gets the
> free page information, it can skip processing these free pages in the
> ram bulk stage by clearing the corresponding bits of the migration
> bitmap.
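
The bit-clearing step described above can be sketched as follows. This
is only an illustration: it assumes both bitmaps are arrays of unsigned
long with one bit per page and the same bit-to-page mapping, and
skip_free_pages() is a hypothetical name, not an existing QEMU function.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: clear every bit in the migration bitmap whose
 * page is marked free, so that the bulk stage skips it. Bit N of both
 * bitmaps is assumed to describe the same guest page.
 */
void skip_free_pages(unsigned long *migration_bitmap,
                     const unsigned long *free_bitmap,
                     size_t nr_pages)
{
    size_t nr_words = (nr_pages + BITS_PER_LONG - 1) / BITS_PER_LONG;

    assert(migration_bitmap != NULL && free_bitmap != NULL);
    for (size_t i = 0; i < nr_words; i++) {
        /* Keep a page only if it is dirty and not free. */
        migration_bitmap[i] &= ~free_bitmap[i];
    }
}
```
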
> 
> 2. Why not use virtio-balloon 
> Actually, virtio-balloon can do a similar thing by inflating the
> balloon before live migration, but its performance is poor: for an
> idle 8GB guest that has just booted, it takes about 5.7 s to inflate
> the balloon to 7GB, but only 25ms to get a valid free page bitmap
> from the guest. There are several reasons for the poor performance of
> virtio-balloon:
> a. allocating pages (5%, 304ms)

Interesting. This is definitely worth improving in guest kernel.
Also, will it be faster if we allocate and pass to guest huge pages instead?
Might speed up madvise as well.

> b. sending PFNs to host (71%, 4194ms)

OK, so we probably should teach balloon to pass huge lists in bitmaps.
Will be beneficial for regular balloon operation, as well.

> c. address translation and madvise() operation (24%, 1423ms)

How is this split between translation and madvise?  I suspect it's
mostly madvise since you need translation when using bitmap as well.
Correct? Could you measure this please?  Also, what if we use the new
MADV_FREE instead?  By how much would this help?

Finally, we could teach balloon to skip madvise completely.
By how much would this help?

> The time spent on each operation, measured while debugging, is listed
> in the brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
> large value, such as 16384, the time spent on sending the PFNs can be
> reduced to about 400ms, but that’s still too long.
> Obviously, the virtio-balloon mechanism has a bigger performance
> impact on the guest than the approach we are trying to implement.
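
For a sense of scale, here is a rough calculation of how many PFN
arrays the balloon has to send versus how large a bitmap would be. It
assumes 4 KiB pages, reads the GB figures above as GiB, and assumes the
current default of VIRTIO_BALLOON_ARRAY_PFNS_MAX is 256 (worth checking
against your tree). The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages in mem_bytes of RAM for a given page size. */
uint64_t pages_in(uint64_t mem_bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return mem_bytes / page_size;
}

/* Number of PFN arrays needed to report nr_pages, rounding up. */
uint64_t pfn_arrays_needed(uint64_t nr_pages, uint64_t pfns_per_array)
{
    assert(pfns_per_array > 0);
    return (nr_pages + pfns_per_array - 1) / pfns_per_array;
}

/* Size in bytes of a one-bit-per-page bitmap covering nr_pages. */
uint64_t bitmap_bytes(uint64_t nr_pages)
{
    return (nr_pages + 7) / 8;
}
```

Under those assumptions, inflating to 7 GiB covers 1835008 pages, which
is 7168 arrays of 256 PFNs or 112 arrays of 16384 PFNs, while a bitmap
covering the whole 8 GiB guest is 262144 bytes (256 KiB).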

Since, as we see, some of the new interfaces might be
beneficial to balloon as well, I am rather of the opinion that
extending the balloon (basically 3a) might be the right thing to do.

> 3. Virtio interface
> There are three different ways of using the virtio interface to
> send the free page information.
> a. Extend the current virtio device
> The virtio spec has already defined some virtio devices, and we can
> extend one of these devices so as to use it to transport the free page
> information. It requires modifying the virtio spec.

You don't have to do it all by yourself by the way.
Submit the proposal to the oasis virtio tc mailing list,
we will take it from there.

> b. Implement a new virtio device
> Implementing a brand new virtio device to exchange information
> between host and guest is another choice. It requires modifying the
> virtio spec too.
> 
> c. Make use of virtio-serial (Amit’s suggestion, my choice)
> It’s possible to make use of virtio-serial for communication between
> host and guest. The benefit of this solution is that there is no need
> to modify the virtio spec.
> 
> 4. Constructing free page bitmap
> To minimize the space for saving free page information, it’s better to
> use a bitmap to describe the free pages. There are two ways to
> construct the free page bitmap.
> 
> a. Construct free page bitmap on demand (My choice)
> The guest can allocate memory for the free page bitmap only when it
> receives the request from QEMU, and fill in the bitmap by
> traversing the free page list. The advantage of this approach is that
> it’s quite simple and easy to implement. The disadvantage is that the
> traversal may take quite a long time when there are a lot of free
> pages (about 20ms for 7GB of free pages).
> 
> b. Update free page bitmap when allocating/freeing pages
> Another choice is to allocate the memory for the free page bitmap
> when the guest boots, and then update the bitmap whenever pages are
> allocated or freed. It needs more modification to the memory
> management code in the guest. The advantage of this approach is
> that the guest can respond to QEMU’s request for a free page bitmap
> very quickly, no matter how many free pages there are in the guest.
> Do the kernel guys like this?
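
To make the on-demand option (4a) concrete, here is a simplified
userspace model, not kernel code. It assumes a buddy-style free list
where each entry covers 2^order contiguous pages and entries do not
overlap. struct free_block and build_free_page_bitmap() are
hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical free-list entry: 2^order contiguous free pages. */
struct free_block {
    uint64_t start_pfn;
    unsigned int order;
};

/*
 * Build a free page bitmap on demand by walking the free list.
 * The bitmap must hold at least max_pfn bits (PFNs 0..max_pfn-1).
 * Returns the number of free pages recorded.
 */
uint64_t build_free_page_bitmap(const struct free_block *blocks,
                                size_t nr_blocks,
                                unsigned long *bitmap, uint64_t max_pfn)
{
    size_t nr_words = (max_pfn + BITS_PER_LONG - 1) / BITS_PER_LONG;
    uint64_t nr_free = 0;

    assert(blocks != NULL && bitmap != NULL);
    memset(bitmap, 0, nr_words * sizeof(*bitmap));

    for (size_t i = 0; i < nr_blocks; i++) {
        uint64_t end = blocks[i].start_pfn + (1ULL << blocks[i].order);

        /* Set one bit per page in this block, clamped to max_pfn. */
        for (uint64_t pfn = blocks[i].start_pfn;
             pfn < end && pfn < max_pfn; pfn++) {
            bitmap[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
            nr_free++;
        }
    }
    return nr_free;
}
```

The cost is proportional to the number of free pages, which matches the
disadvantage noted for (a). Option (b) would instead flip single bits
on the allocation and free paths.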
> 
> 5. Tighten the free page bitmap
> Finally, the free page bitmap should be combined with
> ramlist.dirty_memory to filter out the free pages. We should make
> sure that bit N in the free page bitmap and bit N in
> ramlist.dirty_memory correspond to the same guest page.
> On some architectures, like x86, there are ‘holes’ in the physical
> address space, which means no actual physical RAM pages correspond
> to some PFNs. So some arch-specific information is needed to
> construct a proper free page bitmap.
> 
> migration dirty page bitmap:
>     ---------------------
>     |a|b|c|d|e|f|g|h|i|j|
>     ---------------------
> loose free page bitmap:
>     -----------------------------  
>     |a|b|c|d|e|f| | | | |g|h|i|j|
>     -----------------------------
> tight free page bitmap:
>     ---------------------
>     |a|b|c|d|e|f|g|h|i|j|
>     ---------------------
> 
> There are two places for tightening the free page bitmap:
> a. In guest 
> Constructing the tight free page bitmap in the guest requires adding
> arch-specific code to the guest. The advantage of this approach is
> that less memory is needed to store the free page bitmap.
> b. In QEMU (My choice)
> Constructing the tight free page bitmap in QEMU is more flexible: we
> can get a loose free page bitmap which contains the holes, and then
> filter out the holes in QEMU. The advantage of this approach is that
> we can keep the kernel code as simple as possible; the disadvantage
> is that more memory is needed to store the loose free page bitmap.
> Because this is mainly a QEMU feature, it is better to do all the
> related work in QEMU if possible.
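
The tightening step can be sketched like this, assuming the arch code
can supply a sorted, non-overlapping list of real RAM ranges. The
struct ram_range type and the tighten_loose_bitmap() signature are
hypothetical and differ from the tighten_free_page_bmap() API proposed
in section 7.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical description of one contiguous run of real RAM. */
struct ram_range {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/*
 * Copy the bits for each RAM range from the loose bitmap (indexed by
 * PFN, holes included) into the tight bitmap (holes squeezed out).
 * Ranges must be sorted by start_pfn and must not overlap.
 * Returns the number of bits written to the tight bitmap.
 */
uint64_t tighten_loose_bitmap(const unsigned long *loose,
                              const struct ram_range *ranges,
                              size_t nr_ranges,
                              unsigned long *tight)
{
    uint64_t out = 0;

    assert(loose != NULL && ranges != NULL && tight != NULL);
    for (size_t i = 0; i < nr_ranges; i++) {
        for (uint64_t j = 0; j < ranges[i].nr_pages; j++, out++) {
            uint64_t pfn = ranges[i].start_pfn + j;
            unsigned long in_mask = 1UL << (pfn % BITS_PER_LONG);
            unsigned long out_mask = 1UL << (out % BITS_PER_LONG);

            if (loose[pfn / BITS_PER_LONG] & in_mask) {
                tight[out / BITS_PER_LONG] |= out_mask;
            } else {
                tight[out / BITS_PER_LONG] &= ~out_mask;
            }
        }
    }
    return out;
}
```

With the layout from the diagram (six pages, a four-page hole, then
four pages), the ranges would be {0, 6} and {10, 4}, and the 14-bit
loose bitmap becomes a 10-bit tight one. Bits that fall inside the hole
are ignored.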
> 
> 6. Handling page cache in the guest
> The memory used for page cache in the guest changes depending on the
> workload. If the guest runs a block-IO-intensive workload, lots of
> pages will be used for page cache and only a few free pages will be
> left in the guest. In order to get more free pages, we can optionally
> ask the guest to drop some page cache. Because dropping the page cache
> may lead to performance degradation, only clean cache should be
> dropped, and we should let the user decide whether to do this.
> 
> 7. APIs for live migration
> To make things work, the following APIs should be implemented.
> 
> a. Get memory info of the guest, like this:
> bool get_guest_mem_info(struct guest_mem_info *info);
> 
> struct guest_mem_info is defined as below:
> 
> struct guest_mem_info {
>     uint64_t free_pages_num;    // guest’s free pages count
>     uint64_t cached_pages_num;  // total cached pages count
>     uint64_t max_pfn;           // the max pfn of the guest
> };
> 
> Return value:
> false, when QEMU or guest can’t support this operation.
> true, on success.
> 
> b. Request guest’s current free pages information.
> int get_free_page_bmap(unsigned long *bitmap,  bool drop_cache);
> 
> Return value:
> -1, when QEMU or guest can’t support this operation.
> 0, when the free page bitmap is still being constructed.
> 1, when a valid free page bitmap is ready.
> 
> c. Tighten the free page bitmap
> unsigned long * tighten_free_page_bmap(unsigned long *bitmap);
> 
> This is an arch-specific function that rebuilds the loose free
> page bitmap into a tight bitmap which can be combined directly
> with ramlist.dirty_memory.
> 
> 8. Pseudo code 
> Dirty page logging should be enabled before getting the free page
> information from the guest. This is important because, while the free
> page information is being collected, some free pages may be allocated
> and written by the guest; dirty page logging catches these pages. The
> pseudo code is like below:
> 
>     -----------------------------------------------
>     MigrationState *s = migrate_get_current();
>     ...
> 
>     memory_global_dirty_log_start();
> 
>     if (get_guest_mem_info(&info)) {
>         while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
>                s->state != MIGRATION_STATUS_CANCELLING) {
>             usleep(1000); // sleep for 1 ms
>         }
> 
>         tight_bitmap = tighten_free_page_bmap(free_page_bitmap);
>         filter_out_guest_free_pages(tight_bitmap);
>     }
> 
>     migration_bitmap_sync();
>     ...
> 
>     -----------------------------------------------

I don't completely agree with this part.  In my opinion, it should be
asynchronous, depending on getting page lists from guest:

anywhere/periodically:
	...
	request_guest_mem_info
	...

later:

	handle_guest_mem_info()
	{
		address_space_sync_dirty_bitmap
		filter_out_guest_free_pages
	}

As long as we filter with the VCPU stopped like this, we can drop the
sync dirty stage; alternatively, we could move filter_out_guest_free_pages
into a bh so it happens later while the VCPU is running.

This removes any need for waiting.

Introducing delay into migration might still be beneficial,
but this way it is optional: we still get part of
the benefit even if we don't wait long enough.
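
A toy model of that last point, to show why a late report still helps.
It is a sketch only: the toy_* names and the struct are hypothetical,
the bitmap is a plain unsigned long array with one bit per page, and
the dirty bitmap sync is left out entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Toy migration state: one bit per page still to be sent. */
struct toy_migration {
    unsigned long *to_send;
    size_t nr_pages;
    size_t cursor;  /* next page to look at */
    size_t sent;    /* pages actually transmitted */
};

/* Send up to budget pending pages, starting at the cursor. */
void toy_send_some(struct toy_migration *m, size_t budget)
{
    assert(m != NULL);
    while (budget > 0 && m->cursor < m->nr_pages) {
        size_t pfn = m->cursor++;
        unsigned long mask = 1UL << (pfn % BITS_PER_LONG);

        if (m->to_send[pfn / BITS_PER_LONG] & mask) {
            m->to_send[pfn / BITS_PER_LONG] &= ~mask;
            m->sent++;
            budget--;
        }
    }
}

/* Runs whenever the guest's free page report arrives, early or late. */
void toy_handle_guest_mem_info(struct toy_migration *m,
                               const unsigned long *free_bitmap)
{
    size_t nr_words = (m->nr_pages + BITS_PER_LONG - 1) / BITS_PER_LONG;

    assert(m != NULL && free_bitmap != NULL);
    for (size_t i = 0; i < nr_words; i++) {
        m->to_send[i] &= ~free_bitmap[i];
    }
}
```

With 16 pending pages of which the upper 8 are free, a report that
arrives after 4 pages were sent leaves 8 pages sent in total, and one
that arrives after 12 pages were sent leaves 12. Neither case needs the
migration thread to wait for the guest.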
