Hi Liang,

This is very clear documentation of your work; I appreciate it a lot.
Below are some of my personal opinions and questions.

On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>I have sent the RFC version patch set for live migration optimization
>by skipping processing of the free pages in the ram bulk stage, and
>have received a lot of comments. The related threads can be found at:
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>

Actually there are two threads, a QEMU thread and a kernel thread. It
would be clearer for the audience if you listed just the first mail of
each of the two threads.

>To make things easier, I wrote this doc about the possible designs
>and my choices. Comments are welcome!
>
>Content
>=======
>1. Background
>2. Why not use virtio-balloon
>3. Virtio interface
>4. Constructing free page bitmap
>5. Tighten free page bitmap
>6. Handling page cache in the guest
>7. APIs for live migration
>8. Pseudo code
>
>Details
>=======
>1. Background
>As we know, in the ram bulk stage of live migration, the current QEMU
>live migration implementation marks all of the guest's RAM pages as
>dirtied. All these pages are first checked for being zero pages, and
>the page content is sent to the destination depending on the result of
>that check. This process consumes quite a lot of CPU cycles and
>network bandwidth.
>
>>From the guest's point of view, there are some pages currently not used by

I see that in your original RFC patch and your RFC doc, this line
starts with a '>' character. Not sure whether this one has a special
purpose?

>the guest; the guest doesn't care about the content of these pages.
>Free pages are exactly this kind of page, not used by the guest. We
>can make use of this fact and skip processing the free pages in the
>ram bulk stage, which saves a lot of CPU cycles, reduces the network
>traffic and speeds up the live migration process noticeably.
>
>Usually, only the guest has the information about its free pages. But
>it's possible to let the guest tell QEMU its free page information by
>some mechanism, e.g. through the virtio interface. Once QEMU gets the
>free page information, it can skip processing these free pages in the
>ram bulk stage by clearing the corresponding bits of the migration
>bitmap.
>
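By the way, just to check my understanding of the QEMU side: once the
free page bitmap is available, the bulk-stage skip would boil down to
masking it out of the migration dirty bitmap, roughly like the sketch
below. I reuse the filter_out_guest_free_pages() name from your pseudo
code further down, but the explicit parameters and the flat bitmap
layout are only my assumptions, not your actual patch:

 -----------------------------------------------
 /*
  * Hypothetical QEMU-side helper: clear the migration dirty bitmap
  * bits for every page the guest reported as free, so the ram bulk
  * stage skips them.  Names and layout are assumptions, not the real
  * QEMU API; both bitmaps are assumed to cover nr_pages bits.
  */
 #include <stdint.h>
 #include <limits.h>

 #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

 static void filter_out_guest_free_pages(unsigned long *migration_bitmap,
                                         const unsigned long *free_page_bitmap,
                                         uint64_t nr_pages)
 {
     uint64_t i, nr_longs = (nr_pages + BITS_PER_LONG - 1) / BITS_PER_LONG;

     for (i = 0; i < nr_longs; i++) {
         /* A page is sent only if it is dirty and not reported free. */
         migration_bitmap[i] &= ~free_page_bitmap[i];
     }
 }
 -----------------------------------------------
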
>2. Why not use virtio-balloon
>Actually, virtio-balloon can do a similar thing by inflating the
>balloon before live migration, but its performance is not good. For an
>8GB idle guest that has just booted, it takes about 5.7 sec to inflate
>the balloon to 7GB, while it only takes 25ms to get a valid free page
>bitmap from the guest. There are several reasons for the bad
>performance of virtio-balloon:
>a. allocating pages (5%, 304ms)
>b. sending PFNs to host (71%, 4194ms)
>c. address translation and madvise() operation (24%, 1423ms)
>Debugging shows that the time spent on each of these operations is
>listed in the brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX
>to a large value, such as 16384, the time spent on sending the PFNs
>can be reduced to about 400ms, but it's still too long.
>
>Obviously, the virtio-balloon mechanism has a bigger performance
>impact on the guest than the approach we are trying to implement.
>
>3. Virtio interface
>There are three different ways of using the virtio interface to
>send the free page information.
>a. Extend the current virtio device
>The virtio spec has already defined some virtio devices, and we can
>extend one of these devices so as to use it to transport the free page
>information. This requires modifying the virtio spec.
>
>b. Implement a new virtio device
>Implementing a brand new virtio device to exchange information
>between host and guest is another choice. It requires modifying the
>virtio spec too.
>
>c. Make use of virtio-serial (Amit's suggestion, my choice)
>It's possible to make use of virtio-serial for communication between
>host and guest; the benefit of this solution is that there is no need
>to modify the virtio spec.
>
>4. Construct free page bitmap
>To minimize the space for saving free page information, it's better to
>use a bitmap to describe the free pages. There are two ways to
>construct the free page bitmap.
>
>a. Construct free page bitmap on demand (My choice)
>The guest can allocate memory for the free page bitmap only when it
>receives the request from QEMU, and set the free page bitmap by
>traversing the free page list. The advantage of this way is that it's
>quite simple and easy to implement. The disadvantage is that the
>traversing operation may take quite a long time when there are a
>lot of free pages. (About 20ms for 7GB of free pages.)
>
>b. Update free page bitmap when allocating/freeing pages
>Another choice is to allocate the memory for the free page bitmap
>when the guest boots, and then update the free page bitmap when
>allocating/freeing pages. It needs more modification to the code
>related to memory management in the guest. The advantage of this way
>is that the guest can respond to QEMU's request for a free page bitmap
>very quickly, no matter how many free pages there are in the guest. Do
>the kernel guys like this?
>
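For option 4.a, I picture the guest side as a hibernation-style walk of
the zones' free lists, roughly like the sketch below. This is only how
I imagine it (the function name, locking and the bitmap_set()
granularity are my guesses), not your actual kernel patch:

 -----------------------------------------------
 /*
  * Guest/kernel-side sketch of option 4.a: walk every populated zone's
  * free lists under zone->lock and set the corresponding PFNs in a
  * caller-supplied bitmap.  Only a sketch, not the real patch.
  */
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/bitmap.h>

 static void construct_free_page_bitmap(unsigned long *bitmap,
                                        unsigned long max_pfn)
 {
     struct zone *zone;
     struct page *page;
     unsigned long pfn, flags;
     unsigned int order, t;

     for_each_populated_zone(zone) {
         spin_lock_irqsave(&zone->lock, flags);
         for_each_migratetype_order(order, t) {
             list_for_each_entry(page,
                                 &zone->free_area[order].free_list[t], lru) {
                 pfn = page_to_pfn(page);
                 /* A free block of 'order' covers 2^order contiguous pages. */
                 if (pfn + (1UL << order) <= max_pfn)
                     bitmap_set(bitmap, pfn, 1UL << order);
             }
         }
         spin_unlock_irqrestore(&zone->lock, flags);
     }
 }
 -----------------------------------------------
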
>5. Tighten the free page bitmap
>At last, the free page bitmap should be operated on together with
>ramlist.dirty_memory to filter out the free pages. We should make sure

In exec.c, the variable name is ram_list. If we use the same name in
the code and in the doc, it may be easier for the audience to
understand.

>that bit N in the free page bitmap and bit N in ramlist.dirty_memory
>correspond to the same guest page.
>On some archs, like x86, there are 'holes' in the physical memory
>address space, which means there are no actual physical RAM pages
>corresponding to some PFNs. So, some arch-specific information is
>needed to construct a proper free page bitmap.
>
>migration dirty page bitmap:
>    ---------------------
>    |a|b|c|d|e|f|g|h|i|j|
>    ---------------------
>loose free page bitmap:
>    -----------------------------
>    |a|b|c|d|e|f| | | | |g|h|i|j|
>    -----------------------------
>tight free page bitmap:
>    ---------------------
>    |a|b|c|d|e|f|g|h|i|j|
>    ---------------------
>
>There are two places for tightening the free page bitmap:
>a. In the guest
>Constructing the tight free page bitmap in the guest requires adding
>arch-related code in the guest. The advantage of this way is that less
>memory is needed to store the free page bitmap.
>b. In QEMU (My choice)
>Constructing the free page bitmap in QEMU is more flexible: we can get
>a loose free page bitmap which contains the holes, and then filter out
>the holes in QEMU. The advantage of this way is that we can keep the
>kernel code as simple as possible; the disadvantage is that more
>memory is needed to save the loose free page bitmap. Because this is
>mainly a QEMU feature, if possible, doing all the related things in
>QEMU is better.
>
>6. Handling page cache in the guest
>The memory used for page cache in the guest will change depending on
>the workload; if the guest runs some block-IO-intensive workload, there
>will

Would this improvement still help a lot when the guest has only a few
free pages? In your performance data, Case 2 mimics this kind of case,
but there the memory-consuming task is stopped before migration. If it
kept running, would we still perform better than before? I am
wondering whether it is possible to have a (configurable) threshold
that decides when to use the free page bitmap optimization.

>be lots of pages used for page cache and only a few free pages left in
>the guest. In order to get more free pages, we can choose to ask the
>guest to drop some page cache. Because dropping the page cache may
>lead to performance degradation, only the clean cache should be
>dropped, and we should let the user decide whether to do this.
>
>7. APIs for live migration
>To make things work, the following APIs should be implemented.
>
>a. Get memory info of the guest, like this:
>bool get_guest_mem_info(struct guest_mem_info *info)
>
>struct guest_mem_info is defined as below:
>
>struct guest_mem_info {
>    uint64_t free_pages_num;   /* guest's free page count */
>    uint64_t cached_pages_num; /* total cached page count */
>    uint64_t max_pfn;          /* the max PFN of the guest */
>};
>
>Return value:
>false, when QEMU or the guest can't support this operation.
>true, on success.
>
>b. Request the guest's current free page information.
>int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);
>
>Return value:
>-1, when QEMU or the guest can't support this operation.
>0, when the free page bitmap is still in the process of being constructed.
>1, when a valid free page bitmap is ready.
>
>c. Tighten the free page bitmap
>unsigned long *tighten_free_page_bmap(unsigned long *bitmap);
>
>This function is an arch-specific function that rebuilds the loose
>free page bitmap so as to get a tight bitmap which can be operated on
>easily with ramlist.dirty_memory.
>
>8. Pseudo code
>Dirty page logging should be enabled before getting the free page
>information from the guest. This is important because, during the
>process of getting free pages, some free pages may be used and written
>by the guest; dirty page logging can trace these pages. The pseudo
>code is like below:
>
> -----------------------------------------------
> MigrationState *s = migrate_get_current();
> ...
>
> memory_global_dirty_log_start();
>
> if (get_guest_mem_info(&info)) {
>     while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
>            s->state != MIGRATION_STATUS_CANCELLING) {
>         usleep(1000); /* sleep for 1 ms */
>     }
>
>     tight_free_page_bmap = tighten_free_page_bmap(free_page_bitmap);
>     filter_out_guest_free_pages(tight_free_page_bmap);
> }
>
> migration_bitmap_sync();
> ...
> -----------------------------------------------
>
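One more thought on the tighten step (5.b and API c): if the
arch-specific hole information is expressed as a list of real RAM
ranges, the QEMU-side rebuild could be as simple as the sketch below.
The RamRange table, the bit helpers and the signature (which differs
from your 7.c prototype because I pass the ranges explicitly) are
purely my assumptions for illustration; it also assumes the tight
bitmap starts out zeroed:

 -----------------------------------------------
 /*
  * Sketch of tightening: copy only the bits that belong to real RAM
  * ranges from the loose, guest-PFN-indexed bitmap into a tight bitmap
  * that lines up with ram_list.dirty_memory.  Illustration only.
  */
 #include <stdint.h>

 #define BITS_PER_ULONG (sizeof(unsigned long) * 8)

 typedef struct RamRange {
     uint64_t start_pfn;   /* first guest PFN of the range */
     uint64_t nr_pages;    /* number of pages in the range */
 } RamRange;

 static int test_bit_ul(const unsigned long *bmap, uint64_t nr)
 {
     return (bmap[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
 }

 static void set_bit_ul(unsigned long *bmap, uint64_t nr)
 {
     bmap[nr / BITS_PER_ULONG] |= 1UL << (nr % BITS_PER_ULONG);
 }

 /* Returns the number of bits written into the (pre-zeroed) tight bitmap. */
 static uint64_t tighten_free_page_bmap(unsigned long *tight,
                                        const unsigned long *loose,
                                        const RamRange *ranges, int nr_ranges)
 {
     uint64_t out = 0, j;
     int i;

     for (i = 0; i < nr_ranges; i++) {
         for (j = 0; j < ranges[i].nr_pages; j++, out++) {
             if (test_bit_ul(loose, ranges[i].start_pfn + j))
                 set_bit_ul(tight, out);
         }
     }
     return out;
 }
 -----------------------------------------------
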
--
Richard Yang
Help you, Help me