Hi Liang,

This is very clear documentation of your work; I appreciate it a lot.
Below are some of my personal opinions and questions.

On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>I have sent the RFC version patch set for live migration optimization
>by skipping processing of the free pages in the ram bulk stage, and
>have received a lot of comments. The related threads can be found at:
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>

Actually there are two threads, a QEMU thread and a kernel thread. It
would be clearer for the audience if you listed just the first mail of
each of the two threads.

>To make things easier, I wrote this doc about the possible designs
>and my choices. Comments are welcome!
>
>Content
>=======
>1. Background
>2. Why not use virtio-balloon
>3. Virtio interface
>4. Constructing free page bitmap
>5. Tighten free page bitmap
>6. Handling page cache in the guest
>7. APIs for live migration
>8. Pseudo code
>
>Details
>=======
>1. Background
>As we know, in the ram bulk stage of live migration, the current QEMU
>live migration implementation marks all of the guest's RAM pages as
>dirtied. All these pages are first checked for being zero pages, and
>the page content is sent to the destination depending on the result of
>that check. This process consumes quite a lot of CPU cycles and
>network bandwidth.
>
>>From the guest's point of view, there are some pages currently not used by

I see that in your original RFC patch and your RFC doc, this line
starts with a '>' character. Not sure whether this one has a special
purpose?

>the guest; the guest doesn't care about the content of these pages.
>Free pages are exactly this kind of page, not used by the guest. We
>can make use of this fact and skip processing the free pages in the
>ram bulk stage, which saves a lot of CPU cycles, reduces the network
>traffic and speeds up the live migration process noticeably.
>
>Usually, only the guest has the information about its free pages. But
>it's possible to let the guest tell QEMU its free page information by
>some mechanism, e.g. through the virtio interface. Once QEMU gets the
>free page information, it can skip processing these free pages in the
>ram bulk stage by clearing the corresponding bits of the migration
>bitmap.
>
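By the way, just to check my understanding of the QEMU side: once the
free page bitmap is available, the bulk-stage skip would boil down to
masking it out of the migration dirty bitmap, roughly like the sketch
below. I reuse the filter_out_guest_free_pages() name from your pseudo
code further down, but the explicit parameters and the flat bitmap
layout are only my assumptions, not your actual patch:

 -----------------------------------------------
 /*
  * Hypothetical QEMU-side helper: clear the migration dirty bitmap
  * bits for every page the guest reported as free, so the ram bulk
  * stage skips them.  Names and layout are assumptions, not the real
  * QEMU API; both bitmaps are assumed to cover nr_pages bits.
  */
 #include <stdint.h>
 #include <limits.h>

 #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

 static void filter_out_guest_free_pages(unsigned long *migration_bitmap,
                                         const unsigned long *free_page_bitmap,
                                         uint64_t nr_pages)
 {
     uint64_t i, nr_longs = (nr_pages + BITS_PER_LONG - 1) / BITS_PER_LONG;

     for (i = 0; i < nr_longs; i++) {
         /* A page is sent only if it is dirty and not reported free. */
         migration_bitmap[i] &= ~free_page_bitmap[i];
     }
 }
 -----------------------------------------------
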
>2. Why not use virtio-balloon
>Actually, virtio-balloon can do a similar thing by inflating the
>balloon before live migration, but its performance is not good. For an
>8GB idle guest that has just booted, it takes about 5.7 sec to inflate
>the balloon to 7GB, while it only takes 25ms to get a valid free page
>bitmap from the guest. There are several reasons for the bad
>performance of virtio-balloon:
>a. allocating pages (5%, 304ms)
>b. sending PFNs to host (71%, 4194ms)
>c. address translation and madvise() operation (24%, 1423ms)
>Debugging shows that the time spent on each of these operations is
>listed in the brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX
>to a large value, such as 16384, the time spent on sending the PFNs
>can be reduced to about 400ms, but it's still too long.
>
>Obviously, the virtio-balloon mechanism has a bigger performance
>impact on the guest than the approach we are trying to implement.
>
>3. Virtio interface
>There are three different ways of using the virtio interface to
>send the free page information.
>a. Extend the current virtio device
>The virtio spec has already defined some virtio devices, and we can
>extend one of these devices so as to use it to transport the free page
>information. This requires modifying the virtio spec.
>
>b. Implement a new virtio device
>Implementing a brand new virtio device to exchange information
>between host and guest is another choice. It requires modifying the
>virtio spec too.
>
>c. Make use of virtio-serial (Amit's suggestion, my choice)
>It's possible to make use of virtio-serial for communication between
>host and guest; the benefit of this solution is that there is no need
>to modify the virtio spec.
>
>4. Construct free page bitmap
>To minimize the space for saving free page information, it's better to
>use a bitmap to describe the free pages. There are two ways to
>construct the free page bitmap.
>
>a. Construct free page bitmap on demand (My choice)
>The guest can allocate memory for the free page bitmap only when it
>receives the request from QEMU, and set the free page bitmap by
>traversing the free page list. The advantage of this way is that it's
>quite simple and easy to implement. The disadvantage is that the
>traversing operation may take quite a long time when there are a
>lot of free pages. (About 20ms for 7GB of free pages.)
>
>b. Update free page bitmap when allocating/freeing pages
>Another choice is to allocate the memory for the free page bitmap
>when the guest boots, and then update the free page bitmap when
>allocating/freeing pages. It needs more modification to the code
>related to memory management in the guest. The advantage of this way
>is that the guest can respond to QEMU's request for a free page bitmap
>very quickly, no matter how many free pages there are in the guest. Do
>the kernel guys like this?
>
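For option 4.a, I picture the guest side as a hibernation-style walk of
the zones' free lists, roughly like the sketch below. This is only how
I imagine it (the function name, locking and the bitmap_set()
granularity are my guesses), not your actual kernel patch:

 -----------------------------------------------
 /*
  * Guest/kernel-side sketch of option 4.a: walk every populated zone's
  * free lists under zone->lock and set the corresponding PFNs in a
  * caller-supplied bitmap.  Only a sketch, not the real patch.
  */
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/bitmap.h>

 static void construct_free_page_bitmap(unsigned long *bitmap,
                                        unsigned long max_pfn)
 {
     struct zone *zone;
     struct page *page;
     unsigned long pfn, flags;
     unsigned int order, t;

     for_each_populated_zone(zone) {
         spin_lock_irqsave(&zone->lock, flags);
         for_each_migratetype_order(order, t) {
             list_for_each_entry(page,
                                 &zone->free_area[order].free_list[t], lru) {
                 pfn = page_to_pfn(page);
                 /* A free block of 'order' covers 2^order contiguous pages. */
                 if (pfn + (1UL << order) <= max_pfn)
                     bitmap_set(bitmap, pfn, 1UL << order);
             }
         }
         spin_unlock_irqrestore(&zone->lock, flags);
     }
 }
 -----------------------------------------------
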
>5. Tighten the free page bitmap
>At last, the free page bitmap should be operated on together with
>ramlist.dirty_memory to filter out the free pages. We should make sure

In exec.c, the variable name is ram_list. If we use the same name in
the code and in the doc, it may be easier for the audience to
understand.

>that bit N in the free page bitmap and bit N in ramlist.dirty_memory
>correspond to the same guest page.
>On some archs, like x86, there are 'holes' in the physical memory
>address space, which means there are no actual physical RAM pages
>corresponding to some PFNs. So, some arch-specific information is
>needed to construct a proper free page bitmap.
>
>migration dirty page bitmap:
>    ---------------------
>    |a|b|c|d|e|f|g|h|i|j|
>    ---------------------
>loose free page bitmap:
>    -----------------------------
>    |a|b|c|d|e|f| | | | |g|h|i|j|
>    -----------------------------
>tight free page bitmap:
>    ---------------------
>    |a|b|c|d|e|f|g|h|i|j|
>    ---------------------
>
>There are two places for tightening the free page bitmap:
>a. In the guest
>Constructing the tight free page bitmap in the guest requires adding
>arch-related code in the guest. The advantage of this way is that less
>memory is needed to store the free page bitmap.
>b. In QEMU (My choice)
>Constructing the free page bitmap in QEMU is more flexible: we can get
>a loose free page bitmap which contains the holes, and then filter out
>the holes in QEMU. The advantage of this way is that we can keep the
>kernel code as simple as possible; the disadvantage is that more
>memory is needed to save the loose free page bitmap. Because this is
>mainly a QEMU feature, if possible, doing all the related things in
>QEMU is better.
>
>6. Handling page cache in the guest
>The memory used for page cache in the guest will change depending on
>the workload; if the guest runs some block-IO-intensive workload, there
>will

Would this improvement still help a lot when the guest has only a few
free pages? In your performance data, Case 2 mimics this kind of case,
but there the memory-consuming task is stopped before migration. If it
kept running, would we still perform better than before? I am
wondering whether it is possible to have a (configurable) threshold
that decides when to use the free page bitmap optimization.

>be lots of pages used for page cache and only a few free pages left in
>the guest. In order to get more free pages, we can choose to ask the
>guest to drop some page cache. Because dropping the page cache may
>lead to performance degradation, only the clean cache should be
>dropped, and we should let the user decide whether to do this.
>
>7. APIs for live migration
>To make things work, the following APIs should be implemented.
>
>a. Get memory info of the guest, like this:
>bool get_guest_mem_info(struct guest_mem_info *info)
>
>struct guest_mem_info is defined as below:
>
>struct guest_mem_info {
>    uint64_t free_pages_num;   /* guest's free page count */
>    uint64_t cached_pages_num; /* total cached page count */
>    uint64_t max_pfn;          /* the max PFN of the guest */
>};
>
>Return value:
>false, when QEMU or the guest can't support this operation.
>true, on success.
>
>b. Request the guest's current free page information.
>int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);
>
>Return value:
>-1, when QEMU or the guest can't support this operation.
>0, when the free page bitmap is still in the process of being constructed.
>1, when a valid free page bitmap is ready.
>
>c. Tighten the free page bitmap
>unsigned long *tighten_free_page_bmap(unsigned long *bitmap);
>
>This function is an arch-specific function that rebuilds the loose
>free page bitmap so as to get a tight bitmap which can be operated on
>easily with ramlist.dirty_memory.
>
>8. Pseudo code
>Dirty page logging should be enabled before getting the free page
>information from the guest. This is important because, during the
>process of getting free pages, some free pages may be used and written
>by the guest; dirty page logging can trace these pages. The pseudo
>code is like below:
>
> -----------------------------------------------
> MigrationState *s = migrate_get_current();
> ...
>
> memory_global_dirty_log_start();
>
> if (get_guest_mem_info(&info)) {
>     while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
>            s->state != MIGRATION_STATUS_CANCELLING) {
>         usleep(1000); /* sleep for 1 ms */
>     }
>
>     tight_free_page_bmap = tighten_free_page_bmap(free_page_bitmap);
>     filter_out_guest_free_pages(tight_free_page_bmap);
> }
>
> migration_bitmap_sync();
> ...
> -----------------------------------------------
>
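One more thought on the tighten step (5.b and API c): if the
arch-specific hole information is expressed as a list of real RAM
ranges, the QEMU-side rebuild could be as simple as the sketch below.
The RamRange table, the bit helpers and the signature (which differs
from your 7.c prototype because I pass the ranges explicitly) are
purely my assumptions for illustration; it also assumes the tight
bitmap starts out zeroed:

 -----------------------------------------------
 /*
  * Sketch of tightening: copy only the bits that belong to real RAM
  * ranges from the loose, guest-PFN-indexed bitmap into a tight bitmap
  * that lines up with ram_list.dirty_memory.  Illustration only.
  */
 #include <stdint.h>

 #define BITS_PER_ULONG (sizeof(unsigned long) * 8)

 typedef struct RamRange {
     uint64_t start_pfn;   /* first guest PFN of the range */
     uint64_t nr_pages;    /* number of pages in the range */
 } RamRange;

 static int test_bit_ul(const unsigned long *bmap, uint64_t nr)
 {
     return (bmap[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
 }

 static void set_bit_ul(unsigned long *bmap, uint64_t nr)
 {
     bmap[nr / BITS_PER_ULONG] |= 1UL << (nr % BITS_PER_ULONG);
 }

 /* Returns the number of bits written into the (pre-zeroed) tight bitmap. */
 static uint64_t tighten_free_page_bmap(unsigned long *tight,
                                        const unsigned long *loose,
                                        const RamRange *ranges, int nr_ranges)
 {
     uint64_t out = 0, j;
     int i;

     for (i = 0; i < nr_ranges; i++) {
         for (j = 0; j < ranges[i].nr_pages; j++, out++) {
             if (test_bit_ul(loose, ranges[i].start_pfn + j))
                 set_bit_ul(tight, out);
         }
     }
     return out;
 }
 -----------------------------------------------
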
--
Richard Yang
Help you, Help me