I have sent the RFC version of the patch set for live migration
optimization, which skips processing the free pages in the ram bulk
stage, and received a lot of comments. The related threads can be
found at:

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html

To make things easier, I wrote this doc about the possible designs
and my choices. Comments are welcome!

Content
=======
1. Background
2. Why not use virtio-balloon
3. Virtio interface
4. Construct free page bitmap
5. Tighten the free page bitmap
6. Handling page cache in the guest
7. APIs for live migration
8. Pseudo code

Details
=======
1. Background

In the ram bulk stage of live migration, the current QEMU
implementation marks all of the guest's RAM pages as dirty. Every
page is first checked for being a zero page, and whether the page
content is sent to the destination depends on the result of that
check. This process consumes quite a lot of CPU cycles and network
bandwidth.

From the guest's point of view, some pages are currently not in use,
and the guest does not care about their content. Free pages are pages
of this kind. We can make use of this fact and skip processing the
free pages in the ram bulk stage, which saves a lot of CPU cycles,
reduces the network traffic and speeds up the live migration process.

Usually, only the guest has the information about its free pages.
But it is possible to let the guest tell QEMU its free page
information through some mechanism, e.g. the virtio interface. Once
QEMU gets the free page information, it can skip processing these
free pages in the ram bulk stage by clearing the corresponding bits
in the migration bitmap.

2. Why not use virtio-balloon

Actually, virtio-balloon can do a similar thing by inflating the
balloon before live migration, but its performance is not good. For
an 8GB idle guest that has just booted, it takes about 5.7 seconds to
inflate the balloon to 7GB, but only 25ms to get a valid free page
bitmap from the guest. The time virtio-balloon needs is spent on
three operations (debugging shows the share and the time listed in
the brackets):

a. allocating pages (5%, 304ms)
b. sending PFNs to host (71%, 4194ms)
c. address translation and madvise() operation (24%, 1423ms)

By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value, such as
16384, the time spent on sending the PFNs can be reduced to about
400ms, but that is still too long.

The virtio-balloon mechanism therefore has a bigger performance
impact on the guest than the approach we are trying to implement.

3. Virtio interface

There are three different ways of using the virtio interface to send
the free page information.

a. Extend the current virtio device

The virtio spec already defines some virtio devices, and we can
extend one of these devices to transport the free page information.
This requires modifying the virtio spec.

b. Implement a new virtio device

Implementing a brand new virtio device to exchange information
between host and guest is another choice. This requires modifying the
virtio spec too.

c. Make use of virtio-serial (Amit's suggestion, my choice)

It is possible to use virtio-serial for communication between host
and guest. The benefit of this solution is that there is no need to
modify the virtio spec.

4. Construct free page bitmap

To minimize the space needed for saving the free page information, it
is better to use a bitmap to describe the free pages. There are two
ways to construct the free page bitmap.

a. Construct free page bitmap on demand (My choice)

The guest allocates memory for the free page bitmap only when it
receives the request from QEMU, and fills in the bitmap by traversing
the free page list. The advantage of this way is that it is quite
simple and easy to implement. The disadvantage is that the traversal
may take quite a long time when there are a lot of free pages (about
20ms for 7GB of free pages).

b. Update free page bitmap when allocating/freeing pages

Another choice is to allocate the memory for the free page bitmap
when the guest boots, and then update the bitmap whenever pages are
allocated or freed. This needs more modification to the memory
management code in the guest. The advantage of this way is that the
guest can respond to QEMU's request for a free page bitmap very
quickly, no matter how many free pages there are in the guest. Do the
kernel guys like this?

5. Tighten the free page bitmap

In the end, the free page bitmap has to be combined with
ramlist.dirty_memory to filter out the free pages. We must make sure
that bit N in the free page bitmap and bit N in ramlist.dirty_memory
correspond to the same guest page. On some architectures, like x86,
there are 'holes' in the physical address space, which means that no
actual physical RAM pages correspond to some PFNs. So some arch
specific information is needed to construct a proper free page
bitmap.

migration dirty page bitmap:
    ---------------------
    |a|b|c|d|e|f|g|h|i|j|
    ---------------------

loose free page bitmap:
    -----------------------------
    |a|b|c|d|e|f| | | | |g|h|i|j|
    -----------------------------

tight free page bitmap:
    ---------------------
    |a|b|c|d|e|f|g|h|i|j|
    ---------------------

There are two places for tightening the free page bitmap:

a. In guest

Constructing the tight free page bitmap in the guest requires adding
arch related code to the guest. The advantage of this way is that
less memory is needed to store the free page bitmap.

b. In QEMU (My choice)

Constructing the tight free page bitmap in QEMU is more flexible: we
get a loose free page bitmap which contains the holes, and then
filter out the holes in QEMU. The advantage of this way is that we
can keep the kernel code as simple as possible. The disadvantage is
that more memory is needed to store the loose free page bitmap.
Because this is mainly a QEMU feature, it is better to do all the
related work in QEMU if possible.

6. Handling page cache in the guest

The amount of memory used for page cache in the guest changes with
the workload. If the guest runs a block IO intensive workload, lots
of pages will be used for page cache and only a few free pages will
be left in the guest. To get more free pages, we can ask the guest to
drop some page cache. Because dropping the page cache may lead to
performance degradation, only the clean cache should be dropped, and
we should let the user decide whether to do this.

7. APIs for live migration

To make things work, the following APIs should be implemented.

a. Get memory info of the guest, like this:

    bool get_guest_mem_info(struct guest_mem_info *info);

struct guest_mem_info is defined as below:

    struct guest_mem_info {
        uint64_t free_pages_num;   // guest's free pages count
        uint64_t cached_pages_num; // total cached pages count
        uint64_t max_pfn;          // the max pfn of the guest
    };

Return value:
    false, when QEMU or guest can't support this operation.
    true, when success.

b. Request guest's current free pages information.

    int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);

Return value:
    -1, when QEMU or guest can't support this operation.
    0, when the free page bitmap is still being constructed.
    1, when a valid free page bitmap is ready.

c. Tighten the free page bitmap

    unsigned long *tighten_free_page_bmap(unsigned long *bitmap);

This is an arch specific function that rebuilds the loose free page
bitmap into a tight bitmap which can be combined easily with
ramlist.dirty_memory.

8. Pseudo code

Dirty page logging should be enabled before getting the free page
information from the guest. This is important because while the free
page information is being collected, some free pages may be allocated
and written by the guest; dirty page logging can track these pages.
The pseudo code is like below:

-----------------------------------------------
    MigrationState *s = migrate_get_current();
    ...

    memory_global_dirty_log_start();

    if (get_guest_mem_info(&info)) {
        /* 0: still constructing, 1: ready, -1: not supported */
        while ((ret = get_free_page_bmap(free_page_bitmap,
                                         drop_page_cache)) == 0 &&
               s->state != MIGRATION_STATUS_CANCELLING) {
            usleep(1000); /* sleep for 1 ms */
        }
        if (ret == 1) {
            tight_bitmap = tighten_free_page_bmap(free_page_bitmap);
            filter_out_guest_free_pages(tight_bitmap);
        }
    }

    migration_bitmap_sync();
    ...
-----------------------------------------------
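
As a rough illustration of the tightening step (section 5 and API 7c)
and of filtering the free pages out of the migration bitmap, the two
operations could be sketched in C as below. This is only a sketch
under assumptions: struct ram_range, the helper names, the extra
parameters and the idea of describing guest RAM as a list of PFN
ranges are all hypothetical and are not taken from the patch set or
from QEMU. It also assumes that a set bit in the free page bitmap
means "page is free" and that the caller passes a zeroed tight bitmap.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical: one contiguous range of guest PFNs backed by RAM. */
struct ram_range {
    unsigned long start_pfn; /* first PFN of the range */
    unsigned long nr_pages;  /* number of pages in the range */
};

static int test_bit_ul(const unsigned long *map, unsigned long nr)
{
    return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static void set_bit_ul(unsigned long *map, unsigned long nr)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/*
 * Copy the bits of 'loose' (indexed by guest PFN, holes included) into
 * 'tight' (indexed by position among the RAM-backed pages only).
 * 'tight' must be zeroed by the caller. Returns the number of bits
 * that are valid in 'tight'.
 */
static unsigned long tighten_bitmap(const unsigned long *loose,
                                    unsigned long loose_bits,
                                    unsigned long *tight,
                                    const struct ram_range *ranges,
                                    size_t nr_ranges)
{
    unsigned long out = 0;

    for (size_t r = 0; r < nr_ranges; r++) {
        /* Every range must lie inside the loose bitmap. */
        assert(ranges[r].start_pfn + ranges[r].nr_pages <= loose_bits);
        for (unsigned long i = 0; i < ranges[r].nr_pages; i++, out++) {
            if (test_bit_ul(loose, ranges[r].start_pfn + i)) {
                set_bit_ul(tight, out);
            }
        }
    }
    return out;
}

/* Clear the dirty bit of every page that the tight bitmap marks free. */
static void filter_out_free_pages(unsigned long *dirty,
                                  const unsigned long *tight_free,
                                  unsigned long nr_pages)
{
    unsigned long words = (nr_pages + BITS_PER_LONG - 1) / BITS_PER_LONG;

    for (unsigned long w = 0; w < words; w++) {
        dirty[w] &= ~tight_free[w];
    }
}
```

With the layout from the diagram in section 5 (RAM at PFNs 0-5, a hole
at PFNs 6-9, RAM at PFNs 10-13), the ranges would be {0, 6} and
{10, 4}, so a free page at PFN 11 ends up as bit 7 of the tight
bitmap, and any bit set inside the hole is ignored.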