> > Obviously, the virtio-balloon mechanism has a bigger performance
> > impact on the guest than the approach we are trying to implement.
>
> Yeh, we should separately try and fix that; if it's that slow then
> people will be annoyed about it when they're just using it for balloon.
>
> > 3. Virtio interface
> > There are three different ways of using the virtio interface to send
> > the free page information.
> >
> > a. Extend an existing virtio device
> > The virtio spec has already defined some virtio devices, and we can
> > extend one of these devices so as to use it to transport the free
> > page information. It requires modifying the virtio spec.
> >
> > b. Implement a new virtio device
> > Implementing a brand new virtio device to exchange information
> > between host and guest is another choice. It requires modifying the
> > virtio spec too.
>
> If the right solution is to change the spec then we should do it; we
> shouldn't use a technically worse solution just to avoid the spec
> change; although we have to be even more careful to get the right
> solution if we want to change the spec.
>
> > c. Make use of virtio-serial (Amit's suggestion, my choice)
> > It's possible to make use of virtio-serial for communication between
> > host and guest; the benefit of this solution is that there is no
> > need to modify the virtio spec.
> >
> > 4. Construct the free page bitmap
> > To minimize the space needed to save the free page information, it's
> > better to use a bitmap to describe the free pages. There are two
> > ways to construct the free page bitmap.
> >
> > a. Construct the free page bitmap on demand (my choice)
> > The guest can allocate memory for the free page bitmap only when it
> > receives the request from QEMU, and set the free page bitmap by
> > traversing the free page list. The advantage of this way is that
> > it's quite simple and easy to implement. The disadvantage is that
> > the traversing operation may take quite a long time when there are a
> > lot of free pages (about 20ms for 7GB of free pages).
>
> I wonder how that scales; 20ms isn't too bad - but I'm more worried
> about what happens when someone does it to the 1TB database VM.

It totally depends on the count of free pages in the VM. If 90% of the
memory in the 1TB VM is free pages, the time is about:
    1024 * 0.9 / 7 * 20 = 2633 ms
Is that unbearable? If so, we can use 4b to construct the free page
bitmap; hopefully the kernel guys can tolerate it.

> > b. Update the free page bitmap when allocating/freeing pages
> > Another choice is to allocate the memory for the free page bitmap
> > when the guest boots, and then update the free page bitmap when
> > allocating/freeing pages. It needs more modification to the code
> > related to memory management in the guest. The advantage of this way
> > is that the guest can respond to QEMU's request for a free page
> > bitmap very quickly, no matter how many free pages there are in the
> > guest. Do the kernel guys like this?
> >
> > 5. Tighten the free page bitmap
> > At last, the free page bitmap should be combined with the
> > ramlist.dirty_memory to filter out the free pages. We should make
> > sure that bit N in the free page bitmap and bit N in the
> > ramlist.dirty_memory correspond to the same guest page.
> > On some archs, like x86, there are 'holes' in the memory's physical
> > address space, which means there are no actual physical RAM pages
> > corresponding to some PFNs. So, some arch specific information is
> > needed to construct a proper free page bitmap.
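
To make the filtering step itself concrete: assuming the free page
bitmap has already been tightened so that bit N means the same guest
page in both bitmaps (the diagram just below shows why a 'loose' bitmap
with holes does not satisfy this), the core operation is just clearing
the free bits out of the migration dirty bitmap. A minimal sketch with
illustrative names, not actual QEMU code:

-----------------------------------------------
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only: clear the pages the guest reported as free out of the
 * migration dirty bitmap so they are never sent. Assumes both bitmaps
 * use the same page size and that bit N refers to the same guest page
 * in both.
 */
static void clear_free_pages_from_dirty_bitmap(uint64_t *dirty_bitmap,
                                               const uint64_t *free_bitmap,
                                               size_t nr_pages)
{
    size_t nr_words = (nr_pages + 63) / 64;

    for (size_t i = 0; i < nr_words; i++) {
        /* A page is sent only if it is dirty and not known to be free. */
        dirty_bitmap[i] &= ~free_bitmap[i];
    }
}
-----------------------------------------------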
> > migration dirty page bitmap:
> > ---------------------
> > |a|b|c|d|e|f|g|h|i|j|
> > ---------------------
> > loose free page bitmap:
> > -----------------------------
> > |a|b|c|d|e|f| | | | |g|h|i|j|
> > -----------------------------
> > tight free page bitmap:
> > ---------------------
> > |a|b|c|d|e|f|g|h|i|j|
> > ---------------------
> >
> > There are two places where the free page bitmap can be tightened:
> >
> > a. In the guest
> > Constructing the tight free page bitmap in the guest requires adding
> > the arch related code in the guest for building a tight bitmap. The
> > advantage of this way is that less memory is needed to store the
> > free page bitmap.
> >
> > b. In QEMU (my choice)
> > Constructing the free page bitmap in QEMU is more flexible: we can
> > get a loose free page bitmap which contains the holes, and then
> > filter out the holes in QEMU. The advantage of this way is that we
> > can keep the kernel code as simple as possible; the disadvantage is
> > that more memory is needed to save the loose free page bitmap.
> > Because this is mainly a QEMU feature, doing all the related things
> > in QEMU is better, if possible.
>
> Yes, maybe; although we'd have to be careful to validate that what the
> guest fills in makes sense.
>
> > 6. Handling page cache in the guest
> > The amount of memory used for page cache in the guest changes
> > depending on the workload; if the guest runs a block-IO-intensive
> > workload, there will be lots of pages used for the page cache and
> > only a few free pages left in the guest. In order to get more free
> > pages, we can choose to ask the guest to drop some page caches.
> > Because dropping the page cache may lead to performance degradation,
> > only the clean cache should be dropped and we should let the user
> > decide whether to do this.
> >
> > 7. APIs for live migration
> > To make things work, the following APIs should be implemented.
> >
> > a. Get memory info of the guest, like this:
> > bool get_guest_mem_info(struct guest_mem_info *info);
> >
> > struct guest_mem_info is defined as below:
> >
> > struct guest_mem_info {
> >     uint64_t free_pages_num;   // guest's free page count
> >     uint64_t cached_pages_num; // total cached page count
> >     uint64_t max_pfn;          // the max pfn of the guest
> > };
>
> What do you need max_pfn for?

It's used to decide the length of the free page bitmap. I am not sure if
we can get this information in QEMU; if so, it can be removed.

> (We'll also have to think how hotplugged memory works with this).
> Also be careful of how big a page is; some architectures can choose
> between different guest page sizes (4, 16, 64k I think on ARM), so we
> just need to make sure what unit we're dealing with. That size is also
> not necessarily the same as the unit size of the migration bitmap;
> this is always a bit tricky.

Thanks for reminding me, I have not thought about memory hot plug yet.

> > Return value:
> > false, when QEMU or the guest can't support this operation.
> > true, on success.
> >
> > b. Request the guest's current free page information.
> > int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);
> >
> > Return value:
> > -1, when QEMU or the guest can't support this operation.
> > 0, when the free page bitmap is still being constructed.
> > 1, when a valid free page bitmap is ready.
>
> I suggest not using 'long' - I know we do it a lot in QEMU but it's a
> pain; let's nail this down to a uint64_t and then we don't have to
> worry about what the guest is running.

Will change it.
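
For reference, here is roughly how the QEMU side would use these two
calls together - just a sketch, with a made-up helper name and a made-up
1 ms polling interval, and with the bitmap words already switched to
uint64_t as you suggested:

-----------------------------------------------
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* As proposed in 7a above. */
struct guest_mem_info {
    uint64_t free_pages_num;   /* guest's free page count */
    uint64_t cached_pages_num; /* total cached page count */
    uint64_t max_pfn;          /* the max pfn of the guest */
};

/* Proposed APIs (7a/7b), implemented elsewhere, e.g. over virtio-serial. */
bool get_guest_mem_info(struct guest_mem_info *info);
int get_free_page_bmap(uint64_t *bitmap, bool drop_cache);

static uint64_t *request_free_page_bitmap(bool drop_cache)
{
    struct guest_mem_info info;
    uint64_t *bitmap;
    int ret;

    if (!get_guest_mem_info(&info)) {
        return NULL;                          /* not supported */
    }

    /* max_pfn fixes the bitmap length: one bit per guest page frame. */
    bitmap = calloc((info.max_pfn + 63) / 64, sizeof(uint64_t));

    /* -1: unsupported, 0: still being constructed, 1: bitmap is ready. */
    while ((ret = get_free_page_bmap(bitmap, drop_cache)) == 0) {
        usleep(1000);                         /* wait 1 ms and retry */
    }
    if (ret < 0) {
        free(bitmap);
        return NULL;
    }
    return bitmap;
}
-----------------------------------------------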
> > c. Tighten the free page bitmap
> > unsigned long *tighten_free_page_bmap(unsigned long *bitmap);
> >
> > This function is an arch specific function to rebuild the loose free
> > page bitmap so as to get a tight bitmap which can easily be combined
> > with ramlist.dirty_memory.
>
> I'm not sure you actually need this; as long as what you expect is
> just a (small) series of chunks of bitmap; then you'd just have
> something like (start at 0... ) (start at 1MB...) (start at 1GB....)

Yes, if using this kind of bitmap, tighten_free_page_bmap is not
needed. Thanks!

> > 8. Pseudo code
> > Dirty page logging should be enabled before getting the free page
> > information from the guest. This is important because, during the
> > process of getting the free pages, some free pages may be used and
> > written by the guest; dirty page logging can track these pages. The
> > pseudo code is like below:
> >
> > -----------------------------------------------
> > MigrationState *s = migrate_get_current();
> > ...
> >
> > memory_global_dirty_log_start();
> >
> > if (get_guest_mem_info(&info)) {
> >     while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
> >            s->state != MIGRATION_STATUS_CANCELLING) {
> >         usleep(1000); // sleep for 1 ms
> >     }
> >
> >     tighten_free_page_bmap = tighten_guest_free_pages(free_page_bitmap);
> >     filter_out_guest_free_pages(tighten_free_page_bmap);
> > }
>
> Given the typical speed of networks, it wouldn't do too much harm to
> start sending assuming all pages are dirty and then, when the guest
> finally gets around to finishing the bitmap, update - so it's
> asynchronous - and then if the guest never responds we don't really
> care.

Indeed, thanks!

Liang

> Dave
>
> >
> > migration_bitmap_sync();
> > ...
> >
> > -----------------------------------------------
> >
> >
> > --
> > 1.9.1
>
> --
> Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
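
P.S. To make sure I understand the "series of chunks of bitmap" idea,
below is the kind of thing I have in mind - just a sketch with made-up
struct and function names, not a proposed wire format. Since each chunk
carries its own start PFN, QEMU can apply it to the migration dirty
bitmap whenever the guest's reply arrives, even after we have started
sending with everything assumed dirty, and no arch specific tightening
is needed on either side.

-----------------------------------------------
#include <stddef.h>
#include <stdint.h>

struct free_page_chunk {
    uint64_t start_pfn;      /* first guest PFN covered by this chunk */
    uint64_t nr_pages;       /* number of pages covered */
    const uint64_t *bitmap;  /* one bit per page, 1 = free */
};

/* Clear the reported free pages out of the migration dirty bitmap. */
static void apply_free_page_chunks(uint64_t *dirty_bitmap,
                                   const struct free_page_chunk *chunks,
                                   size_t nr_chunks)
{
    for (size_t c = 0; c < nr_chunks; c++) {
        for (uint64_t i = 0; i < chunks[c].nr_pages; i++) {
            if (chunks[c].bitmap[i / 64] & (1ULL << (i % 64))) {
                uint64_t pfn = chunks[c].start_pfn + i;
                /* This page is free in the guest: don't send it. */
                dirty_bitmap[pfn / 64] &= ~(1ULL << (pfn % 64));
            }
        }
    }
}
-----------------------------------------------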