RE: [RFC Design Doc] Speed up live migration by skipping free pages

> > Obviously, the virtio-balloon mechanism has a bigger performance
> > impact on the guest than the approach we are trying to implement.
> 
> Yeh, we should separately try and fix that; if it's that slow then people will be
> annoyed about it when they're just using it for balloon.
> 
> > 3. Virtio interface
> > There are three different ways of using the virtio interface to send
> > the free page information.
> > a. Extend the current virtio device
> > The virtio spec has already defined some virtio devices, and we can
> > extend one of these devices to transport the free page information.
> > This requires modifying the virtio spec.
> >
> > b. Implement a new virtio device
> > Implementing a brand new virtio device to exchange information between
> > host and guest is another choice. It requires modifying the virtio
> > spec too.
> 
> If the right solution is to change the spec then we should do it; we shouldn't
> use a technically worse solution just to avoid the spec change; although we
> have to be even more careful to get the right solution if we want to change
> the spec.
> 
> > c. Make use of virtio-serial (Amit's suggestion, my choice)
> > It's possible to use virtio-serial for communication between host
> > and guest; the benefit of this solution is that there is no need to
> > modify the virtio spec.
> >
> > 4. Construct free page bitmap
> > To minimize the space for saving free page information, it’s better to
> > use a bitmap to describe the free pages. There are two ways to
> > construct the free page bitmap.
> >
> > a. Construct free page bitmap on demand (My choice)
> > The guest can allocate memory for the free page bitmap only when it
> > receives the request from QEMU, and set the free page bitmap by
> > traversing the free page list. The advantage of this way is that it's
> > quite simple and easy to implement. The disadvantage is that the
> > traversing operation may take quite a long time when there are a lot
> > of free pages. (About 20ms for 7GB of free pages)
> 
> I wonder how that scales; 20ms isn't too bad - but I'm more worried about
> what happens when someone does it to the 1TB database VM.

That totally depends on the number of free pages in the VM. If 90% of the
memory in a 1TB VM is free, the time is about:
    1024 * 0.9 / 7 * 20 = 2633 ms

Is that unbearable? If so, we can use 4.b to construct the free page bitmap;
hopefully the kernel guys can tolerate it.

> > b. Update free page bitmap when allocating/freeing pages
> > Another choice is to allocate the memory for the free page bitmap when
> > the guest boots, and then update the free page bitmap when
> > allocating/freeing pages. It needs more modification to the memory
> > management code in the guest. The advantage of this way is that the
> > guest can respond to QEMU's request for a free page bitmap very
> > quickly, no matter how many free pages there are in the guest. Do the
> > kernel guys like this?
> >
> > 5. Tighten the free page bitmap
> > Finally, the free page bitmap should be combined with
> > ramlist.dirty_memory to filter out the free pages. We should make sure
> > that bit N in the free page bitmap and bit N in ramlist.dirty_memory
> > correspond to the same guest page.
> > On some arches, like x86, there are 'holes' in the memory's physical
> > address space, which means there are no actual physical RAM pages
> > corresponding to some PFNs. So, some arch specific information is
> > needed to construct a proper free page bitmap.
> >
> > migration dirty page bitmap:
> >     ---------------------
> >     |a|b|c|d|e|f|g|h|i|j|
> >     ---------------------
> > loose free page bitmap:
> >     -----------------------------
> >     |a|b|c|d|e|f| | | | |g|h|i|j|
> >     -----------------------------
> > tight free page bitmap:
> >     ---------------------
> >     |a|b|c|d|e|f|g|h|i|j|
> >     ---------------------
> >
> > There are two places for tightening the free page bitmap:
> > a. In guest
> > Constructing the free page bitmap in the guest requires adding arch
> > related code in the guest to build a tight bitmap. The advantage of
> > this way is that less memory is needed to store the free page bitmap.
> > b. In QEMU (My choice)
> > Constructing the free page bitmap in QEMU is more flexible: we can get
> > a loose free page bitmap which contains the holes, and then filter out
> > the holes in QEMU. The advantage of this way is that we can keep the
> > kernel code as simple as possible; the disadvantage is that more
> > memory is needed to save the loose free page bitmap. Because this is
> > mainly a QEMU feature, it is better to do all the related work in QEMU
> > if possible.
> 
> Yes, maybe; although we'd have to be careful to validate that what the
> guest fills in makes sense.
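
For illustration, the QEMU-side tightening in 5.b could look roughly like
this (untested sketch; RAMRegion and all helper names here are made up,
and the real hole layout would come from the arch specific memory map):

    #include <stdint.h>
    #include <limits.h>

    #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

    static inline int bmap_test(const unsigned long *b, uint64_t n)
    {
        return (b[n / BITS_PER_LONG] >> (n % BITS_PER_LONG)) & 1;
    }

    static inline void bmap_set(unsigned long *b, uint64_t n)
    {
        b[n / BITS_PER_LONG] |= 1UL << (n % BITS_PER_LONG);
    }

    typedef struct RAMRegion {
        uint64_t start_pfn;   /* first guest PFN of the region */
        uint64_t npages;      /* pages in the region (no holes inside) */
    } RAMRegion;

    /* Squeeze the loose guest bitmap (which spans the holes) into a
     * tight bitmap that lines up bit-for-bit with ramlist.dirty_memory. */
    static void tighten_bitmap(const unsigned long *loose, unsigned long *tight,
                               const RAMRegion *regions, int nr_regions)
    {
        uint64_t out = 0;

        for (int i = 0; i < nr_regions; i++) {
            for (uint64_t j = 0; j < regions[i].npages; j++, out++) {
                if (bmap_test(loose, regions[i].start_pfn + j)) {
                    bmap_set(tight, out);
                }
            }
        }
    }

Validating the guest's data could then be as simple as checking that no bit
is set outside the known RAM regions before the tightening step.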
> 
> > 6. Handling page cache in the guest
> > The amount of memory used for page cache in the guest changes
> > depending on the workload; if the guest runs a block IO intensive
> > workload, there will be lots of pages used for page cache and only a
> > few free pages left in the guest. In order to get more free pages, we
> > can choose to ask the guest to drop some page cache. Because dropping
> > the page cache may lead to performance degradation, only the clean
> > cache should be dropped, and we should let the user decide whether to
> > do this.
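
One possible mechanism for dropping only clean cache, just as a sketch from
the guest side (the final implementation may well use something different,
e.g. an in-kernel equivalent driven by the virtio request):

    #include <fcntl.h>
    #include <unistd.h>

    /* Ask the guest kernel to drop clean page cache; writing "1" to
     * /proc/sys/vm/drop_caches only frees clean cache pages, dirty
     * pages are left alone. */
    static int drop_clean_page_cache(void)
    {
        int fd, ret = -1;

        sync();    /* write back dirty data so more cache becomes clean */
        fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        if (fd >= 0) {
            ret = (write(fd, "1", 1) == 1) ? 0 : -1;
            close(fd);
        }
        return ret;
    }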
> >
> > 7. APIs for live migration
> > To make things work, the following APIs should be implemented.
> >
> > a. Get memory info of the guest, like this:
> > bool get_guest_mem_info(struct guest_mem_info *info);
> >
> > struct guest_mem_info is defined as below:
> >
> > struct guest_mem_info {
> >     uint64_t free_pages_num;    // guest's free page count
> >     uint64_t cached_pages_num;  // total cached page count
> >     uint64_t max_pfn;           // the max pfn of the guest
> > };
> 
> What do you need max_pfn for?
> 

It's used to decide the length of the free page bitmap. I am not sure whether
we can get this information in QEMU; if so, it can be removed.

> (We'll also have to think how hotplugged memory works with this).
> Also be careful of how big a page is;  some architectures can choose between
> different guest page sizes (4, 16, 64k I think on ARM), so we just need to
> make sure what unit we're dealing with.  That size is also not necessarily the
> same as the unit size of the migration bitmap; this is always a bit tricky.
> 

Thanks for the reminder, I have not thought about memory hotplug yet.

> > Return value:
> > false, when QEMU or the guest can't support this operation.
> > true, on success.
> >
> > b. Request guest’s current free pages information.
> > int get_free_page_bmap(unsigned long *bitmap,  bool drop_cache);
> >
> > Return value:
> > -1, when QEMU or the guest can't support this operation.
> > 0, when the free page bitmap is still being constructed.
> > 1, when a valid free page bitmap is ready.
> 
> I suggest not using 'long' - I know we do it a lot in QEMU but it's a pain;
> let's nail this down to a uint64_t and then we don't have to worry about what
> the guest is running.
> 

Will change it.
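
I.e. the prototype would become something like:

    int get_free_page_bmap(uint64_t *bitmap, bool drop_cache);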

> > c. Tighten the free page bitmap
> > unsigned long * tighten_free_page_bmap(unsigned long *bitmap);
> >
> > This function is an arch specific function that rebuilds the loose free
> > page bitmap into a tight bitmap which can be combined easily with
> > ramlist.dirty_memory.
> 
> I'm not sure you actually need this; as long as what you expect is just a (small)
> series of chunks of bitmap; then you'd just have something like (start at 0... )
> (start at 1MB...) (start at 1GB....)
> 
Yes, if using this kind of bitmap, tighten_free_page_bmap is not needed. Thanks!
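
For reference, one possible layout for such a chunked bitmap (illustrative
only, not from any existing patch):

    #include <stdint.h>

    /* The guest would send a small series of these instead of one bitmap
     * spanning the holes. */
    struct free_page_chunk {
        uint64_t start_pfn;    /* first guest PFN covered by this chunk */
        uint64_t npages;       /* pages covered, i.e. valid bits in bitmap[] */
        uint64_t bitmap[];     /* 1 bit per page, 1 = free */
    };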

> > 8. Pseudo code
> > Dirty page logging should be enabled before getting the free page
> > information from the guest. This is important because, during the
> > process of getting the free pages, some of them may be reused and
> > written by the guest; dirty page logging can track these pages. The
> > pseudo code is like below:
> >
> >     -----------------------------------------------
> >     MigrationState *s = migrate_get_current();
> >     ...
> >
> >     memory_global_dirty_log_start();
> >
> >     if (get_guest_mem_info(&info)) {
> >         while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
> >                s->state != MIGRATION_STATUS_CANCELLING) {
> >             usleep(1000); // sleep for 1 ms
> >         }
> >
> >         tight_free_page_bmap = tighten_free_page_bmap(free_page_bitmap);
> >         filter_out_guest_free_pages(tight_free_page_bmap);
> >     }
> 
> Given the typical speed of networks; it wouldn't do too much harm to start
> sending assuming all pages are dirty and then when the guest finally gets
> around to finishing the bitmap then update, so it's asynchronous - and then if
> the guest never responds we don't really care.

Indeed, thanks!
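
Roughly, the asynchronous flow could then look like this (hypothetical
helper names, just to capture the idea):

    memory_global_dirty_log_start();
    request_free_page_bmap(drop_page_cache);    /* fire and forget */
    ...
    /* in the migration iteration loop: */
    if (free_page_bmap_ready()) {               /* non-blocking poll */
        filter_out_guest_free_pages(received_free_page_bmap());
    }
    /* if the guest never responds, all pages simply stay marked dirty */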

Liang
> 
> Dave
> 
> >
> >     migration_bitmap_sync();
> >     ...
> >
> >     -----------------------------------------------
> >
> >
> > --
> > 1.9.1
> >
> --
> Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK