> > To make things easier, I wrote this doc about the possible designs and
> > my choices. Comments are welcome!
>
> Thanks for putting this together, and especially for taking the trouble
> to benchmark existing code paths!
>
> I think these numbers do show that there are gains to be had from
> merging your code with the existing balloon device. It will probably be
> a bit more work, but I think it'll be worth it.
>
> More comments below.

Thanks for your comments!

> > 2. Why not use virtio-balloon
> > Actually, the virtio-balloon can do a similar thing by inflating the
> > balloon before live migration, but its performance is not good. For an
> > 8GB idle guest that has just booted, it takes about 5.7 seconds to
> > inflate the balloon to 7GB, while it only takes 25ms to get a valid
> > free page bitmap from the guest. There are several reasons for the
> > bad performance of virtio-balloon:
> > a. allocating pages (5%, 304ms)
>
> Interesting. This is definitely worth improving in guest kernel.
> Also, will it be faster if we allocate and pass huge pages to the guest
> instead? Might speed up madvise as well.

Maybe.

> > b. sending PFNs to host (71%, 4194ms)
>
> OK, so we probably should teach balloon to pass huge lists in bitmaps.
> Will be beneficial for regular balloon operation, as well.

Agree. The current balloon only sends 256 PFNs at a time; that's too few
and leads to too many virtio transmissions, which is the main reason for
the bad performance. Changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a larger
value can improve the performance significantly. Maybe we should increase
it before doing the further optimization, do you think so?

> > c. address translation and madvise() operation (24%, 1423ms)
>
> How is this split between translation and madvise? I suspect it's
> mostly madvise since you need translation when using bitmap as well.
> Correct? Could you measure this please? Also, what if we use the new
> MADV_FREE instead? By how much would this help?

For the current balloon, address translation is needed.
But for live migration, there is no need to do address translation.

I did another test and got the following data:
a. allocating pages (6.4%, 402ms)
b. sending PFNs to host (68.3%, 4263ms)
c. address translation (6.2%, 389ms)
d. madvise (19.0%, 1188ms)

Address translation is a time-consuming operation too.
I will try MADV_FREE later.

> Finally, we could teach balloon to skip madvise completely.
> By how much would this help?
>
> > Debugging shows the time spent on these operations is listed in the
> > brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large
> > value, such as 16384, the time spent on sending the PFNs can be
> > reduced to about 400ms, but it's still too long.
> > Obviously, the virtio-balloon mechanism has a bigger performance
> > impact on the guest than the approach we are trying to implement.
>
> Since as we see some of the new interfaces might be beneficial to
> balloon as well, I am rather of the opinion that extending the balloon
> (basically 3a) might be the right thing to do.
>
> > 3. Virtio interface
> > There are three different ways of using the virtio interface to send
> > the free page information.
> > a. Extend the current virtio device
> > The virtio spec has already defined some virtio devices, and we can
> > extend one of these devices so as to use it to transport the free
> > page information. It requires modifying the virtio spec.
>
> You don't have to do it all by yourself by the way.
> Submit the proposal to the oasis virtio tc mailing list, we will take
> it from there.

That's great.
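To make the bitmap idea above a bit more concrete, below is a rough
sketch of the kind of chunk the guest could put on a virtqueue to report
free pages as a bitmap instead of a PFN array. The struct name and fields
are only illustrative; nothing like this exists in the virtio spec or in
the current balloon code:

-----------------------------------------------
#include <stdint.h>

/*
 * Illustrative only: one chunk of free page information covering a
 * contiguous PFN range. A 4KB bitmap describes 32768 pages (128MB with
 * 4KB pages), so an 8GB guest needs about 64 such chunks instead of
 * thousands of 256-entry PFN arrays.
 */
struct free_page_bmap_chunk {
    uint64_t start_pfn;    /* first PFN covered by this chunk */
    uint64_t nr_pages;     /* number of pages covered */
    uint64_t bmap_len;     /* length of bitmap[] in bytes */
    uint8_t  bitmap[];     /* 1 bit per page, 1 = page is free */
};
-----------------------------------------------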
> > 4. Construct free page bitmap
> > To minimize the space for saving free page information, it's better
> > to use a bitmap to describe the free pages. There are two ways to
> > construct the free page bitmap.
> >
> > a. Construct free page bitmap on demand (My choice)
> > Guest can allocate memory for the free page bitmap only when it
> > receives the request from QEMU, and set the free page bitmap by
> > traversing the free page list. The advantage of this way is that it's
> > quite simple and easy to implement. The disadvantage is that the
> > traversing operation may consume quite a long time when there are a
> > lot of free pages. (About 20ms for 7GB of free pages)
> >
> > b. Update free page bitmap when allocating/freeing pages
> > Another choice is to allocate the memory for the free page bitmap
> > when the guest boots, and then update the free page bitmap when
> > allocating/freeing pages. It needs more modification to the code
> > related to memory management in the guest. The advantage of this way
> > is that the guest can respond to QEMU's request for a free page
> > bitmap very quickly, no matter how many free pages there are in the
> > guest. Do the kernel guys like this?
> >
> > 8. Pseudo code
> > Dirty page logging should be enabled before getting the free page
> > information from the guest. This is important because during the
> > process of getting free pages, some free pages may be used and
> > written by the guest; dirty page logging can track these pages. The
> > pseudo code is like below:
> >
> > -----------------------------------------------
> > MigrationState *s = migrate_get_current();
> > ...
> >
> > memory_global_dirty_log_start();
> >
> > if (get_guest_mem_info(&info)) {
> >     while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
> >            s->state != MIGRATION_STATUS_CANCELLING) {
> >         usleep(1000); // sleep for 1 ms
> >     }
> >
> >     tighten_free_page_bmap = tighten_guest_free_pages(free_page_bitmap);
> >     filter_out_guest_free_pages(tighten_free_page_bmap);
> > }
> >
> > migration_bitmap_sync();
> > ...
> >
> > -----------------------------------------------
>
> I don't completely agree with this part. In my opinion, it should be
> asynchronous, depending on getting page lists from guest:
>
> anywhere/periodically:
> ...
> request_guest_mem_info
> ...

Periodically? That means filtering out guest free pages not only in the
ram bulk stage, but during the whole process of live migration, right?
If so, it's better to use 4b to construct the free page bitmap.

> later:
>
> handle_guest_mem_info()
> {
>     address_space_sync_dirty_bitmap
>     filter_out_guest_free_pages
> }
>
> as long as we filter with VCPU stopped like this, we can drop the sync
> dirty stage, or alternatively we could move filter_out_guest_free_pages
> into bh so it happens later while VCPU is running.
>
> This removes any need for waiting.
>
> Introducing delay into migration might still be beneficial but this way
> it is optional, we still get part of the benefit even if we don't wait
> long enough.

Yes, I agree asynchronous mode is better and I will change it.
From the perspective of saving resources (CPU and network bandwidth),
waiting is not so bad. :)

Liang
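P.S. Just to check that I understood the asynchronous flow you described,
this is roughly what I have in mind on the QEMU side. It is only a
sketch: request_guest_mem_info() and filter_out_guest_free_pages() are
the placeholder names used earlier in this thread, not existing QEMU
functions.

-----------------------------------------------
/* Placeholder declarations, names taken from the pseudo code above. */
void request_guest_mem_info(void);
void migration_bitmap_sync(void);
void filter_out_guest_free_pages(unsigned long *free_page_bitmap);

/* Called periodically from the migration thread; returns immediately
 * instead of waiting for the guest. */
static void migration_request_free_pages(void)
{
    request_guest_mem_info();
}

/* Called later, when the guest's free page bitmap arrives on the
 * virtqueue. */
static void handle_guest_mem_info(unsigned long *free_page_bitmap)
{
    /* Sync the dirty log first, so pages the guest wrote while
     * building the bitmap are not skipped by mistake. */
    migration_bitmap_sync();

    /* Clear the migration bitmap bits for pages the guest reported as
     * free; this could also be deferred to a bottom half so it runs
     * while the VCPU is running. */
    filter_out_guest_free_pages(free_page_bitmap);
}
-----------------------------------------------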