> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
>
> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>> This is v8 of the multifd save prototype, which fixes a few bugs,
>> adds a few more code splits, and records the number of channels
>> as well as the compression algorithm, so the restore command is
>> more user-friendly.
>>
>> It is now possible to just say:
>>
>>   virsh save mydomain /mnt/saves/mysave --parallel
>>
>>   virsh restore /mnt/saves/mysave --parallel
>>
>> and things work with the default of 2 channels, no compression.
>>
>> It is also possible to say of course:
>>
>>   virsh save mydomain /mnt/saves/mysave --parallel
>>     --parallel-connections 16 --parallel-compression zstd
>>
>>   virsh restore /mnt/saves/mysave --parallel
>>
>> and things also work fine, due to channels and compression
>> being stored in the main save file.
>
> For the sake of people following along, the above commands will
> result in the creation of multiple files:
>
>   /mnt/saves/mysave
>   /mnt/saves/mysave.0
>   /mnt/saves/mysave.1
>   ....
>   /mnt/saves/mysave.n
>
> where 'n' is the number of threads used.
>
> Overall I'm not very happy with the approach of doing any of this
> on the libvirt side.
>
> Backing up, we know that QEMU can directly save to disk faster than
> libvirt can. We mitigated a lot of that overhead with previous patches
> to increase the pipe buffer size, but some still remains due to the
> extra copies inherent in handing this off to libvirt.
>
> Using multifd on the libvirt side, IIUC, gets us better performance
> than QEMU can manage if doing a non-multifd write to file directly,
> but we still have the extra copies in there due to the hand-off
> to libvirt. If QEMU were directly capable of writing to disk with
> multifd, it should beat us again.
>
> As a result of how we integrate with QEMU multifd, we're taking the
> approach of saving the state across multiple files, because it is
> easier than trying to get multiple threads writing to the same file.
> It could be solved by using file range locking on the save file,
> e.g. a thread can reserve say 500 MB of space, fill it up, and then
> reserve another 500 MB, etc, etc. It is a bit tedious though and
> won't align nicely; e.g. a 1 GB huge page would be 1 GB plus a few
> bytes of QEMU RAM save state header.

First, I do not understand why you would write things that are not
page-aligned to start with. (As an aside, I don’t know how any dirty
tracking would work if you do not keep things page-aligned.) Could
uffd_register_memory accept a memory range that is not aligned? If so,
when? Should that be specified in the interface?

Second, instead of creating multiple files, why not write blocks at a
location determined by a variable that you increment using atomic
operations each time you need a new block? If you want to keep the
blocks page-aligned in the file as well (which might help if you want
to mmap the file at some point), then you need to build a map of the
blocks that you tack onto the end of the file.

There may be good reasons not to do it that way, of course, but I am
not familiar enough with the problem to know them.
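For concreteness, here is a minimal sketch of that second idea: several
writer threads share one save file and claim space with an atomic
fetch-and-add instead of file-range locking. This is only an
illustration of the allocation scheme, not QEMU or libvirt code; the
block size, thread count, file name and the handling of the block map
are all made-up assumptions.

/*
 * Hypothetical sketch: writer threads share one fd and reserve
 * disjoint, page-aligned ranges with an atomic fetch-and-add,
 * so no range locking is needed.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE        (2 * 1024 * 1024)  /* assumed 2 MiB, page-aligned */
#define NUM_THREADS       4
#define BLOCKS_PER_THREAD 8

static _Atomic uint64_t next_offset;  /* next free position in the file */
static int save_fd;

static void *writer(void *arg)
{
    long id = (long)arg;
    char *buf = malloc(BLOCK_SIZE);

    if (!buf)
        return NULL;

    for (int i = 0; i < BLOCKS_PER_THREAD; i++) {
        /* Reserve BLOCK_SIZE bytes; no other thread can get this range. */
        uint64_t off = atomic_fetch_add(&next_offset, BLOCK_SIZE);

        memset(buf, 'A' + id, BLOCK_SIZE);   /* stand-in for guest RAM data */
        if (pwrite(save_fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
            perror("pwrite");
            break;
        }
        /*
         * A real implementation would also record (guest page -> off)
         * in a block map appended after the data, so restore can find
         * each page again.
         */
    }
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t th[NUM_THREADS];

    save_fd = open("mysave.data", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (save_fd < 0) {
        perror("open");
        return 1;
    }
    atomic_store(&next_offset, 0);

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&th[i], NULL, writer, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(th[i], NULL);

    close(save_fd);
    return 0;
}

Since the offsets come from a single shared counter, the number of
writer threads does not have to match anything recorded in the file;
only the block map matters for restore.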
> The other downside of multiple files is that it complicates life
> for both libvirt and apps using libvirt. They need to be aware of
> multiple files and move them around together. This is not as simple
> as it might sound. For example, IIRC OpenStack would upload a save
> image state into a glance bucket for later use. Well, now it needs
> multiple distinct buckets and to keep track of them all. It also means
> we're forced to use the same concurrency level when restoring, which
> is not necessarily desirable if the host environment is different
> when restoring, i.e. the original host might have had 8 CPUs, but the
> new host might only have 4 available, or vice versa.
>
> I know it is appealing to do something on the libvirt side, because
> it is quicker than getting an enhancement into a new QEMU release. We
> have been down this route before with the migration support in libvirt
> in the past though, when we introduced tunnelled live migration
> in order to work around QEMU's inability to do TLS encryption. I very
> much regret that we ever did this, because tunnelled migration was
> inherently limited, so for example it failed to work with multifd,
> and failed to work with NBD-based disk migration. In the end I did
> what I should have done at the beginning and just added TLS support
> to QEMU, making tunnelled migration obsolete, except we still have
> to carry the code around in libvirt indefinitely due to apps using
> it.
>
> So I'm very concerned about not having history repeat itself and
> give us a long-term burden for a solution that turns out to be an
> evolutionary dead end.
>
> I like the idea of parallel saving, but I really think we need to
> implement this directly in QEMU, not libvirt. As previously
> mentioned, I think QEMU needs to get a 'file' migration protocol,
> along with the ability to directly map RAM segments into fixed
> positions in the file. The benefits are many:
>
>  - It will save & restore faster because we're eliminating data
>    copies that libvirt imposes via the iohelper
>
>  - It is simple for libvirt & mgmt apps as we still only
>    have one file to manage
>
>  - It is space efficient because if a guest dirties a
>    memory page, we just overwrite the existing contents
>    at the fixed location in the file, instead of appending
>    new contents to the file
>
>  - It will restore faster too because we only restore
>    each memory page once, due to always overwriting the
>    file in-place when the guest dirtied a page during save
>
>  - It can save and restore with differing numbers of threads,
>    and can even dynamically change the number of threads
>    in the middle of the save/restore operation
>
> As David G has pointed out, the impl is not trivial on the QEMU
> side, but from what I understand of the migration code, it is
> certainly viable. Most importantly, I think it puts us in a
> better position for long-term feature enhancements later by
> taking the middle man (libvirt) out of the equation, letting
> QEMU directly know what medium it is saving/restoring to/from.
>
>
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
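To make the in-place overwrite benefit above concrete, here is a
minimal sketch of what mapping RAM segments to fixed positions in a
file could look like: each guest page's offset is a pure function of
its index, so a page that is dirtied again is simply rewritten in the
same slot. This is only an illustration under assumed values (4 KiB
pages, a one-page header, hypothetical helper names), not the actual
QEMU migration code.

/*
 * Hypothetical sketch: every RAM page has a fixed slot in the save
 * file, so re-dirtied pages are overwritten in place and the file
 * never grows beyond header + guest RAM size.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE   4096
#define HEADER_SIZE PAGE_SIZE          /* assumed fixed-size file header */

/* Offset of a guest page is a pure function of its index. */
static off_t page_offset(uint64_t page_index)
{
    return HEADER_SIZE + (off_t)page_index * PAGE_SIZE;
}

/* Write (or re-write) one guest page at its fixed slot in the file. */
static int save_page(int fd, uint64_t page_index, const void *data)
{
    ssize_t n = pwrite(fd, data, PAGE_SIZE, page_offset(page_index));
    return n == PAGE_SIZE ? 0 : -1;
}

int main(void)
{
    char page[PAGE_SIZE];
    int fd = open("mysave.ram", O_CREAT | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* First pass: page 42 is written once... */
    memset(page, 0x11, sizeof(page));
    save_page(fd, 42, page);

    /*
     * ...the guest dirties it again: the same slot is overwritten,
     * the file does not grow, and restore reads the page exactly once.
     */
    memset(page, 0x22, sizeof(page));
    save_page(fd, 42, page);

    close(fd);
    return 0;
}

With such a layout the number of reader or writer threads is a purely
runtime choice, independent of how many threads produced the file,
which matches the last bullet point above.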