> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
>
> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>> This is v8 of the multifd save prototype, which fixes a few bugs,
>> adds a few more code splits, and records the number of channels
>> as well as the compression algorithm, so the restore command is
>> more user-friendly.
>>
>> It is now possible to just say:
>>
>>   virsh save mydomain /mnt/saves/mysave --parallel
>>
>>   virsh restore /mnt/saves/mysave --parallel
>>
>> and things work with the default of 2 channels, no compression.
>>
>> It is also possible to say of course:
>>
>>   virsh save mydomain /mnt/saves/mysave --parallel
>>     --parallel-connections 16 --parallel-compression zstd
>>
>>   virsh restore /mnt/saves/mysave --parallel
>>
>> and things also work fine, due to channels and compression
>> being stored in the main save file.
>
> For the sake of people following along, the above commands will
> result in the creation of multiple files:
>
>   /mnt/saves/mysave
>   /mnt/saves/mysave.0
>   /mnt/saves/mysave.1
>   ....
>   /mnt/saves/mysave.n
>
> where 'n' is the number of threads used.
>
> Overall I'm not very happy with the approach of doing any of this
> on the libvirt side.
>
> Backing up, we know that QEMU can directly save to disk faster than
> libvirt can. We mitigated a lot of that overhead with previous patches
> to increase the pipe buffer size, but some still remains due to the
> extra copies inherent in handing this off to libvirt.
>
> Using multifd on the libvirt side, IIUC, gets us better performance
> than QEMU can manage if doing a non-multifd write to file directly,
> but we still have the extra copies in there due to the hand-off
> to libvirt. If QEMU were directly capable of writing to disk with
> multifd, it should beat us again.
>
> As a result of how we integrate with QEMU multifd, we're taking the
> approach of saving the state across multiple files, because it is
> easier than trying to get multiple threads writing to the same file.
> It could be solved by using file range locking on the save file,
> e.g. a thread can reserve say 500 MB of space, fill it up, and then
> reserve another 500 MB, etc, etc. It is a bit tedious though and
> won't align nicely; e.g. a 1 GB huge page would be 1 GB plus a few
> bytes of QEMU RAM save state header.

First, I do not understand why you would write things that are not
page-aligned to start with. (As an aside, I don’t know how any dirty
tracking would work if you do not keep things page-aligned.) Could
uffd_register_memory accept a memory range that is not aligned? If so,
when? Should that be specified in the interface?

Second, instead of creating multiple files, why not write blocks at a
location determined by a variable that you increment using atomic
operations each time you need a new block? If you want to keep the
blocks page-aligned in the file as well (which might help if you want
to mmap the file at some point), then you need to build a map of the
blocks that you tack onto the end of the file.

There may be good reasons not to do it that way, of course, but I am
not familiar enough with the problem to know them.
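For concreteness, here is a minimal sketch of that second idea: several
writer threads share one save file and claim space with an atomic
fetch-and-add instead of file-range locking. This is only an
illustration of the allocation scheme, not QEMU or libvirt code; the
block size, thread count, file name and the handling of the block map
are all made-up assumptions.

/*
 * Hypothetical sketch: writer threads share one fd and reserve
 * disjoint, page-aligned ranges with an atomic fetch-and-add,
 * so no range locking is needed.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE        (2 * 1024 * 1024)  /* assumed 2 MiB, page-aligned */
#define NUM_THREADS       4
#define BLOCKS_PER_THREAD 8

static _Atomic uint64_t next_offset;  /* next free position in the file */
static int save_fd;

static void *writer(void *arg)
{
    long id = (long)arg;
    char *buf = malloc(BLOCK_SIZE);

    if (!buf)
        return NULL;

    for (int i = 0; i < BLOCKS_PER_THREAD; i++) {
        /* Reserve BLOCK_SIZE bytes; no other thread can get this range. */
        uint64_t off = atomic_fetch_add(&next_offset, BLOCK_SIZE);

        memset(buf, 'A' + id, BLOCK_SIZE);   /* stand-in for guest RAM data */
        if (pwrite(save_fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
            perror("pwrite");
            break;
        }
        /*
         * A real implementation would also record (guest page -> off)
         * in a block map appended after the data, so restore can find
         * each page again.
         */
    }
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t th[NUM_THREADS];

    save_fd = open("mysave.data", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (save_fd < 0) {
        perror("open");
        return 1;
    }
    atomic_store(&next_offset, 0);

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&th[i], NULL, writer, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(th[i], NULL);

    close(save_fd);
    return 0;
}

Since the offsets come from a single shared counter, the number of
writer threads does not have to match anything recorded in the file;
only the block map matters for restore.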
> The other downside of multiple files is that it complicates life
> for both libvirt and apps using libvirt. They need to be aware of
> multiple files and move them around together. This is not as simple
> as it might sound. For example, IIRC OpenStack would upload a save
> image state into a glance bucket for later use. Well, now it needs
> multiple distinct buckets and to keep track of them all. It also means
> we're forced to use the same concurrency level when restoring, which
> is not necessarily desirable if the host environment is different
> when restoring, i.e. the original host might have had 8 CPUs, but the
> new host might only have 4 available, or vice versa.
>
> I know it is appealing to do something on the libvirt side, because
> it is quicker than getting an enhancement into a new QEMU release. We
> have been down this route before with the migration support in libvirt
> in the past though, when we introduced tunnelled live migration
> in order to work around QEMU's inability to do TLS encryption. I very
> much regret that we ever did this, because tunnelled migration was
> inherently limited, so for example it failed to work with multifd,
> and failed to work with NBD-based disk migration. In the end I did
> what I should have done at the beginning and just added TLS support
> to QEMU, making tunnelled migration obsolete, except we still have
> to carry the code around in libvirt indefinitely due to apps using
> it.
>
> So I'm very concerned about not having history repeat itself and
> give us a long-term burden for a solution that turns out to be an
> evolutionary dead end.
>
> I like the idea of parallel saving, but I really think we need to
> implement this directly in QEMU, not libvirt. As previously
> mentioned, I think QEMU needs to get a 'file' migration protocol,
> along with the ability to directly map RAM segments into fixed
> positions in the file. The benefits are many:
>
>  - It will save & restore faster because we're eliminating data
>    copies that libvirt imposes via the iohelper
>
>  - It is simple for libvirt & mgmt apps as we still only
>    have one file to manage
>
>  - It is space efficient because if a guest dirties a
>    memory page, we just overwrite the existing contents
>    at the fixed location in the file, instead of appending
>    new contents to the file
>
>  - It will restore faster too because we only restore
>    each memory page once, due to always overwriting the
>    file in-place when the guest dirtied a page during save
>
>  - It can save and restore with differing numbers of threads,
>    and can even dynamically change the number of threads
>    in the middle of the save/restore operation
>
> As David G has pointed out, the impl is not trivial on the QEMU
> side, but from what I understand of the migration code, it is
> certainly viable. Most importantly, I think it puts us in a
> better position for long-term feature enhancements later by
> taking the middle man (libvirt) out of the equation, letting
> QEMU directly know what medium it is saving/restoring to/from.
>
>
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
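To make the in-place overwrite benefit above concrete, here is a
minimal sketch of what mapping RAM segments to fixed positions in a
file could look like: each guest page's offset is a pure function of
its index, so a page that is dirtied again is simply rewritten in the
same slot. This is only an illustration under assumed values (4 KiB
pages, a one-page header, hypothetical helper names), not the actual
QEMU migration code.

/*
 * Hypothetical sketch: every RAM page has a fixed slot in the save
 * file, so re-dirtied pages are overwritten in place and the file
 * never grows beyond header + guest RAM size.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE   4096
#define HEADER_SIZE PAGE_SIZE          /* assumed fixed-size file header */

/* Offset of a guest page is a pure function of its index. */
static off_t page_offset(uint64_t page_index)
{
    return HEADER_SIZE + (off_t)page_index * PAGE_SIZE;
}

/* Write (or re-write) one guest page at its fixed slot in the file. */
static int save_page(int fd, uint64_t page_index, const void *data)
{
    ssize_t n = pwrite(fd, data, PAGE_SIZE, page_offset(page_index));
    return n == PAGE_SIZE ? 0 : -1;
}

int main(void)
{
    char page[PAGE_SIZE];
    int fd = open("mysave.ram", O_CREAT | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* First pass: page 42 is written once... */
    memset(page, 0x11, sizeof(page));
    save_page(fd, 42, page);

    /*
     * ...the guest dirties it again: the same slot is overwritten,
     * the file does not grow, and restore reads the page exactly once.
     */
    memset(page, 0x22, sizeof(page));
    save_page(fd, 42, page);

    close(fd);
    return 0;
}

With such a layout the number of reader or writer threads is a purely
runtime choice, independent of how many threads produced the file,
which matches the last bullet point above.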