On Wed, May 11, 2022 at 01:47:13PM +0200, Claudio Fontana wrote:
> On 5/11/22 10:27 AM, Christophe Marie Francois Dupont de Dinechin wrote:
> >
> >
> >> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
> >>
> >> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >>> This is v8 of the multifd save prototype, which fixes a few bugs,
> >>> adds a few more code splits, and records the number of channels
> >>> as well as the compression algorithm, so the restore command is
> >>> more user-friendly.
> >>>
> >>> It is now possible to just say:
> >>>
> >>>   virsh save mydomain /mnt/saves/mysave --parallel
> >>>
> >>>   virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things work with the default of 2 channels, no compression.
> >>>
> >>> It is also possible to say of course:
> >>>
> >>>   virsh save mydomain /mnt/saves/mysave --parallel
> >>>     --parallel-connections 16 --parallel-compression zstd
> >>>
> >>>   virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things also work fine, due to channels and compression
> >>> being stored in the main save file.
> >>
> >> For the sake of people following along, the above commands will
> >> result in creation of multiple files
> >>
> >>   /mnt/saves/mysave
> >>   /mnt/saves/mysave.0
> >>   /mnt/saves/mysave.1
> >>   ....
> >>   /mnt/saves/mysave.n
> >>
> >> Where 'n' is the number of threads used.
> >>
> >> Overall I'm not very happy with the approach of doing any of this
> >> on the libvirt side.
> >>
> >> Backing up, we know that QEMU can directly save to disk faster than
> >> libvirt can. We mitigated a lot of that overhead with previous patches
> >> to increase the pipe buffer size, but some still remains due to the
> >> extra copies inherent in handing this off to libvirt.
> >>
> >> Using multifd on the libvirt side, IIUC, gets us better performance
> >> than QEMU can manage if doing non-multifd write to file directly,
> >> but we still have the extra copies in there due to the hand off
> >> to libvirt. If QEMU were to be directly capable of writing to
> >> disk with multifd, it should beat us again.
> >>
> >> As a result of how we integrate with QEMU multifd, we're taking the
> >> approach of saving the state across multiple files, because it is
> >> easier than trying to get multiple threads writing to the same file.
> >> It could be solved by using file range locking on the save file.
> >> eg a thread can reserve say 500 MB of space, fill it up, and then
> >> reserve another 500 MB, etc, etc. It is a bit tedious though and
> >> won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
> >> bytes of QEMU RAM save state header.
> >
> I am not familiar enough to know if this approach would work with multifd without breaking
> the existing format, maybe David could answer this.
>
> >
> > First, I do not understand why you would write things that are
> > not page-aligned to start with? (As an aside, I don’t know
> > how any dirty tracking would work if you do not keep things
> > page-aligned).
>
> Yes, alignment is one issue I encountered, and one that in my view would _still_ need to be solved,
> _whatever_ we put inside QEMU in the future,
> as it also breaks any attempt to be more efficient (using alternative APIs to read/write etc),
>
> and is the reason why iohelper is still needed in my patchset at all for the main file, causing one extra copy for the main channel.
>
> The libvirt header, including metadata, domain xml etc, that wraps the QEMU VM ends at an arbitrary address, e.g.:
>
> 00000000: 4c69 6276 6972 7451 656d 7564 5361 7665  LibvirtQemudSave
> 00000010: 0300 0000 5b13 0100 0100 0000 0000 0000  ....[...........
> 00000020: 3613 0000 0200 0000 0000 0000 0000 0000  6...............
> 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000050: 0000 0000 0000 0000 0000 0000 3c64 6f6d  ............<dom
> 00000060: 6169 6e20 7479 7065 3d27 6b76 6d27 3e0a  ain type='kvm'>.
>
> 000113a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000113b0: 0000 0000 0000 0051 4556 4d00 0000 0307  .......QEVM.....
> 000113c0: 0000 000d 7063 2d69 3434 3066 782d 362e  ....pc-i440fx-6.
> 000113d0: 3201 0000 0003 0372 616d 0000 0000 0000  2......ram......
> 000113e0: 0004 0000 0008 c00c 2004 0670 632e 7261  ........ ..pc.ra
> 000113f0: 6d00 0000 08c0 0000 0014 2f72 6f6d 4065  m........./rom@e
> 00011400: 7463 2f61 6370 692f 7461 626c 6573 0000  tc/acpi/tables..
> 00011410: 0000 0002 0000 0770 632e 6269 6f73 0000  .......pc.bios..
> 00011420: 0000 0004 0000 1f30 3030 303a 3030 3a30  .......0000:00:0
> 00011430: 322e 302f 7669 7274 696f 2d6e 6574 2d70  2.0/virtio-net-p
> 00011440: 6369 2e72 6f6d 0000 0000 0004 0000 0670  ci.rom.........p
> 00011450: 632e 726f 6d00 0000 0000 0200 0015 2f72  c.rom........./r
> 00011460: 6f6d 4065 7463 2f74 6162 6c65 2d6c 6f61  om@etc/table-loa
> 00011470: 6465 7200 0000 0000 0010 0012 2f72 6f6d  der........./rom
> 00011480: 4065 7463 2f61 6370 692f 7273 6470 0000  @etc/acpi/rsdp..
> 00011490: 0000 0000 1000 0000 0000 0000 0010 7e00  ..............~.
> 000114a0: 0000 0302 0000 0003 0000 0000 0000 2002  .............. .
> 000114b0: 0670 632e 7261 6d00 0000 0000 0000 3022  .pc.ram.......0"
>
>
> in my view at the minimum we have to start by adding enough padding before starting the QEMU VM (QEVM magic)
> to be at a page-aligned address.
>
> I would add one patch to this effect to my prototype, as this should not be very controversial I think.

We already add padding before the QEMU migration stream begins, but
we're just doing a fixed 64kb. The intent was to allow us to edit
the embedded XML. It could easily round this up to a sensible
boundary if needed.
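
To make the rounding concrete, here is a minimal sketch of the arithmetic, not libvirt code. It assumes a 4 KiB page size, and the helper names (align_up, padding_needed) are hypothetical. The only input taken from this thread is the offset of the QEVM magic in the dump quoted above, 0x113b7.

```python
# Sketch only: rounds a stream start offset up to an alignment boundary.
# PAGE_SIZE is an assumption (4 KiB); huge pages would need a larger value.
PAGE_SIZE = 4096


def align_up(offset: int, alignment: int = PAGE_SIZE) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def padding_needed(offset: int, alignment: int = PAGE_SIZE) -> int:
    """Number of pad bytes to insert so the next byte lands on a boundary."""
    return align_up(offset, alignment) - offset


if __name__ == "__main__":
    # Offset of the QEVM magic in the hexdump quoted above.
    qevm_offset = 0x113B7
    print(hex(align_up(qevm_offset)), padding_needed(qevm_offset))
```

For that dump, the stream would move from 0x113b7 to 0x12000, which costs 3145 bytes of extra padding.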

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|