* Daniel P. Berrangé (berrange@xxxxxxxxxx) wrote:
> On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@xxxxxxxxxx) wrote:
> > > I'm worried that we could be taking ourselves down a dead-end by
> > > trying to optimize on the libvirt side, because we've got a
> > > mismatch between the QMP APIs we're using and the intent of
> > > QEMU.
> > >
> > > The QEMU migration APIs were designed around streaming to a
> > > remote instance, and we're essentially playing games to use
> > > them as a way to write to local storage.
> >
> > Yes.
> >
> > > The RAM pages we're saving are of course page aligned in QEMU
> > > because they are mapped RAM. We lose/throw away the page
> > > alignment because we're sending them over a FD, potentially
> > > adding in metadata headers to identify which location in the
> > > RAM block each page came from.
> > >
> > > QEMU has APIs for doing async I/O to local storage using
> > > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > > for saving state via the loadvm/savevm monitor commands
> > > for internal snapshots. This is not accessible via the
> > > normal migration QMP command though.
> > >
> > > I feel that to give ourselves the best chance of optimizing the
> > > save/restore, we need to get QEMU to have full knowledge of
> > > what is going on, and get libvirt out of the picture almost
> > > entirely.
> > >
> > > If QEMU knows that the migration source/target is a random
> > > access file, rather than a stream, then it will not have
> > > to attach any headers to identify RAM pages. It can just
> > > read/write them directly at a fixed offset in the file.
> > > It can even do this while the CPU is running, just overwriting
> > > the previously written page on disk if the contents changed.
> > >
> > > This would mean the save image is a fixed size exactly
> > > matching the RAM size, plus the libvirt header and vmstate.
> > > Right now if we save a live snapshot, the save image can
> > > be almost arbitrarily large, since we'll save the same
> > > RAM page over & over again if the VM is modifying the
> > > content.
> > >
> > > I think we need to introduce an explicit 'file:' protocol
> > > for the migrate command, that is backed by the blockdev APIs
> > > so it can do O_DIRECT and non-blocking AIO. For the 'fd:'
> > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > is a stream or a regular file, so it can choose between the
> > > regular send/recv APIs vs the blockdev APIs (maybe we can
> > > auto-detect with fstat()). If we do this, then multifd
> > > doesn't end up needing multiple save files on disk: all
> > > the threads can be directly writing to the same file, just
> > > at the relevant offsets on disk to match the RAM page
> > > location.
> >
> > Hmm, so what I'm not sure of is whether it makes sense to use the
> > normal migration flow/code for this or not; and you're suggesting a
> > few possibly contradictory things.
> >
> > Adding a file: protocol would be pretty easy (whether it went via
> > the blockdev layer or not); getting it to be more efficient is the
> > tricky part, because we've got loads of levels of stream abstraction
> > in the RAM save code:
> >   QEMUFile->channel->OS
> > but then if you want to enforce alignment you somehow have to make
> > that go all the way down.
>
> The QIOChannel stuff doesn't add buffering, so I wasn't worried
> about alignment there.
>
> QEMUFile has optional buffering which would mess with alignment,
> but we could turn that off potentially for the RAM transfer, if
> using multifd.

The problem isn't whether they add buffering or not; the problem is
that you now need a mechanism to ask for that alignment.

> I'm confident the performance on the QEMU side though could
> exceed what's viable with libvirt's iohelper today, as we
> would definitely be eliminating 1 copy and many context switches.

Yes, but you get that just from adding a simple file: (or fd:) mode,
without trying to do anything clever with alignment or rewriting the
same offset.
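To make the fixed-offset idea concrete, here is a minimal sketch; it is
not QEMU code, and ramblock_file_offset() plus the O_DIRECT-opened
save_fd are assumptions for illustration only. Every RAM page has a
fixed, page-aligned location in the save file, so a page that is
dirtied again just overwrites its previous copy in place, and multifd
threads can share one fd because pwrite() carries its own offset:

  #include <stdint.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define SAVE_PAGE_SIZE 4096

  /* Hypothetical helper: fixed location of a RAM block in the save file. */
  extern off_t ramblock_file_offset(const char *block_name);

  static int save_page(int save_fd, const char *block_name,
                       uint64_t page_index, const void *host_page)
  {
      off_t off = ramblock_file_offset(block_name)
                  + (off_t)page_index * SAVE_PAGE_SIZE;

      /* host_page is mapped guest RAM, so it is already page aligned,
       * and the file offset is a multiple of the page size; that is
       * what lets this work on an O_DIRECT fd with no bounce buffer
       * and no stream framing.  pwrite() is safe to call from several
       * threads on the same fd since each call names its own offset.
       */
      ssize_t n = pwrite(save_fd, host_page, SAVE_PAGE_SIZE, off);
      return n == SAVE_PAGE_SIZE ? 0 : -1;
  }

The fd here would have been opened with O_DIRECT (roughly
open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600)), which is exactly why
the alignment has to survive all the way down the
QEMUFile->channel->OS stack instead of being lost to stream framing.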
> > If you weren't doing it live then you could come up with a mode
> > that just did one big fat write(2) for each RAM Block; and frankly
> > just sidestepped the entire rest of the RAM migration code.
> > But then you're suggesting being able to do it live, writing it
> > into a fixed place on disk, which says that you have to change the
> > (already complicated) RAM migration code rather than sidestepping it.
>
> Yeah, we need "live" for the live snapshot - which fits in with
> the previously discussed goal of turning the 'savevm/snapshot-save'
> HMP/QMP impls into a facade around 'migrate' + 'block-copy' QMP
> commands.

Dave

> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
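For contrast, the non-live mode mentioned above (one big write(2) per
RAM block, sidestepping the per-page RAM migration machinery) could
look roughly like the following sketch; the ram_block_desc layout and
the fixed file offsets are hypothetical, not an existing QEMU
structure:

  #include <sys/types.h>
  #include <unistd.h>

  /* Hypothetical descriptor: where one RAM block lives in guest memory
   * and at which fixed offset it is stored in the save file. */
  struct ram_block_desc {
      const void *host_addr;    /* start of the mapped RAM block */
      size_t      length;       /* block length in bytes */
      off_t       file_offset;  /* fixed location in the save file */
  };

  /* Dump one whole RAM block with large sequential writes.  Only valid
   * when the guest is paused, since nothing re-dirties the pages. */
  static int dump_block(int save_fd, const struct ram_block_desc *b)
  {
      size_t done = 0;

      while (done < b->length) {
          /* pwrite() may write less than requested, so loop. */
          ssize_t n = pwrite(save_fd, (const char *)b->host_addr + done,
                             b->length - done, b->file_offset + done);
          if (n < 0) {
              return -1;
          }
          done += (size_t)n;
      }
      return 0;
  }

The trade-off is the one raised in the thread: this only works when the
guest is stopped, so it does not help the live-snapshot case, which
needs the in-place-overwrite approach instead.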