Re: [libvirt RFC] add API for parallel Saves (not for committing)

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Mon, 25 Apr 2022 12:15:03 +0100

On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@xxxxxxxxxx) wrote:
> > I'm worried that we could be taking ourselves down a dead-end by
> > trying to optimize on the libvirt side, because we've got a
> > mismatch between the QMP APIs we're using and the intent of
> > QEMU.
> > 
> > The QEMU migration APIs were designed around streaming to a
> > remote instance, and we're essentially playing games to use
> > them as a way to write to local storage.
> 
> Yes.
> 
> > The RAM pages we're saving are of course page aligned in QEMU
> > because they are mapped RAM. We lose/throw away the page
> > alignment because we're sending them over an FD, potentially
> > adding metadata headers to each one to identify which location
> > the RAM block came from.
> > 
> > QEMU has APIs for doing async I/O to local storage using
> > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > for saving state via the loadvm/savevm monitor commands
> > for internal snapshots. This is not accessible via the
> > normal migration QMP command though.
> > 
> > 
> > I feel that to give ourselves the best chance of optimizing the
> > save/restore, we need to get QEMU to have full knowledge of
> > what is going on, and get libvirt out of the picture almost
> > entirely.
> > 
> > If QEMU knows that the migration source/target is a random
> > access file, rather than a stream, then it will not have
> > to attach any headers to identify RAM pages. It can just
> > read/write them directly at a fixed offset in the file.
> > It can even do this while the CPU is running, just overwriting
> > the previously written page on disk if the contents changed.
> > 
> > This would mean the save image is a fixed size exactly
> > matching the RAM size, plus libvirt header and vmstate.
> > Right now if we save a live snapshot, the save image can
> > be almost arbitrarily large, since we'll save the same
> > RAM page over & over again if the VM is modifying the
> > content.
> > 
> > I think we need to introduce an explicit 'file:' protocol
> > for the migrate command, that is backed by the blockdev APIs
> > so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> > protocol, we need to be able to tell QEMU whether the 'fd'
> > is a stream or a regular file, so it can choose between the
> > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > auto-detect with fstat()).  If we do this, then multifd
> > doesn't end up needing multiple save files on disk, all
> > the threads can be directly writing to the same file, just
> > at the relevant offsets on disk to match the RAM page
> > location.
> 
> Hmm so what I'm not sure of is whether it makes sense to use the normal
> migration flow/code for this or not; and you're suggesting a few
> possibly contradictory things.
> 
> Adding a file: protocol would be pretty easy (whether it went via
> the blockdev layer or not); getting it to be more efficient is the
> tricky part, because we've got loads of levels of stream abstraction in
> the RAM save code:
>     QEMUFile->channel->OS
> but then if you want to enforce alignment you somehow have to make that
> go all the way down.

The QIOChannel stuff doesn't add buffering, so I wasn't worried
about alignment there.

QEMUFile has optional buffering which would mess with alignment,
but we could turn that off potentially for the RAM transfer, if
using multifd.
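
To illustrate why alignment survives once nothing is buffered or
interleaved, here is a rough Python sketch (not QEMU code). The helper
names, the header length and the "header first, then RAM" layout are all
assumptions made up for the example: if the RAM region starts on a page
boundary and no per-page headers are inserted, every page offset stays
page aligned.

```python
import mmap

# Assumption: guest page size equals the host page size.
PAGE_SIZE = mmap.PAGESIZE


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def ram_page_offset(ram_region_start: int, page_index: int,
                    page_size: int = PAGE_SIZE) -> int:
    """File offset of a RAM page when no per-page header is interleaved."""
    return ram_region_start + page_index * page_size


# Hypothetical layout: a variable-length header, padded so RAM starts aligned.
header_len = 1234
ram_start = align_up(header_len, PAGE_SIZE)
offsets = [ram_page_offset(ram_start, i) for i in range(4)]
```

With a stream format, each header shifts everything after it, so this
property is lost as soon as a single header is written between pages.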

I'm confident the performance on the QEMU side though could
exceed what's viable with libvirt's iohelper today, as we
would definitely be eliminating 1 copy and many context switches.

> If you weren't doing it live then you could come up with a mode
> that just did one big fat write(2) for each RAM Block; and frankly just
> sidestepped the entire rest of the RAM migration code.
> But then you're suggesting being able to do it live writing it into a
> fixed place on disk; which says that you have to change the (already
> complicated) RAM migration code rather than sidestepping it.

Yeah, we need "live" for the live snapshot - which fits in with
the previously discussed goal of turning the 'savevm/snapshot-save'
HMP/QMP impls into a facade around 'migrate' + 'block-copy' QMP
commands.
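
For what it's worth, the fixed-offset idea quoted earlier can be sketched
in a few lines of Python (illustration only, not a proposal for the actual
QEMU code; the function names, the 4 KiB page size and the worker count
are assumptions). It uses fstat() to tell a regular file from a stream,
has several threads pwrite() pages into the same file, and rewrites a
dirtied page in place so the image size does not grow.

```python
import os
import stat
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096  # assumption, for illustration only


def is_random_access(fd: int) -> bool:
    """True if fd is a regular file, False for pipes, sockets, etc."""
    return stat.S_ISREG(os.fstat(fd).st_mode)


def write_page(fd: int, page_index: int, data: bytes) -> None:
    """Write one page at its fixed offset, retrying on short writes."""
    offset = page_index * PAGE_SIZE
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def save_pages(fd: int, pages, workers: int = 4) -> None:
    """Write (index, data) pairs from several threads into the same file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda item: write_page(fd, *item), pages))


with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    regular = is_random_access(fd)
    num_pages = 8
    save_pages(fd, [(i, bytes([i]) * PAGE_SIZE) for i in range(num_pages)])
    # Page 3 was modified while saving: overwrite it at the same offset.
    save_pages(fd, [(3, b"\xff" * PAGE_SIZE)])
    image_size = os.fstat(fd).st_size
    page3 = os.pread(fd, PAGE_SIZE, 3 * PAGE_SIZE)

# A pipe is a stream, so the fixed-offset path would not apply to it.
read_end, write_end = os.pipe()
pipe_is_random_access = is_random_access(read_end)
os.close(read_end)
os.close(write_end)
```

The sketch ignores O_DIRECT, the libvirt header and vmstate entirely; it
only shows that the RAM part of the image stays at a fixed size however
often a page is re-saved.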

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|