Re: [libvirt RFC] add API for parallel Saves (not for committing)

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Fri, 22 Apr 2022 13:17:56 +0100

On Fri, Apr 22, 2022 at 01:40:20PM +0200, Claudio Fontana wrote:
> On 4/22/22 10:19 AM, Daniel P. Berrangé wrote:
> > On Thu, Apr 21, 2022 at 08:06:40PM +0200, Claudio Fontana wrote:
> >> On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
> >>> On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
> >>>> RFC, starting point for discussion.
> >>>>
> >>>> Sketch API changes to allow parallel Saves, and open up
> >>>> an implementation for QEMU to leverage multifd migration to files,
> >>>> with optional multifd compression.
> >>>>
> >>>> This allows improving save times for huge VMs.
> >>>>
> >>>> The idea is to issue commands like:
> >>>>
> >>>> virsh save domain /path/savevm --parallel --parallel-connections 2
> >>>>
> >>>> and have libvirt start a multifd migration to:
> >>>>
> >>>> /path/savevm   : main migration connection
> >>>> /path/savevm.1 : multifd channel 1
> >>>> /path/savevm.2 : multifd channel 2
> >>>
> >>> At a conceptual level the idea would be to still have a single file,
> >>> but have threads writing to different regions of it. I don't think
> >>> that's possible with multifd though, as it doesn't partition RAM
> >>> up between threads, it just hands out pages on demand. So if one
> >>> thread happens to be quicker it'll send more RAM than another
> >>> thread. Also we're basically capturing the migration RAM, and the
> >>> multifd channels have control info, in addition to the RAM pages.
> >>>
> >>> That makes me wonder actually, are the multifd streams unidirectional
> >>> or bidirectional? Our save-to-file logic relies on the streams
> >>> being unidirectional.
> >>
> >>
> >> Unidirectional. In the meantime I completed an actual libvirt prototype that works (only did the save part, not the restore yet).
> >>
> >>
> >>>
> >>> You've got me thinking, however, whether we can take QEMU out of
> >>> the loop entirely for saving RAM.
> >>>
> >>> IIUC with 'x-ignore-shared' migration capability QEMU will skip
> >>> saving of RAM region entirely (well technically any region marked
> >>> as 'shared', which I guess can cover more things). 
> >>
> >> Heh I have no idea about this.
> >>
> >>>
> >>> If the QEMU process is configured with a file backed shared
> >>> memory, or memfd, I wonder if we can take advantage of this.
> >>> eg
> >>>
> >>>   1. pause the VM
> >>>   2. write the libvirt header to save.img
> >>>   3. sendfile(qemus-memfd, save.img-fd)  to copy the entire
> >>>      RAM after header
> >>
> >> I don't understand this point very much... if the ram is already
> >> backed by file why are we sending this again..?
> > 
> > It is a file pointing to hugepagefs or tmpfs. It is still actually
> > RAM, but we exposed it to QEMU via a file, which QEMU then mmap'd.
> > 
> > We don't do this by default, but anyone with large (many GB) VMs
> > is increasingly likely to be relying on huge pages to optimize
> > their VM performance.
> 
> From what I could observe, I'd say it depends on the specific scenario,
> how much memory we have to work with, the general compromise between cpu, memory, disk, ... all of which is subject to cost optimization.
> 
> > 
> > In our current save scheme we have (at least) 2 copies going
> > on. QEMU copies from RAM into the FD it uses for migrate.
> > libvirt IO helper copies from the FD into the file. This involves
> > multiple threads and multiple userspace/kernel switches and data
> > copies.  You've been trying to eliminate the 2nd copy in userspace.
> 
> I've been trying to eliminate the 2nd copy in userspace, but this is just aspect 1) of what I have in mind.
> It is good but gives only so much, and for huge VMs things fall apart when we hit the file cache thrashing problem.

Agreed.

> Aspect 2) in my mind is the file cache thrashing that the kernel gets into, which I think is the reason we need O_DIRECT at all with huge VMs.
> It creates a lot of complications (ie we are kinda forced to have a helper anyway to ensure block-aligned source and destination addresses and length),
> and suboptimal performance.

Right, we can eliminate the second copy, or we can eliminate cache
thrashing, but not both.

> Aspect 3) is a practical solution that I already prototyped and yields very good results in practice,
> which is to make better use of the resources we have, since we have a certain number of cpus assigned to run VMs,
> and the save/restore operations we need happen with a suspended guest, so we can exploit this to put those cpus to good use,
> and reduce the problem size by leveraging multifd and compression which comes for free from qemu.
> 
> I think that as long as the file cache issue remains unsolved, we are stuck with O_DIRECT, so we are stuck with a helper,
> and at that point we can easily have a
> 
> multifd-helper
> 
> that reuses the code from iohelper, and performs O_DIRECT writes of the compressed streams to multiple files in parallel.

I'm worried that we could be taking ourselves down a dead-end by
trying to optimize on the libvirt side, because we've got a
mismatch between the QMP APIs we're using and the intent of
QEMU.

The QEMU migration APIs were designed around streaming to a
remote instance, and we're essentially playing games to use
them as a way to write to local storage.

The RAM pages we're saving are of course page aligned in QEMU
because they are mapped RAM. We lose (throw away) the page
alignment because we're sending them over an FD, potentially
adding metadata headers to each one to identify which location
the RAM block came from.

QEMU has APIs for doing async I/O to local storage using
O_DIRECT, via the BlockDev layer. QEMU can even use this
for saving state via the loadvm/savevm monitor commands
for internal snapshots. This is not accessible via the
normal migration QMP command though.

I feel that to give ourselves the best chance of optimizing the
save/restore, we need to get QEMU to have full knowledge of
what is going on, and get libvirt out of the picture almost
entirely.

If QEMU knows that the migration source/target is a random
access file, rather than a stream, then it will not have
to attach any headers to identify RAM pages. It can just
read/write them directly at a fixed offset in the file.
It can even do this while the CPU is running, just overwriting
the previously written page on disk if the contents changed.

This would mean the save image is a fixed size exactly
matching the RAM size, plus libvirt header and vmstate.
Right now if we save a live snapshot, the save image can
be almost arbitrarily large, since we'll save the same
RAM page over & over again if the VM is modifying the
content.
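
To make the fixed-offset idea concrete, here is a minimal Python sketch
(not QEMU code, just an illustration of the layout). PAGE_SIZE,
HEADER_SIZE, page_offset(), write_page() and save_ram() are all
hypothetical names, and the 4 KiB header reservation is an arbitrary
assumption standing in for "libvirt header and vmstate". Each page's
file offset depends only on its index, so several threads can write
concurrently and a dirtied page is simply overwritten in place.

```python
import os
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096    # assumed guest page size
HEADER_SIZE = 4096  # hypothetical space reserved for header + vmstate


def page_offset(page_index):
    """File offset of a page: fixed, derived only from the page index."""
    return HEADER_SIZE + page_index * PAGE_SIZE


def write_page(fd, page_index, data):
    """Write one page at its fixed offset.

    os.pwrite() takes an explicit offset and does not touch the shared
    file position, so concurrent writers do not interfere.
    """
    if len(data) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    written = os.pwrite(fd, data, page_offset(page_index))
    if written != PAGE_SIZE:
        raise OSError("short write")


def save_ram(path, pages, workers=4):
    """Write all pages to one file using several threads."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Threads take pages in whatever order they get to them;
            # the resulting file layout is the same regardless.
            list(pool.map(lambda item: write_page(fd, *item),
                          enumerate(pages)))
    finally:
        os.close(fd)
```

Calling write_page() again for an index that was already saved replaces
the old contents, so the file stays at HEADER_SIZE plus the RAM size no
matter how often a page is dirtied.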

I think we need to introduce an explicit 'file:' protocol
for the migrate command, that is backed by the blockdev APIs
so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
protocol, we need to be able to tell QEMU whether the 'fd'
is a stream or a regular file, so it can choose between the
regular send/recv APIs, vs the Blockdev APIs (maybe we can
auto-detect with fstat()).  If we do this, then multifd
doesn't end up needing multiple save files on disk, all
the threads can be directly writing to the same file, just
at the relevant offsets on disk to match the RAM page
location.
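
As a rough idea of what the fstat() auto-detection could look like,
here is a small Python sketch. fd_kind() and its return strings are
made up for illustration; this only shows the check itself, not how
QEMU would actually wire it up.

```python
import os
import stat


def fd_kind(fd):
    """Classify an FD as a random-access file or a stream via fstat()."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        # Regular file: seekable, so offset-based I/O is possible.
        return "file"
    if stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode):
        # Pipe or socket: only sequential send/recv style I/O.
        return "stream"
    # Block/char devices etc. would need their own policy.
    return "other"
```

A pipe or socket passed in for 'fd:' would then keep the existing
streaming behaviour, while a regular file could take the offset-based
path.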

> > If we take advantage of the scenario where QEMU RAM is backed by a
> > tmpfs/hugepagefs file, we can potentially eliminate both copies
> > in userspace. The kernel can be told to copy directly from the
> > hugepagefs file into the disk file.
> 
> Interesting. We still incur the file cache thrashing as we write though, right?

I'm not sure, to be honest. I struggle to find docs about whether
sendfile is compatible with an FD opened with O_DIRECT.
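
For reference, the header-then-sendfile sequence would look roughly
like the Python sketch below (Linux only). copy_ram_after_header() and
its parameters are hypothetical names; ram_fd stands in for the FD of
QEMU's file-backed RAM. The output FD here is opened without O_DIRECT,
so this does not answer the compatibility question, it only shows the
shape of the copy.

```python
import os


def copy_ram_after_header(ram_fd, out_fd, header, ram_size):
    """Write header to out_fd, then let the kernel copy the RAM contents.

    Returns the number of RAM bytes copied.
    """
    # Write the header first; loop in case of short writes.
    view = memoryview(header)
    while view:
        n = os.write(out_fd, view)
        view = view[n:]

    # sendfile() reads from ram_fd at the given offset and writes at
    # out_fd's current position, i.e. right after the header.
    offset = 0
    while offset < ram_size:
        sent = os.sendfile(out_fd, ram_fd, offset, ram_size - offset)
        if sent == 0:
            break  # source ended early
        offset += sent
    return offset
```

No RAM data passes through a userspace buffer here; whether the page
cache is still involved on the output side is exactly the open
question.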

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|