Re: [PATCH RFC 0/9] qemu: Support mapped-ram migration capability

On 8/7/24 09:45, Daniel P. Berrangé wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
This series is a RFC for support of QEMU's mapped-ram migration
capability [1] for saving and restoring VMs. It implements the first
part of the design approach we discussed for supporting parallel
save/restore [2]. In summary, the approach is

1. Add mapped-ram migration capability
2. Steal an element from the save header 'unused' array for a 'features'
    variable and bump the save version to 3 (see the sketch after this list).
3. Add /etc/libvirt/qemu.conf knob for the save format version,
    defaulting to latest v3
4. Use v3 (aka mapped-ram) by default
5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2
6. include: Define constants for parallel save/restore
7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2
8. qemu: Add support for parallel restore. Implies mapped-ram.
    Reject if v2
9. tools: add parallel parameter to virsh save command
10. tools: add parallel parameter to virsh restore command
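
As a rough illustration of step 2, the v3 header change could look
something like the sketch below. The struct and field names are
assumptions for illustration only, not the actual definitions in the
libvirt sources.

    /* Illustrative sketch of a v3 save header that repurposes one
     * element of the reserved space as a 'features' bitmask. */
    #include <stdint.h>

    #define QEMU_SAVE_VERSION 3            /* bumped from 2 */

    typedef enum {
        QEMU_SAVE_FEATURE_MAPPED_RAM = 1 << 0,  /* image uses mapped-ram layout */
    } qemuSaveFeatures;

    struct qemuSaveHeaderV3 {
        char magic[16];        /* save image magic string */
        uint32_t version;      /* QEMU_SAVE_VERSION, i.e. 3 */
        uint32_t data_len;
        uint32_t was_running;
        uint32_t compressed;
        uint32_t features;     /* taken from the former 'unused' space */
        uint32_t unused[14];   /* remaining reserved space */
    };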

This series implements 1-5, with the BYPASS_CACHE support in patches 8
and 9 being quite hacky. They are included to discuss approaches to make
them less hacky. See the patches for details.

I'm a little on the fence about the user interface for the choice
of formats - whether the version number knob in qemu.conf is the
ideal approach or not.

The QEMU mapped-ram capability currently does not support directio.
Fabiano is working on that now [3]. This complicates merging support
in libvirt. I don't think it's reasonable to enable mapped-ram by
default when BYPASS_CACHE cannot be supported. Should we wait until
the mapped-ram directio support is merged in QEMU before supporting
mapped-ram in libvirt?

For the moment, compression is ignored in the new save version.
Currently, libvirt connects the output of QEMU's save stream to the
specified compression program via a pipe. This approach is incompatible
with mapped-ram since the fd provided to QEMU must be seekable. One
option is to reopen and compress the saved image after the actual save
operation has completed. This has the downside of requiring the iohelper
to handle BYPASS_CACHE, which would preclude us from removing it
sometime in the future. Other suggestions much welcomed.

Going back to the original motivation for mapped-ram: the first key
factor was that it makes it more viable to use multi-fd for
parallelized saving and restoring, as it lets threads write concurrently
without needing synchronization. The predictable worst-case file size
when the VM is live & dirtying memory was an added benefit.

With the streaming format, we can compress as we stream, so adding
compression burns more CPU, but this work is parallelized with the QEMU
saving, so the wallclock time doesn't increase as badly.

With mapped-ram, if we want to compress, we need to wait for the
image to be fully written, as we can be re-writing regions if the
guest is live. IOW, compression cannot be parallelized with the
file writing. So compression will add significant wallclock time,
as well as having the transient extra disk usage penalty.

We can optimize to avoid the extra disk penalty by using discard
on the intermediate file every time we read a chunk, e.g. with this
loop:

   while () {
        read 1 MB from save file
        discard 1MB from save file
        write 1 MB to compressor pipe
   }
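
A rough C equivalent of that loop, assuming the discard is done by
punching holes with fallocate() and that compress_fd is the write end
of a pipe to the compression program; function and variable names are
illustrative only:

    /* Sketch: stream a mapped-ram save file into a compressor pipe
     * while punching holes behind the read position, so the
     * intermediate file never keeps many allocated blocks. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int
    stream_and_discard(int save_fd, int compress_fd)
    {
        static char buf[1024 * 1024];   /* 1 MB chunks, as in the loop above */
        off_t off = 0;
        ssize_t got;

        while ((got = pread(save_fd, buf, sizeof(buf), off)) > 0) {
            if (write(compress_fd, buf, got) != got)
                return -1;              /* sketch: treat short writes as errors */

            /* Free the blocks just consumed; the logical file size is
             * preserved, only the allocation is dropped. */
            if (fallocate(save_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          off, got) < 0)
                return -1;

            off += got;
        }
        return got == 0 ? 0 : -1;
    }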

We can't avoid the wallclock penalty from compression, and as you
say, we still need the iohelper chained after the compressor to
get O_DIRECT.

Note that the logical file size of mapped-ram saved images is slightly
larger than the guest RAM size, so the files are often much larger than
those produced by the existing, sequential format. However, the actual
number of blocks written to disk is often lower with mapped-ram saved
images. E.g. a saved image from a 30G, freshly booted, idle guest results
in the following 'Size' and 'Blocks' values reported by stat(1):

                  Size         Blocks
sequential     998595770      1950392
mapped-ram     34368584225    1800456

With the same guest running a workload that dirties memory:

                  Size         Blocks
sequential     33173330615    64791672
mapped-ram     34368578210    64706944

I'm a little concerned this difference between sparse and non-sparse
formats could trip up existing applications that are using libvirt.

E.g. OpenStack uses the save/restore-to-file facilities and IIUC can
upload the saved file to its image storage (Glance). Experience tells
us that applications will often not handle sparseness correctly if
they have never been expecting it.


Overall I'm wondering if we need to give a direct choice to mgmt
apps.

We added the save/restore variants that accept virTypedParameters,
so we could define a VIR_DOMAIN_SAVE_PARAM_FORMAT, which accepts
'stream' and 'mapped' as options. This choice would then influence
whether we save in v2 or v3 format. On restore we don't need a
parameter, as we just probe the on-disk format.
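
A minimal sketch of how a management app could use such a parameter
with the existing virDomainSaveParams() API. VIR_DOMAIN_SAVE_PARAM_FORMAT
and the 'mapped' value are only the proposal above, so the sketch passes
them as plain strings:

    /* Illustrative only: request the mapped (v3) format via a typed
     * parameter.  The "format" key and "mapped" value are the names
     * proposed here, not something libvirt accepts today. */
    #include <libvirt/libvirt.h>

    static int
    save_mapped(virDomainPtr dom, const char *path)
    {
        virTypedParameterPtr params = NULL;
        int nparams = 0;
        int maxparams = 0;
        int ret = -1;

        if (virTypedParamsAddString(&params, &nparams, &maxparams,
                                    VIR_DOMAIN_SAVE_PARAM_FILE, path) < 0)
            goto cleanup;

        /* proposed VIR_DOMAIN_SAVE_PARAM_FORMAT: "stream" (v2) or "mapped" (v3) */
        if (virTypedParamsAddString(&params, &nparams, &maxparams,
                                    "format", "mapped") < 0)
            goto cleanup;

        ret = virDomainSaveParams(dom, params, nparams, 0);

     cleanup:
        virTypedParamsFree(params, nparams);
        return ret;
    }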

As a documentation task we can then say that compression is
incompatible with 'mapped'.

Annoyingly we already have a 'save_image_format' setting in qemu.conf,
though it takes 'raw', 'zstd', 'lzop', etc. to choose the compression
type. So we have a terminology clash.

Thinking about this more, and recalling your previous idea about abusing the header 'compressed' field [1], I'm beginning to warm to the idea of adding a new item to the list currently accepted by save_image_format. I'm also warming to Fabiano's suggestion to call the new item 'sparse' :-).

AFAICT, from a user perspective, save_image_format does not imply compression. The implication primarily stems from variable and function naming in the code :-). The current documentation of save_image_format may slightly imply compression, but we can easily improve that. The setting already accepts 'raw' for the existing (non-compressed) stream format. Any opinions on adding 'sparse' as another supported save_image_format?
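
To illustrate, the qemu.conf setting might then look something like
the below, where 'sparse' is only the proposed addition, not anything
libvirt accepts today:

    # "raw" (no compression), "zstd", "lzop", etc. select the existing
    # stream format; "sparse" is the proposed name for the mapped-ram
    # (v3) sparse format discussed here.
    #save_image_format = "sparse"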

Regards,
Jim

[1] https://lists.libvirt.org/archives/list/devel@xxxxxxxxxxxxxxxxx/message/PP3XCRF2DW4ZQC7NJUHNL4RHNDJ3PFKS/



