On 4/26/24 4:04 AM, Daniel P. Berrangé wrote:
On Wed, Apr 17, 2024 at 05:12:27PM -0600, Jim Fehlig via Devel wrote:
A good starting point on this journey is supporting the new mapped-ram
capability in qemu 9.0 [2]. Since mapped-ram is a new on-disk format, I
assume we'll need a new QEMU_SAVE_VERSION 3 when using it? Otherwise I'm not
sure how to detect if a saved image is in mapped-ram format vs the existing,
sequential stream format.
Yes, we'll need to support 'mapped-ram', so it's a good first step.
A question is whether we make that feature mandatory for all save images,
implied by another feature (parallel save), or a directly controllable
opt-in feature.
It feels more like an implementation detail.
The former breaks back compat with existing libvirt, while the latter two
options are net new, so they don't have compat implications.
In terms of actual data blocks written on disk, mapped-ram should be the
same size as, or smaller than, the existing format.
In terms of logical file size, however, mapped-ram will almost always be
larger.
Correct. E.g. from a mostly idle 8G VM:
# stat existing-format.sav
  Size: 510046983    Blocks: 996192   IO Block: 4096   regular file
# stat mapped-ram-format.sav
  Size: 8597730739   Blocks: 956200   IO Block: 4096   regular file
The upside is that mapped-ram is bounded, unlike the existing stream, which
can result in actual file sizes much greater than RAM size when the VM runs a
memory-intensive workload.
This is because mapped-ram will result in a file whose logical size matches
the guest RAM size, plus some header overhead, while being sparse so not
all blocks are written.
If tools handling save images aren't sparse-aware this could come across
as a surprise and even be considered a regression.
Yes, I already had visions of the phone ringing off the hook asking "why are
my save images suddenly huge?". But maybe it's tolerable once they look at the
actual blocks used, and when combined with parallel saves they could also be
asking "why are saves suddenly so fast?" :-).
Mapped ram is needed for parallel saves since it lets each thread write
to a specific region of the file.
Mapped ram is good for non-parallel saves too though, because the mapping
of RAM into the file is aligned suitably to allow for O_DIRECT to be used.
Currently libvirt has to tunnel over its iohelper to futz with the alignment
needed for O_DIRECT. This makes mapped-ram desirable to use in general, but
back compat hurts...
My POC avoids the use of iohelper with mapped-ram. It provides qemu with two fds
when direct-io has been requested, one opened with O_DIRECT, one without.
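For illustration, the two-fd arrangement is roughly the following (a sketch
under my own naming, not the actual patches; both fds would be added to the
same fdset via QMP 'add-fd' and the migration pointed at file:/dev/fdset/NNN,
so QEMU can pick the fd whose open flags match a given channel's needs, the
O_DIRECT one for the aligned RAM writes and the plain one for the small,
unaligned header/device-state writes):

  /* Sketch only (not the actual POC code): open the save file twice so
   * QEMU can choose per channel. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  static int
  open_save_fds(const char *path, int *buf_fd, int *direct_fd)
  {
      if ((*buf_fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
          return -1;
      if ((*direct_fd = open(path, O_WRONLY | O_DIRECT)) < 0) {
          close(*buf_fd);
          *buf_fd = -1;
          return -1;
      }
      return 0;
  }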
Looking at what we did in the past: the first time, we stole an element
from 'uint32_t unused[..]' in the save header to add the 'compressed'
field, and bumped the version. This prevented old libvirt from reading
the files, which was needed because adding compression was a
non-backwards-compatible change. We could have carried on using version 1
for non-compressed images, but we didn't for some reason. It was a hard
compat break.
The next time, we stole an element from 'uint32_t unused[..]' in the
save header to add the 'cookie_len' field, but did NOT bump
the version. 'unused' is always all zeroes, so new libvirt could
detect whether the cookie was present by the len being non-zero.
Old libvirt would still load the image, but would ignore
the cookie data. This was largely harmless.
This time, mapped-ram is an incompatible change, so we need to
ensure old libvirt won't try to read the files. That suggests
either a save version bump, or abusing the 'compressed'
field to indicate 'mapped-ram' as a form of compression.
If we did a save version bump, we might want to carry on using
v2 for images that don't use mapped-ram.
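For anyone following along, the header layout in question is roughly this
(field names approximate and from memory; the real definition lives in
libvirt's save image code):

  #include <stdint.h>

  /* Approximate sketch of the libvirt save image header; sizes and names
   * are illustrative rather than copied from the source tree. */
  struct save_header_sketch {
      char magic[16];        /* identifies a libvirt-qemu save image */
      uint32_t version;      /* bumped 1 -> 2 when 'compressed' arrived */
      uint32_t data_len;
      uint32_t was_running;
      uint32_t compressed;   /* taken from unused[], with a version bump */
      uint32_t cookie_len;   /* taken from unused[], no bump: zero == absent */
      uint32_t unused[14];   /* always all zeroes in images written so far */
  };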
IIUC, mapped-ram cannot be used with the existing 'fd:' migration URI and
instead must use 'file:'. Does qemu advertise support for that? I couldn't
find it. If not, 'file:' (available in qemu 8.2) predates mapped-ram, so in
theory we could live without the advertisement.
'mapped-ram' is reported in QMP as a MigrationCapability, so I think we
can probe for it directly.
Yes, mapped-ram is reported. Sorry for not being clear, but I was asking
whether qemu advertises support for the 'file:' migration URI it gained in
8.2. Probably not a problem either way since it predates mapped-ram.
Yes, it is exclusively for use with the 'file:' protocol. If we want to use
FD passing, then we can still do that with 'file:', by using the generic
/dev/fdset/NNN approach QEMU already has for block devices.
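For reference, the QMP sequence would look roughly like the following
(illustrative only; the fd itself travels out-of-band over the monitor
socket with SCM_RIGHTS, and the offset value here is made up):

  {"execute": "query-migrate-capabilities"}
      ... reply lists {"capability": "mapped-ram", "state": false}
  {"execute": "migrate-set-capabilities",
   "arguments": {"capabilities": [{"capability": "mapped-ram", "state": true}]}}
  {"execute": "add-fd", "arguments": {"fdset-id": 1}}
  {"execute": "migrate", "arguments": {"uri": "file:/dev/fdset/1,offset=4096"}}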
It's also not clear when we want to enable the mapped-ram capability. Should
it always be enabled if supported by the underlying qemu? One motivation for
creating mapped-ram was to support direct-io of the migration stream in
qemu, in which case it could be tied to VIR_DOMAIN_SAVE_BYPASS_CACHE. E.g.
the mapped-ram capability is enabled when user specifies
VIR_DOMAIN_SAVE_BYPASS_CACHE && user-provided path results in a seekable fd
&& qemu supports mapped-ram?
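In pseudo-code, the opt-in condition I have in mind is roughly this (function
and variable names hypothetical, not existing libvirt code):

  #include <stdbool.h>
  #include <unistd.h>

  /* Hypothetical sketch of the proposed opt-in logic. */
  static bool
  should_use_mapped_ram(bool bypass_cache_requested, int dest_fd,
                        bool qemu_supports_mapped_ram)
  {
      if (!qemu_supports_mapped_ram)
          return false;

      /* mapped-ram needs a seekable destination ('file:' protocol),
       * so a pipe or socket fd rules it out. */
      if (lseek(dest_fd, 0, SEEK_CUR) == (off_t)-1)
          return false;

      /* Only engage it when the caller asked to bypass the page cache,
       * since O_DIRECT support was a key motivation for mapped-ram. */
      return bypass_cache_requested;
  }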
One option is to be lazy and have a /etc/libvirt/qemu.conf setting for the
save format version, defaulting to the latest, v3. Note in the release notes
that admin/host provisioning apps must set it to v2 if back compat with old
libvirt is needed. If we assume new -> old save image loading
is relatively rare, that's probably good enough.
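For example, something along these lines in qemu.conf (the setting name is
purely illustrative):

  # /etc/libvirt/qemu.conf  (setting name illustrative only)
  #
  # Format version used when writing save/managedsave images.  Set to 2
  # if images must remain loadable by older libvirt.
  #save_image_version = 3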
IOW, we can
* Bump save version to 3
* Use v3 by default
* Add a SAVE_PARALLEL flag which implies mapped-ram, reject
if v2
* Use mapped RAM with BYPASS_CACHE for v3, old approach for v2
* Steal another unused field to indicate use of mapped-ram,
or perhaps future-proof it by declaring a 'features'
field, so we don't need to bump the version again, just make
sure that the libvirt loading an image supports all
set features (sketched below).
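A sketch of how that last 'features' idea could work at load time (names
illustrative only):

  #include <stdint.h>

  /* Illustrative only: a features bitmask stolen from unused[], checked
   * at load time so future format additions need no further version bump. */
  #define SAVE_FEATURE_MAPPED_RAM   (1u << 0)
  #define SAVE_FEATURES_SUPPORTED   (SAVE_FEATURE_MAPPED_RAM)

  static int
  check_save_features(uint32_t features)
  {
      if (features & ~SAVE_FEATURES_SUPPORTED)
          return -1;   /* image uses a feature this libvirt doesn't know */
      return 0;
  }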
This sounds like a reasonable start. Thanks for the feedback.
Regards,
Jim
Looking ahead, should the mapped-ram capability be required for supporting
the VIR_DOMAIN_SAVE_PARALLEL flag? As I understand it, parallel save/restore
was another motivation for creating the mapped-ram feature. It allows
multifd threads to write exclusively to the offsets provided by mapped-ram.
Can multiple multifd threads concurrently write to an fd without mapped-ram?
Yes, mapped-ram should be a pre-requisite.
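For anyone curious why: with mapped-ram every guest page has a fixed offset
in the file, so each multifd thread can pwrite() its pages independently
instead of contending for a shared stream position. Roughly (illustrative
only, not QEMU's actual code):

  #include <stdint.h>
  #include <unistd.h>

  /* Illustrative only: a fixed file offset per guest page means parallel
   * writers need no coordination over a shared stream position. */
  static ssize_t
  write_ram_page(int fd, const void *page, size_t page_size,
                 off_t ram_block_file_offset, uint64_t page_index)
  {
      off_t off = ram_block_file_offset + (off_t)(page_index * page_size);
      return pwrite(fd, page, page_size, off);
  }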
With regards,
Daniel