Re: Ways to deal with broken machine types

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Tue, 23 Mar 2021 15:40:24 -0400

On Tue, Mar 23, 2021 at 05:40:36PM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 23, 2021 at 05:54:47PM +0100, Igor Mammedov wrote:
> > Let me hijack this thread to go beyond the scope of this case.
> > 
> > I agree that for this particular bug we've done all we could, but
> > there is a broader issue to discuss here.
> > 
> > We have machine versions to deal with hw compatibility issues and that covers most of the cases,
> > but occasionally we notice a problem well after release(s),
> > so users may be stuck with a broken VM and need to manually fix the configuration (and/or VM).
> > Figuring out what's wrong and how to fix it is far from trivial. So let's discuss if we
> > can help to ease this pain; yes, it will be late for the first victims but it's still
> > better than never.
> 
> To summarize the problem situation
> 
>  - We rely on a machine type version to encode a precise guest ABI.
>  - Due to a bug, we are in a situation where the same machine type
>    encodes two distinct guest ABIs due to a mistake introduced
>    between QEMU N-2 and N-1
>  - We want to fix the bug in QEMU N
>  - For incoming migration there is no way to distinguish between
>    the ABIs used in N-2 and N-1, to pick the right one

Not just incoming migration. Same applies to a guest restart.

> So we're left with an unwinnable problem:
> 
>   - Not fixing the bug =>
> 
>        a) users migrating N-2 to N-1 have an ABI change
>        b) users migrating N-2 to N have an ABI change
>        c) users migrating N-1 to N are fine
> 
>     No mitigation for (a) or (b)
> 
>   - Fixing the bug =>
> 
>        a) users migrating N-2 to N-1 have an ABI change
>        b) users migrating N-2 to N are fine
>        c) users migrating N-1 to N have an ABI change
> 
>     Bad situations (a) and (c) are mitigated by
>     backporting fix to N-1-stable too.
> 
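
The two outcome lists above can be reproduced mechanically. This is only a
minimal sketch, assuming the ABI of each release is reduced to an opaque
label ("A" for the original ABI, "B" for the one the bug introduced); the
helper name abi_changes and the labels are made up for illustration.

```python
from itertools import combinations

def abi_changes(abi_by_version):
    """Return the (source, target) pairs whose guest ABI differs.

    abi_by_version maps a release name to an opaque ABI label and must be
    ordered oldest first, so every pair is an upgrade migration.
    """
    versions = list(abi_by_version)
    return [(src, dst) for src, dst in combinations(versions, 2)
            if abi_by_version[src] != abi_by_version[dst]]

# Hypothetical labels: "A" is the ABI of N-2, "B" is the ABI the bug created.
not_fixed = {"N-2": "A", "N-1": "B", "N": "B"}
fixed = {"N-2": "A", "N-1": "B", "N": "A"}
fixed_with_backport = {"N-2": "A", "N-1-stable": "A", "N": "A"}

for name, scenario in [("not fixed", not_fixed),
                       ("fixed", fixed),
                       ("fixed + backport", fixed_with_backport)]:
    print(name, abi_changes(scenario))
```

With the backport in place no pair changes ABI, but only for users who
actually moved to N-1-stable; anyone still on the original N-1 remains in
the "fixed" row.
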
> Generally we have preferred to fix the bug, because we have
> usually identified them fairly quickly after release, and
> backporting the fix to stable has been sufficient mitigation
> against ill effects. Basically the people left broken are a
> relatively small set out of the total userbase.
> 
> The real challenge arises when we are slow to identify the
> problem, such that we have a large number of people impacted.
> 
> 
> > I'll try to sum up the idea Michael suggested (here comes my unorganized brain-dump),
> > 
> > 1. We can keep in the VM's config the QEMU version it was created on
> >    and, as a minimum, warn the user with a pointer to known issues if the
> >    version in the config mismatches the version of the QEMU actually used,
> >    with a knob to silence the warning for a particular mismatch.
> > 
> > When an issue becomes known and is resolved, we know for sure how and what
> > changed, and can embed instructions on which options to use to fix up the VM's
> > config to preserve the old HW config, depending on the QEMU version the VM was installed on.
> 
> > some more ideas:
> >    2. let the mgmt layer keep a fixup list and apply the fixups to the config if available
> >        (the user would need to upgrade mgmt or update the fixup list somehow)
> >    3. let the mgmt layer pass the VM's QEMU version to the currently used QEMU, so
> >       that QEMU could maintain and apply fixups based on QEMU version + machine type.
> >       The user will have to upgrade to a newer QEMU to get/use new fixups.
> 
> The nice thing about machine type versioning is that we are treating the
> versions as opaque strings which represent a specific ABI, regardless of
> the QEMU version. This means that even if distros backport fixes for bugs
> or even new features, the machine type compatibility check remains a
> simple equality comparison.
> 
> As soon as you introduce the QEMU version though, we have created a
> large matrix for compatibility.

Yes, but if we explicitly handle them all the same way, then
mechanically testing them all is overkill.
We just need to test the ones that have bugs which we
care about fixing.

> This matrix is expanded if a distro
> chooses to backport fixes for any of the machine type bugs to their
> stable streams. This can get particularly expensive when there are
> multiple streams a distro is maintaining.
> 
> *IF* the original N-1 qemu has a property that could be queried by
> the mgmt app to identify a machine type bug, then we could potentially
> apply a fixup automatically.
> 
> eg query-machines command in QEMU version N could report against
> "pc-i440fx-5.0", that there was a regression fix that has to be
> applied if property "foo" had value "bar".
> 
> Now, the mgmt app wants to migrate from QEMU N-2 or N-1 to QEMU N.
> It can query the value of "foo" on the source QEMU with qom-get.
> It now knows whether it has to override this property "foo" when
> spawning QEMU N on the target host.
> 
> Of course this doesn't help us if neither N-1 nor N-2 QEMU had a
> property that can be queried to identify the bug - ie if the
> property in question was newly introduced in QEMU N to fix the
> bug.
> 
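
The mgmt-side decision described above could look roughly like this. It is
a sketch under stated assumptions, not an existing interface: query-machines
does not report regression fixes today, so the layout of the fixups table
is hypothetical, "foo"/"bar" are the placeholders from the example, and
qom_get stands for whatever wrapper the mgmt app has around the QMP qom-get
command on the source QEMU. The sketch also assumes the override should
carry over the value the guest currently sees.

```python
def overrides_for_target(machine_type, fixups, qom_get):
    """Decide which properties to override when spawning QEMU N.

    fixups:  hypothetical table, machine type -> list of dicts with the
             keys "path", "property" and "bad_value".
    qom_get: callable (path, property) -> value read from the source QEMU.
    """
    overrides = {}
    for fx in fixups.get(machine_type, []):
        current = qom_get(fx["path"], fx["property"])
        if current == fx["bad_value"]:
            # Source runs with the regressed value: keep it on the target
            # so the guest does not see an ABI change across migration.
            overrides[(fx["path"], fx["property"])] = current
    return overrides


# Hypothetical data mirroring the "foo" == "bar" example.
FIXUPS = {
    "pc-i440fx-5.0": [
        {"path": "/machine", "property": "foo", "bad_value": "bar"},
    ],
}

def fake_source_qom_get(path, prop):
    # Stand-in for a real qom-get round trip to the source QEMU.
    return "bar"

print(overrides_for_target("pc-i440fx-5.0", FIXUPS, fake_source_qom_get))
```

A machine type with no entry in the table yields no overrides, so the
common case stays a plain equality check on the machine type string.
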
> > In my opinion both would lead to an explosion of 'possibly needed' properties for each
> > change we introduce in hw/firmware (read: ACPI) and very possibly a lot of conditional
> > branches in QEMU code. And I'm afraid it will become hard to maintain QEMU =>
> > more bugs in the future.
> > Also it will lead to an explosion of the test matrix for downstreams who care about testing.
> > 
> > If we proactively gate changes on properties, we can just update fixup lists in mgmt,
> > without the need to update QEMU (aka Insite rules), at the cost of complexity on the QEMU side.
> > 
> > Alternatively we can be conservative in spawning new properties, which means creating
> > them only when an issue is fixed and requiring users to update QEMU, so that fixups could
> > be applied to the VM.
> > 
> > Feel free to shoot the messenger down or suggest ways how we can deal with the problem.
> 
> The best solution is of course to not have introduced the ABI change in
> the first place. We have lots of testing, but upstream at least, I don't
> think we have anything that is explicitly recording the ABI associated
> with each machine type and validating that it hasn't changed. We rely on
> the developers to follow the coding practices wrt setting machine type
> defaults for back compat, and while we're good, we inevitably screw up
> every now & then.
> 
> Downstreams do have some of this ABI testing - several problems like the
> one we have there have been identified when RHEL downstream QE did
> migration tests and found a change in RHEL machine types, which then
> was traced back to upstream.
> 
> I feel like we need some standard tool which can be run inside a VM
> that dumps all the possible ABI relevant information about the virtual
> machine in a nice data format.
> 
> We would have to run this for each machine type, and save the
> results to git immediately after release. Then for every change to
> master, we would have to run the test again for every historic
> machine type version and compare to the recorded ABI record.
> 
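
The compare step of such a check is the easy part. A minimal sketch,
assuming the in-guest tool emits a flat JSON object per machine type; the
key names below (acpi.dsdt.sha256 and so on) are invented for illustration
and do not come from any existing tool.

```python
import json

MISSING = "<missing>"

def abi_diff(recorded, current):
    """Return {key: (recorded_value, current_value)} for every mismatch."""
    keys = sorted(set(recorded) | set(current))
    return {k: (recorded.get(k, MISSING), current.get(k, MISSING))
            for k in keys
            if recorded.get(k, MISSING) != current.get(k, MISSING)}


# Hypothetical record saved to git right after a release ...
recorded = json.loads("""
{"acpi.dsdt.sha256": "aaaa", "pci.00:01.0.class": "0x0601", "cpu.model": "qemu64"}
""")
# ... and a dump taken on current master for the same machine type.
current = json.loads("""
{"acpi.dsdt.sha256": "bbbb", "pci.00:01.0.class": "0x0601", "cpu.model": "qemu64"}
""")

for key, (old, new) in abi_diff(recorded, current).items():
    print(f"ABI change in {key}: {old} -> {new}")
```

A check like this can only flag fields that the dump tool records in the
first place.
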
> Regards,
> Daniel

Unfortunately I do not think this is practical :(.

In all examples of breakage I am aware of, we did not
realise that some part of the interface was part of the guest ABI
and unsafe to change. We simply would not know to write a
test for it.

> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|