Re: QEMU/KVM migration backwards compatibility broken?

"Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx> · Thu, 6 Jun 2019 09:42:22 +0100

* Liran Alon (liran.alon@xxxxxxxxxx) wrote:
> Hi,
> 
> Looking at QEMU source code, I am puzzled regarding how migration backwards compatibility is preserved regarding X86CPU.
> 
> As I understand it, fields that are based on KVM capabilities and guest runtime usage are defined in VMState subsections in order to not send them if not necessary.
> This is done such that in case they are not needed and we migrate to an old QEMU which don’t support loading this state, migration will still succeed
> (As .needed() method will return false and therefore this state won’t be sent as part of migration stream).
> Furthermore, in case .needed() returns true and old QEMU don’t support loading this state, migration fails. As it should because we are aware that guest state
> is not going to be restored properly on destination.
> 
> I’m puzzled about what will happen in the following scenario:
> 1) Source is running new QEMU with new KVM that supports save of some VMState subsection.
> 2) Destination is running new QEMU that supports load this state but with old kernel that doesn’t know how to load this state.
> 
> I would have expected in this case that if source .needed() returns true, then migration will fail because of lack of support in destination kernel.
> However, it seems from current QEMU code that this will actually succeed in many cases.
> 
> For example, if msr_smi_count is sent as part of migration stream (See vmstate_msr_smi_count) and destination have has_msr_smi_count==false,
> then destination will succeed loading migration stream but kvm_put_msrs() will actually ignore env->msr_smi_count and will successfully load guest state.
> Therefore, migration will succeed even though it should have failed…
> 
> It seems to me that QEMU should have for every such VMState subsection, a .post_load() method that verifies that relevant capability is supported by kernel
> and otherwise fail migration.
> 
> What do you think? Should I really create a patch to modify all these CPUX86 VMState subsections to behave like this?

I don't know the x86 specific side that much; but from my migration side
the answer should mostly be through machine types - indeed for smi-count
there's a property 'x-migrate-smi-count' which is off for machine types
pre 2.11 (see hw/i386/pc.c pc_compat_2_11) - so if you've got an old
kernel you should stick to the old machine types.

There's nothing guarding running the new machine type on old-kernels;
and arguably we should have a check at startup that complains if
your kernel is missing something the machine type uses.
However, that would mean that people running with -M pc   would fail
on old kernels.

A post-load is also a valid check; but one question is whether,
for a particular register, the pain is worth it - it depends on the
symptom that the missing state causes.  If it's minor then you might
conclude it's not worth a failed migration;  if it's a hung or
corrupt guest then yes it is.   Certainly a warning printed is worth
it.

Dave

> Thanks,
> -Liran
--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK