Re: [Qemu-devel] [QEMU PATCH v2 0/2]: KVM: i386: Add support for save and restore nested state

> On 2 Nov 2018, at 18:39, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
> 
> On Thu, Nov 1, 2018 at 8:46 PM, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
> 
>> Hmm this makes sense.
>> 
>> This means though that the patch I have submitted here isn't good enough.
>> My patch currently assumes that when it attempts to get nested state from KVM,
>> QEMU should always set nested_state->size to the max size supported by KVM, as received
>> from kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>> (see kvm_get_nested_state() introduced in my patch).
>> This won't allow migration from a host with new KVM to a host with old KVM if
>> the nested_state size was enlarged between these KVM versions,
>> which is obviously an issue.
>> 
>> Jim, I think my confusion arose from the fact that there is no clear documentation
>> on how KVM_{GET,SET}_NESTED_STATE should be changed once we need to add more state to
>> nested_state in future KVM versions. I think it's worth adding that to the IOCTLs' documentation.
> 
> The nested state IOCTLs aren't unique in this respect. Any changes to
> the state saved by any of this whole family of state-saving ioctls
> require opt-in from userspace.
> 
>> For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
>> In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still return the
>> size of nested_state v1, and kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) return the
>> size of nested_state v2?
> 
> Hmm...I don't recall kvm_check_extension(s, KVM_CAP_NESTED_STATE)
> being part of my original design. The way I had envisioned it,
> the set of capabilities enabled by userspace would be sufficient to
> infer the maximum data size.

If the set of enabled capabilities is sufficient to infer the max size of nested_state,
why did we code kvm_vm_ioctl_check_extension() to return the max size of nested_state
for KVM_CAP_NESTED_STATE?
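
For reference, here is a minimal sketch of the userspace pattern under discussion
(plain C against the raw KVM API; not the actual patch code, and error handling is trimmed):

#include <linux/kvm.h>
#include <stdlib.h>
#include <sys/ioctl.h>

static struct kvm_nested_state *get_nested_state(int kvm_fd, int vcpu_fd)
{
    /* Max size (header + data) the running kernel can fill in. */
    int max_size = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NESTED_STATE);
    struct kvm_nested_state *state;

    if (max_size <= 0)
        return NULL;            /* Nested state not supported. */

    state = calloc(1, max_size);
    if (!state)
        return NULL;
    state->size = max_size;     /* "Save everything you have." */

    if (ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state) < 0) {
        free(state);
        return NULL;
    }
    return state;
}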

> 
> If, for example, we add a field to stash the time remaining for the
> VMCS12 VMX preemption timer, then presumably, userspace will enable it
> by enabling KVM_CAP_SAVE_VMX_PREEMPTION_TIMER (or something like
> that), and then userspace will know that the maximum nested state data
> is 4 bytes larger.

In that case, why did we define struct kvm_nested_state to hold one opaque data[] blob instead
of separating it into well-defined blobs (e.g. currently one blob for vmcs12 and another for shadow vmcs12)?
Then, when we add a new component which is opt-in via a new KVM_CAP, we would add another well-defined blob
to struct kvm_nested_state.
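
Something along these lines (purely illustrative; the flag bits, field names and sizes
below are hypothetical, not existing UAPI):

#include <linux/types.h>

/* Hypothetical flag bits saying which components below were saved. */
#define KVM_STATE_NESTED_VMCS12           (1 << 0)
#define KVM_STATE_NESTED_SHADOW_VMCS12    (1 << 1)
#define KVM_STATE_NESTED_PREEMPTION_TIMER (1 << 2)

struct kvm_nested_state {
    __u16 flags;                      /* Which components are valid. */
    __u16 format;
    __u8  vmcs12[4096];               /* One fixed, padded slot per component... */
    __u8  shadow_vmcs12[4096];        /* ...instead of one opaque data[] blob. */
    __u32 preemption_timer_remaining; /* New opt-in component, gated by a KVM_CAP. */
};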

I think this is important because it allows us to specify in nested_state->flags which components are saved,
and to create multiple VMState subsections with needed() methods for the various saved components.

This would, for example, still allow migrating from a new QEMU which stashes the time remaining for the VMCS12 VMX preemption timer
to an old QEMU which doesn't stash it, as long as nested_state->flags specifies that this component is not saved (because L1 doesn't use the VMX preemption timer, for example).

This fits more naturally with how the QEMU migration mechanism is defined and with the purpose of VMState subsections.
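
As a rough sketch of the QEMU side (the VMStateDescription/.needed machinery is real
QEMU API, but the field and flag names here are made up for illustration), assuming we
keep a copy of nested_state->flags in the vCPU state:

/* Send this subsection only if KVM reported the component as saved. */
static bool nested_preemption_timer_needed(void *opaque)
{
    X86CPU *cpu = opaque;

    return cpu->env.nested_state_flags & KVM_STATE_NESTED_PREEMPTION_TIMER;
}

static const VMStateDescription vmstate_nested_preemption_timer = {
    .name = "cpu/nested_state/preemption_timer",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = nested_preemption_timer_needed,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(env.preemption_timer_remaining, X86CPU),
        VMSTATE_END_OF_LIST()
    }
};

Hooked into the parent VMStateDescription's .subsections list, this subsection is simply
never sent when needed() returns false, so an old destination QEMU that doesn't know it
can still accept the stream.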

In addition, if we define struct kvm_nested_state like this, we also no longer need the "size" field, which has to be handled carefully to avoid buffer overflows.
(We would just define large-enough padded buffers for each opaque component, such as vmcs12 and shadow vmcs12.)

> 
>> Also note that the approach suggested by Jim requires mgmt-layer at dest
>> to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities it should enable on kvm_init().
>> When we know we are migrating from a host which supports v1 to a host which supports v2,
>> we should make sure that dest QEMU doesn't enable KVM_CAP_NESTED_STATE_V2.
>> However, when we are just launching a new machine on a host which supports v2, we do want
>> QEMU to enable KVM_CAP_NESTED_STATE_V2 for that VM.
> 
> No, no, no. Even when launching a new VM on a host that supports v2,
> you cannot enable v2 until you have passed rollback horizon. Should
> you decide to roll back the kernel with v2 support, you must be able
> to move that new VM to a host with an old kernel.

If we use VMState subsections as I described above, QEMU can know which components of nested_state were
actually saved by KVM and are therefore *required* to be restored on the dest host for the guest to run correctly after it is resumed.
This still allows migration from new hosts to old hosts, as long as the guest hasn't entered a state which makes
the newly saved components required for migration to succeed.

If the mechanism works like this, the nested_state KVM_CAPs enabled at QEMU launch are only used to inform KVM
which struct kvm_nested_state layout userspace uses, not what is actually sent as part of the migration stream.

What are your thoughts on this?

-Liran

> 
>> But on second thought, I'm not sure that this is the right approach either.
>> We don't really want the used version of nested_state to be determined at kvm_init().
>> * On source QEMU, we actually want to determine it when preparing for migration, based
>> on the support offered by our destination host. If it's an old host, we would like to
>> save an old-version nested_state, and if it's a new host, we would like to save our newest
>> supported nested_state.
>> * On dest QEMU, we will want to just be able to set received nested_state in KVM.
>> 
>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>> It seems that we would want the process to behave as follows:
>> 1) Mgmt-layer at dest queries the dest host's max supported nested_state size.
>>   (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
>> 2) Mgmt-layer at source initiates migration to dest, requesting QEMU to send a nested_state
>>   matching the dest's max supported nested_state size.
>>   When saving nested state using the KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>>   the *requested* size to be saved, and KVM should save only the information which matches
>>   the version that worked with that size.
>> 3) After some sanity checks on the received migration stream, the dest host uses the KVM_SET_NESTED_STATE IOCTL.
>>   This IOCTL should deduce which information to deploy based on the given nested_state->size.
>> 
>> This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
>> information is actually present in nested_state, instead of managing versioning with nested_state->size.
> 
> Yes, you can use nested_state->flags to determine what the data
> payload is, but you cannot enable new flags unless userspace opts in.
> This is just like KVM_CAP_EXCEPTION_PAYLOAD for kvm_vcpu_events. The
> flag, KVM_VCPUEVENT_VALID_PAYLOAD, can only be set on the saved vcpu
> events if userspace has opted in with KVM_CAP_EXCEPTION_PAYLOAD. This
> is because older kernels will reject kvm_vcpu_events that have the
> KVM_VCPUEVENT_VALID_PAYLOAD flag set.
> 
> You don't need a new KVM_CAP_NESTED_STATE_V2 ioctl. You just need
> buy-in from userspace for any new data payload. Explicitly enumerating
> the payload components in the flags field makes perfect sense.




