Ping on my last reply.
I would like to reach an agreement on what v3 should look like before just implementing what I think is right.

Thanks,
-Liran

> On 3 Nov 2018, at 4:02, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
>
>> On 2 Nov 2018, at 18:39, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
>>
>> On Thu, Nov 1, 2018 at 8:46 PM, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
>>
>>> Hmm, this makes sense.
>>>
>>> This means, though, that the patch I have submitted here isn't good enough.
>>> My patch currently assumes that when it attempts to get nested state from KVM,
>>> QEMU should always set nested_state->size to the max size supported by KVM, as returned
>>> by kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>>> (see kvm_get_nested_state() introduced in my patch).
>>> This indeed won't allow migration from a host with a new KVM to a host with an old KVM if
>>> the nested_state size was enlarged between those KVM versions.
>>> Which is obviously an issue.
>>>
>>> Jim, I think my confusion stems from the fact that there is no clear documentation
>>> on how KVM_{GET,SET}_NESTED_STATE should be changed once we need to add more state to
>>> nested_state in future KVM versions. I think it's worth adding that to the IOCTLs' documentation.
>>
>> The nested state IOCTLs aren't unique in this respect. Any changes to
>> the state saved by any of this whole family of state-saving ioctls
>> require opt-in from userspace.
>>
>>> For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
>>> In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still return the
>>> size of nested_state v1, and kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) the
>>> size of nested_state v2?
>>
>> Hmm... I don't recall kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>> being part of my original design. The way I had envisioned it,
>> the set of capabilities enabled by userspace would be sufficient to
>> infer the maximum data size.
>
> If the set of capabilities is sufficient to infer the max size of nested_state,
> why did we code kvm_vm_ioctl_check_extension() such that for KVM_CAP_NESTED_STATE
> it returns the max size of nested_state?
>
>> If, for example, we add a field to stash the time remaining for the
>> VMCS12 VMX preemption timer, then presumably, userspace will enable it
>> by enabling KVM_CAP_SAVE_VMX_PREEMPTION_TIMER (or something like
>> that), and then userspace will know that the maximum nested state data
>> is 4 bytes larger.
>
> In that case, why did we define struct kvm_nested_state to hold a blob of data[] instead
> of separating the blob into well-defined blobs (currently, e.g., one blob for vmcs12 and another for shadow vmcs12)?
> Then, when we add a new component which is opt-in via a new KVM_CAP, we would add another well-defined blob
> to struct kvm_nested_state.
>
> I think this is important because it allows us to specify in nested_state->flags which components are saved,
> and to create multiple VMState subsections with needed() methods for the various saved components.
>
> That would, for example, make it easy to still migrate from a new QEMU which does stash the time remaining for the VMCS12 VMX preemption timer
> to an old QEMU which doesn't stash it, as long as nested_state->flags specifies that this component is not saved (because L1 doesn't use the VMX preemption timer, for example).
>
> This fits more nicely with how the QEMU migration mechanism is defined and with the purpose of VMState subsections.
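>
> To make this concrete, here is a rough sketch of the kind of layout I have in mind.
> The field names, flag bits and buffer sizes below are illustrative only, not the actual uapi:
>
>     #include <linux/types.h>
>
>     /* Illustrative sketch of the proposed layout -- NOT the actual uapi. */
>     struct kvm_nested_state {
>             __u16 flags;    /* KVM_STATE_NESTED_*: which components below are valid */
>             __u16 format;
>             __u32 pad;
>
>             /* One well-defined, padded buffer per opaque component,
>              * instead of a single variable-sized data[] guarded by "size": */
>             __u8  vmcs12[0x1000];
>             __u8  shadow_vmcs12[0x1000];
>
>             /* A future opt-in component (guarded by a new KVM_CAP and a
>              * new flag bit) would get its own fixed field, e.g.: */
>             __u32 vmx_preemption_timer_remaining;
>     };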
> In addition, if we define struct kvm_nested_state like this, we also won't need the "size" field, which has to be handled carefully to avoid buffer overflows.
> (We would just define large enough buffers (with padding) for each opaque component, such as vmcs12 and shadow vmcs12.)
>
>>> Also note that the approach suggested by Jim requires the mgmt-layer at the dest
>>> to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities it should enable on kvm_init().
>>> When we know we are migrating from a host which supports v1 to a host which supports v2,
>>> we should make sure that the dest QEMU doesn't enable KVM_CAP_NESTED_STATE_V2.
>>> However, when we are just launching a new machine on a host which supports v2, we do want
>>> QEMU to enable KVM_CAP_NESTED_STATE_V2 for that VM.
>>
>> No, no, no. Even when launching a new VM on a host that supports v2,
>> you cannot enable v2 until you have passed the rollback horizon. Should
>> you decide to roll back the kernel with v2 support, you must be able
>> to move that new VM to a host with an old kernel.
>
> If we use VMState subsections as I described above, QEMU should be able to know which components of nested_state are
> actively saved by KVM and are therefore *required* to be restored on the dest host in order to migrate without guest issues after the guest is resumed on the dest.
> That would still allow migration from new hosts to old hosts, as long as the guest hasn't entered a state which makes the newly saved state required
> for migration to succeed.
>
> If the mechanism works like this, the nested_state KVM_CAPs enabled on QEMU launch are only used to inform KVM which
> struct kvm_nested_state is used by userspace, not what is actually sent as part of the migration stream.
>
> What are your thoughts on this?
>
> -Liran
>
>>> But on second thought, I'm not sure that this is the right approach either.
>>> We don't really want the version of nested_state that is used to be determined at kvm_init().
>>> * On the source QEMU, we actually want to determine it when preparing for migration, based
>>> on the support offered by our destination host. If it's an old host, we would like to
>>> save an old-version nested_state, and if it's a new host, we would like to save our newest
>>> supported nested_state.
>>> * On the dest QEMU, we just want to be able to set the received nested_state in KVM.
>>>
>>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>>> It seems that we would want the process to behave as follows:
>>> 1) The mgmt-layer at the dest queries the dest host's max supported nested_state size
>>> (which should be returned by kvm_check_extension(KVM_CAP_NESTED_STATE)).
>>> 2) The mgmt-layer at the source initiates migration to the dest, requesting QEMU to send a nested_state
>>> matching the dest's max supported nested_state size.
>>> When saving nested state using the KVM_GET_NESTED_STATE IOCTL, QEMU specifies in nested_state->size
>>> the *requested* size to be saved, and KVM should save only the information which matches
>>> the version that used that size.
>>> 3) After some sanity checks on the received migration stream, the dest host uses the KVM_SET_NESTED_STATE IOCTL.
>>> This IOCTL should deduce which information to deploy based on the given nested_state->size.
>>>
>>> This also makes me wonder whether it isn't just nicer to use nested_state->flags to specify which
>>> information is actually present in nested_state, instead of managing versioning with nested_state->size.
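>>> On the QEMU side, a flags-gated component could then be tied to a VMState subsection
>>> along these lines (a rough sketch: KVM_STATE_VMX_PREEMPTION_TIMER, env.nested_state and
>>> env.preemption_timer_remaining are made-up names, not existing identifiers):
>>>
>>>     /* e.g. in target/i386/machine.c */
>>>     static bool vmx_preemption_timer_needed(void *opaque)
>>>     {
>>>         X86CPU *cpu = opaque;
>>>
>>>         /* Send the subsection only when KVM marked this component as
>>>          * saved in nested_state->flags, so migration to an old QEMU
>>>          * still works whenever L1 doesn't use the preemption timer. */
>>>         return cpu->env.nested_state->flags & KVM_STATE_VMX_PREEMPTION_TIMER;
>>>     }
>>>
>>>     static const VMStateDescription vmstate_vmx_preemption_timer = {
>>>         .name = "cpu/vmx_preemption_timer",
>>>         .version_id = 1,
>>>         .minimum_version_id = 1,
>>>         .needed = vmx_preemption_timer_needed,
>>>         .fields = (VMStateField[]) {
>>>             VMSTATE_UINT32(env.preemption_timer_remaining, X86CPU),
>>>             VMSTATE_END_OF_LIST()
>>>         }
>>>     };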
>>
>> Yes, you can use nested_state->flags to determine what the data
>> payload is, but you cannot enable new flags unless userspace opts in.
>> This is just like KVM_CAP_EXCEPTION_PAYLOAD for kvm_vcpu_events. The
>> flag, KVM_VCPUEVENT_VALID_PAYLOAD, can only be set on the saved vcpu
>> events if userspace has opted in with KVM_CAP_EXCEPTION_PAYLOAD. This
>> is because older kernels will reject kvm_vcpu_events that have the
>> KVM_VCPUEVENT_VALID_PAYLOAD flag set.
>>
>> You don't need a new KVM_CAP_NESTED_STATE_V2 capability. You just need
>> buy-in from userspace for any new data payload. Explicitly enumerating
>> the payload components in the flags field makes perfect sense.
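>>
>> For reference, that opt-in pattern looks roughly like this from
>> userspace (a sketch with raw ioctls; error handling elided):
>>
>>     #include <linux/kvm.h>
>>     #include <sys/ioctl.h>
>>
>>     void get_events_with_payload(int vm_fd, int vcpu_fd)
>>     {
>>         struct kvm_enable_cap cap = {
>>             .cap = KVM_CAP_EXCEPTION_PAYLOAD,
>>             .args[0] = 1,
>>         };
>>         struct kvm_vcpu_events events;
>>
>>         /* Without this opt-in, the kernel never sets
>>          * KVM_VCPUEVENT_VALID_PAYLOAD in the saved events, so the
>>          * result stays acceptable to an old kernel on the other end. */
>>         ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
>>
>>         ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);
>>         /* events.flags may now include KVM_VCPUEVENT_VALID_PAYLOAD;
>>          * an old kernel would reject KVM_SET_VCPU_EVENTS with it set. */
>>     }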