Ping on my last reply.
I would like to reach an agreement on what v3 should look like before just implementing what I think is right.

Thanks,
-Liran

> On 3 Nov 2018, at 4:02, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
>
>> On 2 Nov 2018, at 18:39, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
>>
>> On Thu, Nov 1, 2018 at 8:46 PM, Liran Alon <liran.alon@xxxxxxxxxx> wrote:
>>
>>> Hmm, this makes sense.
>>>
>>> This means, though, that the patch I have submitted here isn't good enough.
>>> My patch currently assumes that when it attempts to get nested state from KVM,
>>> QEMU should always set nested_state->size to the max size supported by KVM, as returned
>>> by kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>>> (see kvm_get_nested_state() introduced in my patch).
>>> This indeed won't allow migration from a host with a new KVM to a host with an old KVM if
>>> the nested_state size was enlarged between those KVM versions.
>>> Which is obviously an issue.
>>>
>>> Jim, I think my confusion stems from the fact that there is no clear documentation
>>> on how KVM_{GET,SET}_NESTED_STATE should be changed once we need to add more state to
>>> nested_state in future KVM versions. I think it's worth adding that to the IOCTLs' documentation.
>>
>> The nested state IOCTLs aren't unique in this respect. Any changes to
>> the state saved by any of this whole family of state-saving ioctls
>> require opt-in from userspace.
>>
>>> For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
>>> In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still return the
>>> size of nested_state v1, and kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) the
>>> size of nested_state v2?
>>
>> Hmm... I don't recall kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>> being part of my original design. The way I had envisioned it,
>> the set of capabilities enabled by userspace would be sufficient to
>> infer the maximum data size.
>
> If the set of capabilities is sufficient to infer the max size of nested_state,
> why did we code kvm_vm_ioctl_check_extension() such that for KVM_CAP_NESTED_STATE
> it returns the max size of nested_state?
>
>> If, for example, we add a field to stash the time remaining for the
>> VMCS12 VMX preemption timer, then presumably, userspace will enable it
>> by enabling KVM_CAP_SAVE_VMX_PREEMPTION_TIMER (or something like
>> that), and then userspace will know that the maximum nested state data
>> is 4 bytes larger.
>
> In that case, why did we define struct kvm_nested_state to hold a blob of data[] instead
> of separating the blob into well-defined blobs (currently, e.g., one blob for vmcs12 and another for shadow vmcs12)?
> Then, when we add a new component which is opt-in via a new KVM_CAP, we would add another well-defined blob
> to struct kvm_nested_state.
>
> I think this is important because it allows us to specify in nested_state->flags which components are saved,
> and to create multiple VMState subsections with needed() methods for the various saved components.
>
> That would, for example, make it easy to still migrate from a new QEMU which does stash the time remaining for the VMCS12 VMX preemption timer
> to an old QEMU which doesn't stash it, as long as nested_state->flags specifies that this component is not saved (because L1 doesn't use the VMX preemption timer, for example).
>
> This fits more nicely with how the QEMU migration mechanism is defined and with the purpose of VMState subsections.
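>
> To make this concrete, here is a rough sketch of the kind of layout I have in mind.
> The field names, flag bits and buffer sizes below are illustrative only, not the actual uapi:
>
>     #include <linux/types.h>
>
>     /* Illustrative sketch of the proposed layout -- NOT the actual uapi. */
>     struct kvm_nested_state {
>             __u16 flags;    /* KVM_STATE_NESTED_*: which components below are valid */
>             __u16 format;
>             __u32 pad;
>
>             /* One well-defined, padded buffer per opaque component,
>              * instead of a single variable-sized data[] guarded by "size": */
>             __u8  vmcs12[0x1000];
>             __u8  shadow_vmcs12[0x1000];
>
>             /* A future opt-in component (guarded by a new KVM_CAP and a
>              * new flag bit) would get its own fixed field, e.g.: */
>             __u32 vmx_preemption_timer_remaining;
>     };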
> In addition, if we define struct kvm_nested_state like this, we also won't need the "size" field, which has to be handled carefully to avoid buffer overflows.
> (We would just define large enough buffers (with padding) for each opaque component, such as vmcs12 and shadow vmcs12.)
>
>>> Also note that the approach suggested by Jim requires the mgmt-layer at the dest
>>> to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities it should enable on kvm_init().
>>> When we know we are migrating from a host which supports v1 to a host which supports v2,
>>> we should make sure that the dest QEMU doesn't enable KVM_CAP_NESTED_STATE_V2.
>>> However, when we are just launching a new machine on a host which supports v2, we do want
>>> QEMU to enable KVM_CAP_NESTED_STATE_V2 for that VM.
>>
>> No, no, no. Even when launching a new VM on a host that supports v2,
>> you cannot enable v2 until you have passed the rollback horizon. Should
>> you decide to roll back the kernel with v2 support, you must be able
>> to move that new VM to a host with an old kernel.
>
> If we use VMState subsections as I described above, QEMU should be able to know which components of nested_state are
> actively saved by KVM and are therefore *required* to be restored on the dest host in order to migrate without guest issues after the guest is resumed on the dest.
> That would still allow migration from new hosts to old hosts, as long as the guest hasn't entered a state which makes the newly saved state required
> for migration to succeed.
>
> If the mechanism works like this, the nested_state KVM_CAPs enabled on QEMU launch are only used to inform KVM which
> struct kvm_nested_state is used by userspace, not what is actually sent as part of the migration stream.
>
> What are your thoughts on this?
>
> -Liran
>
>>> But on second thought, I'm not sure that this is the right approach either.
>>> We don't really want the version of nested_state that is used to be determined at kvm_init().
>>> * On the source QEMU, we actually want to determine it when preparing for migration, based
>>> on the support offered by our destination host. If it's an old host, we would like to
>>> save an old-version nested_state, and if it's a new host, we would like to save our newest
>>> supported nested_state.
>>> * On the dest QEMU, we just want to be able to set the received nested_state in KVM.
>>>
>>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>>> It seems that we would want the process to behave as follows:
>>> 1) The mgmt-layer at the dest queries the dest host's max supported nested_state size
>>> (which should be returned by kvm_check_extension(KVM_CAP_NESTED_STATE)).
>>> 2) The mgmt-layer at the source initiates migration to the dest, requesting QEMU to send a nested_state
>>> matching the dest's max supported nested_state size.
>>> When saving nested state using the KVM_GET_NESTED_STATE IOCTL, QEMU specifies in nested_state->size
>>> the *requested* size to be saved, and KVM should save only the information which matches
>>> the version that used that size.
>>> 3) After some sanity checks on the received migration stream, the dest host uses the KVM_SET_NESTED_STATE IOCTL.
>>> This IOCTL should deduce which information to deploy based on the given nested_state->size.
>>>
>>> This also makes me wonder whether it isn't just nicer to use nested_state->flags to specify which
>>> information is actually present in nested_state, instead of managing versioning with nested_state->size.
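>>> On the QEMU side, a flags-gated component could then be tied to a VMState subsection
>>> along these lines (a rough sketch: KVM_STATE_VMX_PREEMPTION_TIMER, env.nested_state and
>>> env.preemption_timer_remaining are made-up names, not existing identifiers):
>>>
>>>     /* e.g. in target/i386/machine.c */
>>>     static bool vmx_preemption_timer_needed(void *opaque)
>>>     {
>>>         X86CPU *cpu = opaque;
>>>
>>>         /* Send the subsection only when KVM marked this component as
>>>          * saved in nested_state->flags, so migration to an old QEMU
>>>          * still works whenever L1 doesn't use the preemption timer. */
>>>         return cpu->env.nested_state->flags & KVM_STATE_VMX_PREEMPTION_TIMER;
>>>     }
>>>
>>>     static const VMStateDescription vmstate_vmx_preemption_timer = {
>>>         .name = "cpu/vmx_preemption_timer",
>>>         .version_id = 1,
>>>         .minimum_version_id = 1,
>>>         .needed = vmx_preemption_timer_needed,
>>>         .fields = (VMStateField[]) {
>>>             VMSTATE_UINT32(env.preemption_timer_remaining, X86CPU),
>>>             VMSTATE_END_OF_LIST()
>>>         }
>>>     };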
>>
>> Yes, you can use nested_state->flags to determine what the data
>> payload is, but you cannot enable new flags unless userspace opts in.
>> This is just like KVM_CAP_EXCEPTION_PAYLOAD for kvm_vcpu_events. The
>> flag, KVM_VCPUEVENT_VALID_PAYLOAD, can only be set on the saved vcpu
>> events if userspace has opted in with KVM_CAP_EXCEPTION_PAYLOAD. This
>> is because older kernels will reject kvm_vcpu_events that have the
>> KVM_VCPUEVENT_VALID_PAYLOAD flag set.
>>
>> You don't need a new KVM_CAP_NESTED_STATE_V2 capability. You just need
>> buy-in from userspace for any new data payload. Explicitly enumerating
>> the payload components in the flags field makes perfect sense.
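>>
>> For reference, that opt-in pattern looks roughly like this from
>> userspace (a sketch with raw ioctls; error handling elided):
>>
>>     #include <linux/kvm.h>
>>     #include <sys/ioctl.h>
>>
>>     void get_events_with_payload(int vm_fd, int vcpu_fd)
>>     {
>>         struct kvm_enable_cap cap = {
>>             .cap = KVM_CAP_EXCEPTION_PAYLOAD,
>>             .args[0] = 1,
>>         };
>>         struct kvm_vcpu_events events;
>>
>>         /* Without this opt-in, the kernel never sets
>>          * KVM_VCPUEVENT_VALID_PAYLOAD in the saved events, so the
>>          * result stays acceptable to an old kernel on the other end. */
>>         ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
>>
>>         ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);
>>         /* events.flags may now include KVM_VCPUEVENT_VALID_PAYLOAD;
>>          * an old kernel would reject KVM_SET_VCPU_EVENTS with it set. */
>>     }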