Certain VMX state cannot be extracted from the kernel today. As you point out, this includes the vCPU's VMX operating mode {legacy, VMX root operation, VMX non-root operation}, the current VMCS GPA (if any), and the VMXON region GPA (if any). Perhaps these could be appended to the state(s) extracted by one or more existing APIs rather than introducing a new API, but I think there's sufficient justification here for a new GET/SET_NESTED_STATE API.

Most L2 guest state can already be extracted by existing APIs, like GET_SREGS. However, restoring it is a bit problematic today. SET_SREGS will write into the current VMCS, but we have no existing mechanism for transferring guest state from vmcs01 to vmcs02. On restore, do we want to dictate that the vCPU's VMX operating mode has to be restored before SET_SREGS is called, or do we provide a mechanism for transferring vmcs01 guest state to vmcs02? If we do dictate that the vCPU's operating mode has to be restored first, then SET_SREGS will naturally write into vmcs02, but we'll have to create a mechanism for building an initial vmcs02 out of nothing. The only mechanism we have today for building a vmcs02 starts with a vmcs12.

Building on that mechanism, it is fairly straightforward to write GET/SET_NESTED_STATE. Though there is quite a bit of redundancy with GET/SET_SREGS, GET/SET_VCPU_EVENTS, etc., if you capture all of the L2 state in VMCS12 format, you can restore it pretty easily using the existing infrastructure, without worrying about the ordering of the SET_* ioctls.

Today, the cached VMCS12 is loaded when the guest executes VMPTRLD, primarily as a defense against the guest modifying VMCS12 fields in memory after the hypervisor has checked their validity. There were a lot of time-of-check to time-of-use security issues before the cached VMCS12 was introduced. Conveniently, all but the host state of the cached VMCS12 is dead once the vCPU enters L2, so it seemed like a reasonable place to stuff the current L2 state for later restoration.

But why pass the cached VMCS12 as a separate vCPU state component rather than writing it back to guest memory as part of the "save vCPU state" sequence? One reason is that it is a bit awkward for GET_NESTED_STATE to modify guest memory. I don't know about qemu, but our userspace agent expects guest memory to be quiesced by the time it starts going through its sequence of GET_* ioctls. Sure, we could introduce a pre-migration ioctl, but is that the best way to handle this?

Another reason is that it is a bit awkward for SET_NESTED_STATE to require guest memory. Again, I don't know about qemu, but our userspace agent does not expect any guest memory to be available when it starts going through its sequence of SET_* ioctls. Sure, we could prefetch the guest page containing the current VMCS12, but is that better than simply including the current VMCS12 in the NESTED_STATE payload? Moreover, these unpredictable (from the guest's point of view) updates to guest memory leave a bad taste in my mouth (much like SMM).

Perhaps qemu doesn't have the same limitations that our userspace agent has, and I can certainly see why you would dismiss my concerns if you are only interested in qemu as a userspace agent for kvm. At the same time, I hope you can understand why I am not excited to be drawn down a path that's going to ultimately mean more headaches for me in my environment. AFAICT, the proposed API doesn't introduce any additional headaches for those that use qemu.
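To make that concrete, here's a rough sketch of how the blob might be laid out and consumed from userspace. The struct layout, field names, and flag semantics below are illustrative only, not a proposed ABI; only the ioctl names (GET/SET_NESTED_STATE) come from the proposal, and the constants would of course have to come from the corresponding linux/kvm.h additions:

#include <err.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/*
 * Illustrative layout only -- not a proposed ABI. The point is that
 * the payload carries the VMX state the kernel can't export today,
 * plus the cached VMCS12 holding the L2 state in VMCS12 format.
 */
struct nested_state_blob {
	__u64 vmxon_ptr;     /* VMXON region GPA, or -1 if VMX is off  */
	__u64 current_vmcs;  /* current VMCS GPA, or -1 if none        */
	__u32 flags;         /* e.g. a bit for "in VMX non-root mode"  */
	__u8  vmcs12[4096];  /* cached VMCS12, in VMCS12 format        */
};

void save_restore_sketch(int vcpu_fd)
{
	struct nested_state_blob blob;

	/* Save side: guest memory is already quiesced and is not
	 * touched; everything travels in the blob. */
	if (ioctl(vcpu_fd, KVM_GET_NESTED_STATE, &blob) < 0)
		err(1, "KVM_GET_NESTED_STATE");

	/* Restore side: guest memory need not be present yet, and
	 * this can run before or after SET_SREGS et al., since the
	 * L2 state is rebuilt from the VMCS12 via the existing
	 * vmcs12 -> vmcs02 path. */
	if (ioctl(vcpu_fd, KVM_SET_NESTED_STATE, &blob) < 0)
		err(1, "KVM_SET_NESTED_STATE");
}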
The principal objections appear to be the "blob" of data, completely unstructured in the eyes of the userspace agent, and the redundancy with state already extracted by existing APIs. Is that right?

On Tue, Dec 19, 2017 at 9:40 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
> On 19.12.2017 18:33, David Hildenbrand wrote:
>> On 19.12.2017 18:26, Jim Mattson wrote:
>>> Yes, it can be done that way, but what makes this approach technically
>>> superior to the original API?
>>
>> a) not having to migrate data twice
>> b) not having to think about a proper API to get data in/out
>>
>> All you need to know is, if the guest was in nested mode when migrating,
>> no? That would be a simple flag.
>>
>
> (of course in addition, vmcsptr and if vmxon has been called).
>
> But anyhow, if you have good reasons why you want to introduce and
> maintain a new API, feel free to do so. Most likely I am missing
> something important here :)
>
> --
>
> Thanks,
>
> David / dhildenb