On 19.12.2017 20:21, Jim Mattson wrote:
> Certain VMX state cannot be extracted from the kernel today. As you
> point out, this includes the vCPU's VMX operating mode {legacy, VMX
> root operation, VMX non-root operation}, the current VMCS GPA (if
> any), and the VMXON region GPA (if any). Perhaps these could be
> appended to the state(s) extracted by one or more existing APIs rather
> than introducing a new API, but I think there's sufficient
> justification here for a new GET/SET_NESTED_STATE API.
>
> Most L2 guest state can already be extracted by existing APIs, like
> GET_SREGS. However, restoring it is a bit problematic today. SET_SREGS
> will write into the current VMCS, but we have no existing mechanism
> for transferring guest state from vmcs01 to vmcs02. On restore, do we
> want to dictate that the vCPU's VMX operating mode has to be restored
> before SET_SREGS is called, or do we provide a mechanism for
> transferring vmcs01 guest state to vmcs02? If we do dictate that the
> vCPU's operating mode has to be restored first, then SET_SREGS will
> naturally write into vmcs02, but we'll have to create a mechanism for
> building an initial vmcs02 out of nothing.

A small rant before the Christmas holiday :)

The more I look at nested VMX, the more headaches I get. I know
somebody thought it was a good idea to reuse the same guest execution
loop for a nested guest. Simply swap some registers (introducing tons
of BUGs, e.g. related to interrupt injection, at least that's my
impression) and there you go. I think this is a huge mess.

As you say, state extraction/injection is a mess. As user space, you
don't really know what you're holding in your hand. And from that
stems the desire to just "dump out a huge blob of state and feed it
back into the migration target". I can understand that. But I wonder
if it has to be this way.

Sometimes nested VMX feels like a huge hack into the existing VMX
module. I once attempted to factor out certain pieces, and finally
gave up. Maybe it is me and not the code :)

>
> The only mechanism we have today for building a vmcs02 starts with a
> vmcs12. Building on that mechanism, it is fairly straightforward to
> write GET/SET_NESTED_STATE. Though there is quite a bit of redundancy
> with GET/SET_SREGS, GET/SET_VCPU_EVENTS, etc., if you capture all of
> the L2 state in VMCS12 format, you can restore it pretty easily using
> the existing infrastructure, without worrying about the ordering of
> the SET_* ioctls.

And exactly the redundancy you are talking about is what I am afraid
of. Interfaces should be kept simple. If we can hide complexity, do
so. But it all stems from the fact that we turn nested guest state
into guest state and exit to user space that way.

>
> Today, the cached VMCS12 is loaded when the guest executes VMPTRLD,
> primarily as a defense against the guest modifying VMCS12 fields in
> memory after the hypervisor has checked their validity. There were a
> lot of time-of-check to time-of-use security issues before the cached
> VMCS12 was introduced. Conveniently, all but the host state of the
> cached VMCS12 is dead once the vCPU enters L2, so it seemed like a
> reasonable place to stuff the current L2 state for later restoration.
> But why pass the cached VMCS12 as a separate vCPU state component
> rather than writing it back to guest memory as part of the "save vCPU
> state" sequence?
>
> One reason is that it is a bit awkward for GET_NESTED_STATE to modify
> guest memory.
> I don't know about qemu, but our userspace agent expects guest memory
> to be quiesced by the time it starts going through its sequence of
> GET_* ioctls. Sure, we could introduce a pre-migration ioctl, but is
> that the best way to handle this? Another reason is that it is a bit
> awkward for SET_NESTED_STATE to require guest memory. Again, I don't
> know about qemu, but our userspace agent does not expect any guest
> memory to be available when it starts going through its sequence of
> SET_* ioctls. Sure, we could prefetch the guest page containing the
> current VMCS12, but is that better than simply including the current
> VMCS12 in the NESTED_STATE payload? Moreover, these unpredictable
> (from the guest's point of view) updates to guest memory leave a bad
> taste in my mouth (much like SMM).

In my perfect world, we would never exit to user space with nested
guest registers. When migrating, all state is implicitly contained in
guest memory. I might have a bit of a biased opinion here. On s390x,
we were able to make nested virtualization work completely
transparently to user space:

1. Every time we exit to user space, guest registers are guest
   registers.
2. We have a separate execution loop for nested virtualization.
3. Nested guest state is completely contained in guest memory whenever
   we stop executing the nested guest (to return to the guest or to
   user space).

We are able to do that by not remembering whether we were in nested
mode or not. When exiting to QEMU and reentering, we simply return to
L1. L1 will figure out that there is nothing to do ("0 intercept") and
simply rerun its guest, resulting in us running L2. But even if we
wanted to continue running L1, it would be as easy as rewinding the PC
to point again at the SIE instruction before exiting to user space, or
introducing a separate flag. (At least I think it would be that easy.)

Not saying the s390x implementation is perfect, but it seems to be way
easier to handle compared to what we have with VMX right now.

>
> Perhaps qemu doesn't have the same limitations that our userspace
> agent has, and I can certainly see why you would dismiss my concerns
> if you are only interested in qemu as a userspace agent for kvm. At
> the same time, I hope you can understand why I am not excited to be
> drawn down a path that's going to ultimately mean more headaches for
> me in my environment. AFAICT, the proposed API doesn't introduce any
> additional headaches for those that use qemu. The principal
> objections appear to be the "blob" of data, completely unstructured
> in the eyes of the userspace agent, and the redundancy with state
> already extracted by existing APIs. Is that right?

I am wondering if we should rather try to redesign nVMX to work
somewhat similarly to the s390x way of running nested guests before
we introduce a new API that makes the current design "fixed". Having
said that, I assume this is highly unlikely (and many people will
most likely object to what I have written in this mail).

So please feel free to ignore what I have said in this email and go
forward with your approach. Having migration is better than not
having migration.

Yes, I am a dreamer. :D

--

Thanks,

David / dhildenb
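
P.S.: To make the kind of payload we are discussing a little more
concrete, here is a rough sketch of what such a nested-state blob
could carry. This is purely illustrative -- the struct and field names
are made up, not the ABI you are proposing -- but it mirrors the state
you list as inaccessible today: the VMX operating mode, the VMXON
region GPA, the current VMCS GPA, plus the cached VMCS12 as the opaque
part.

/*
 * Purely illustrative sketch -- made-up names, not a proposed ABI.
 * It only mirrors the state discussed above that cannot be extracted
 * from the kernel today.
 */
#include <linux/types.h>

#define NESTED_MODE_LEGACY        0  /* not in VMX operation */
#define NESTED_MODE_VMX_ROOT      1  /* L1, VMX root operation */
#define NESTED_MODE_VMX_NON_ROOT  2  /* L2, VMX non-root operation */

struct nested_state_sketch {
	__u32 mode;              /* one of the NESTED_MODE_* values */
	__u32 vmcs12_size;       /* valid bytes in vmcs12[] */
	__u64 vmxon_region_gpa;  /* ~0ULL if VMXON was never executed */
	__u64 current_vmcs_gpa;  /* ~0ULL if there is no current VMCS */
	__u8  vmcs12[4096];      /* cached VMCS12, opaque to user space */
};

User space would treat vmcs12[] as an opaque blob and simply feed it
back on the migration target, which is exactly the property I am
uneasy about.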