The other unfortunate thing about flushing the "current" VMCS12 state to
guest memory on each L2->userspace transition is that much of this state
lives in the VMCS02. So it's not just a matter of writing a VMCS12_SIZE
blob to guest memory; first, the cached VMCS12 has to be updated from the
VMCS02 by calling sync_vmcs12() (a rough sketch of that path is at the end
of this message). This will be particularly bad for double nesting, where
L1 KVM has to take all of those VMREAD VM-exits. If you still prefer this
method, I will go ahead and do it, but I remain opposed.

On Wed, Feb 15, 2017 at 1:28 PM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
>
> On 15/02/2017 19:19, Jim Mattson wrote:
>> On Wed, Feb 15, 2017 at 9:37 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>>> Yes, I think it is better. The really nasty part of
>>> "KVM_GET_NESTED_STATE is not transparent to the guest" is that
>>> KVM_GET_NESTED_STATE typically happens while the guest is quiescent and
>>> userspace doesn't expect guest memory to change. (2) is a special case
>>> of this.
>>>
>>> Instead, flushing the VMCS on L2->userspace exit happens during KVM_RUN,
>>> which should not break any invariants that userspace relies on.
>>
>> It is certainly within the bounds of the Intel specification to flush the
>> VMCS cache at any time, but I think it would be better if the VCPU behaved more
>> deterministically (with respect to guest-visible events).
>
> Yes, it's a trade off. Of course if you think of SMIs on physical
> hardware, determinism is already thrown away. But I understand that we
> can do better than physical processors in this regard. :)
>
>> Just a proposal. I think a separate vmcs02 has the potential for faster emulated
>> VM-entry/VM-exit, but at the cost of greater code complexity. A unified vmcs0X
>> could certainly be as fast as the existing implementation and a lot simpler.
>
> I agree. As I said, we need to benchmark it either way. I honestly
> have no idea where time is spent (on a processor with VMCS shadowing) in
> a tight loop of L1->L2->L1 entries and exits.
>
>>>> To be honest, I'm not sure what it is that you are trying to optimize
>>>> for. Is your concern the extra 4K of VCPU state? I would argue that
>>>> the VMCS cache is part of the physical CPU state (though on a physical
>>>> CPU, not every field is cached), and that it therefore makes sense to
>>>> include the VMCS cache as part of the VCPU state, apart from guest
>>>> physical memory.
>>>
>>> It's not really the extra 4K, but rather the ugly "blob" aspect of it
>>> and having to maintain compatibility for it. Regarding the blobbiness,
>>> see for example KVM_GET/SET_XSAVE, where you can actually marshal and
>>> unmarshal the contents of the XSAVE area into the registers. Regarding
>>> compatibility, I'm worried that once someone looks at SVM more closely,
>>> we will have to store the VMCB in GET_NESTED_STATE too.
>>
>> Instead of a "blob," we could pass it out to userspace as a 'struct
>> vmcs12,' padded out to 4k for compatibility purposes. Then userspace
>> could marshal and unmarshal the contents, and it would be a little
>> easier than marshalling and unmarshalling the same contents in guest
>> physical memory.
>
> You mean marshalling/unmarshalling it for stuff such as extracting the
> EPTP? Yeah, that's a good reason to keep it in GET_NESTED_STATE.
>
> OTOH, it's more work to do in userspace. Tradeoffs...
>
>> I don't know what to say about SVM. I think we should be prepared to
>> store the VMCB at some point, even if we don't now.
>
> The in-memory VMCB is always consistent with processor state after
> VMRUN, but not necessarily during it, so I agree. Of course the same
> solution of flushing it out on L2->userspace exit would work, though I
> understand why you think it's a hack.
>
> Paolo
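
To make the cost I'm worried about concrete, here is a rough sketch of
what the flush path at the top of this message has to do. This is not the
actual sync_vmcs12(); it just assumes KVM's existing vmcs_read*() helpers,
struct vmcs12, and kvm_write_guest(). The point is that every field L2 may
have touched has to come out of the vmcs02 via VMREAD before the 4K image
can be written back to L1's memory, and under double nesting each VMREAD
that isn't covered by VMCS shadowing is a VM-exit into L1 KVM:

static void flush_cached_vmcs12_sketch(struct vcpu_vmx *vmx,
				       struct vmcs12 *vmcs12)
{
	/*
	 * Guest state that only exists in the vmcs02 while L2 runs.
	 * Each vmcs_read*() is a VMREAD; when L1 is itself KVM and the
	 * field is not shadowed, that VMREAD is a VM-exit to L1.
	 */
	vmcs12->guest_rip         = vmcs_readl(GUEST_RIP);
	vmcs12->guest_rsp         = vmcs_readl(GUEST_RSP);
	vmcs12->guest_rflags      = vmcs_readl(GUEST_RFLAGS);
	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
	vmcs12->guest_interruptibility_info =
		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
	/* ... dozens more fields ... */

	/*
	 * Only after all of the reads above can the cached image be
	 * written back to the page L1 handed us via VMPTRLD.
	 */
	kvm_write_guest(vmx->vcpu.kvm, vmx->nested.current_vmptr,
			vmcs12, VMCS12_SIZE);
}

On bare metal with VMCS shadowing most of those reads stay in hardware,
which is why single-level nesting doesn't really feel this; the
double-nested case is where it hurts.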
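
On the "struct vmcs12, padded out to 4k" suggestion and the EPTP example
above: the only claim is that a typed image is easier for userspace to
pick apart than an opaque blob or a walk through L2's guest-physical
memory. A purely hypothetical userspace fragment (KVM exports no such
layout today, and every name below is made up for illustration) could be
as simple as:

#include <stdint.h>

/*
 * Hypothetical userspace mirror of (part of) the vmcs12 image that a
 * GET_NESTED_STATE-style ioctl might hand back; the real struct vmcs12
 * has many more fields before and after this one, and its layout would
 * have to become part of the ABI for this to work.
 */
struct nested_vmx_state {
	/* ... preceding vmcs12 fields ... */
	uint64_t ept_pointer;	/* EPTP that L1 programmed for L2 */
	/* ... remaining fields, padded out to 4096 bytes ... */
};

/* Extracting the EPTP becomes a plain field access. */
static uint64_t l2_eptp(const struct nested_vmx_state *state)
{
	return state->ept_pointer;
}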