> The vmcs01 does serve as a cache of L1 state at the time of VM-entry,
> so if we simply restored the vmcs01 state, that would take care of
> most of the rollback issues, as long as we don't deliver any
> interrupts in this context. However, I would like to see the
> vmcs01/vmcs02 separation go away at some point. svm.c seems to do fine
> with just one VMCB.

Interesting point, might make things easier for VMX.

>>> If the page tables have changed, or the L1 guest has overwritten the
>>> VMLAUNCH/VMRESUME instruction, then you're out of luck.
>>
>> Page tables getting changed by other CPUs is actually a good point. But
>> I would consider both as "theoretical" problems. At least compared to
>> the interrupt stuff, which could also happen on guests behaving in a
>> more sane way.
>
> My preference is for solutions that are architecturally correct,
> thereby solving the theoretical problems as well as the empirical
> ones. However, I grant that the Linux community leans the other way in
> general.

I usually agree, unless it makes the code horribly complicated without
any real benefit (e.g. for corner cases like this one: one CPU modifying
instruction text that another CPU is currently executing).

>>> 3. I'm assuming that you're planning to store most of the current L2
>>> state in the cached VMCS12, at least where you can. Even so, the next
>>> "VM-entry" can't perform any of the normal VM-entry actions that would
>>> clobber the current L2 state that isn't in the cached VMCS12 (e.g.
>>> processing the VM-entry MSR load list). So, you need to have a flag
>>> indicating that this isn't a real VM-entry. That's no better than
>>> carrying the nested_run_pending flag.
>>
>> Not sure if that would really be necessary (would have to look into the
>> details first). But sounds like nested_run_pending seems unavoidable on
>> x86.
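The idea conceded above can be sketched in a few lines of C. All names here are hypothetical stand-ins for illustration, not the actual KVM code: a nested_run_pending-style flag marks a "VM-entry" that merely completes an interrupted one (e.g. after save/restore), so the one-shot entry actions that would clobber live L2 state, such as processing the VM-entry MSR load list, must not be redone.

```c
#include <stdint.h>

/* Hypothetical names for illustration -- not the real KVM structures. */
#define NESTED_RUN_PENDING (1u << 0)

struct nested_state {
	uint32_t flags;
	int msr_loads_done;	/* stand-in for one-shot VM-entry work */
};

/*
 * If NESTED_RUN_PENDING is set, this entry only completes an
 * interrupted VM-entry, so skip the one-shot work (MSR load list
 * processing etc.) that has already happened; otherwise do it.
 */
static void nested_vmx_enter(struct nested_state *s)
{
	if (!(s->flags & NESTED_RUN_PENDING))
		s->msr_loads_done++;	/* real entry: do the one-shot work */
	s->flags &= ~NESTED_RUN_PENDING;	/* entry has now completed */
}
```

The point of the flag is exactly that the two cases are indistinguishable from the L2 state alone, which is why it has to be carried (and migrated) explicitly.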
>> So I'd better get used to QEMU dealing with nested CPU state (which
>> is somehow scary to me - an emulator getting involved in nested
>> execution - what could go wrong :) )
>
> For save/restore, QEMU doesn't actually have to know what the flag
> means. It just has to pass it on. (Our userspace agent doesn't know
> anything about the meaning of the flags in the kvm_vmx_state
> structure.)

I agree on save/restore. My comment was rather about involving an L0
emulator when running an L2 guest. But as Paolo pointed out, this can't
really be avoided.

-- 

Thanks,

David / dhildenb