On Thu, 2021-05-27 at 10:01 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@xxxxxxxxxx> writes:
>
> > On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
> > > Maxim Levitsky <mlevitsk@xxxxxxxxxx> writes:
> > >
> > > > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > > > Changes since v1 (Sean):
> > > > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> > > > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
> > > > >   copy_enlightened_to_vmcs12().
> > > > >
> > > > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> > > > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
> > > > > + WSL2) was crashing immediately after migration. It was also reported
> > > > > that we have more issues to fix as, while the failure rate was lowered
> > > > > significantly, it was still possible to observe crashes after several
> > > > > dozens of migrations. Turns out, the issue arises when we manage to issue
> > > > > KVM_GET_NESTED_STATE right after an L2->L1 VMEXIT but before L1 gets a
> > > > > chance to run. This state is tracked with the 'need_vmcs12_to_shadow_sync'
> > > > > flag, but the flag itself is not part of the saved nested state. A few
> > > > > other less significant issues are fixed along the way.
> > > > >
> > > > > While there's no proof this series fixes all eVMCS-related problems,
> > > > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> > > > > crashing in testing.
> > > > >
> > > > > Patches are based on the current kvm/next tree.
> > > > >
> > > > > Vitaly Kuznetsov (7):
> > > > >   KVM: nVMX: Introduce nested_evmcs_is_used()
> > > > >   KVM: nVMX: Release enlightened VMCS on VMCLEAR
> > > > >   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
> > > > >     vmx_get_nested_state()
> > > > >   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
> > > > >   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
> > > > >   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
> > > > >   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
> > > > >     lost
> > > > >
> > > > >  arch/x86/kvm/vmx/nested.c                      | 110 ++++++++++++------
> > > > >  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
> > > > >  2 files changed, 115 insertions(+), 59 deletions(-)
> > > > >
> > > >
> > > > Hi Vitaly!
> > > >
> > > > In addition to the review of this patch series,
> > >
> > > Thanks by the way!
> >
> > No problem!
> >
> > > > I would like
> > > > to share an idea on how to avoid the hack of mapping the evmcs
> > > > in nested_vmx_vmexit, because I think I found a possible generic
> > > > solution to this and similar issues:
> > > >
> > > > The solution is to always set nested_run_pending after
> > > > nested migration (which means that we won't really
> > > > need to migrate this flag anymore).
> > > >
> > > > I was thinking a lot about it and I think that there is no downside to this,
> > > > other than sometimes one extra vmexit after migration.
> > > >
> > > > Otherwise there is always a risk of the following scenario:
> > > >
> > > > 1. We migrate with nested_run_pending=0 (but don't restore all the state
> > > > yet, e.g. the HV_X64_MSR_VP_ASSIST_PAGE msr, or the guest memory map is
> > > > not up to date, or the guest is in SMM, or something like that).
> > > >
> > > > 2. Userspace calls some ioctl that causes a nested vmexit.
> > > >
> > > > This can happen today if the userspace calls
> > > > kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
> > > >
> > > > 3. Userspace finally sets the correct guest msrs and the correct guest
> > > > memory map, and only then calls KVM_RUN.
> > > >
> > > > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> > > > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> > > > but we have to do so to complete the nested vmexit.
> > >
> > > Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
> > > only one place which calls copy_vmcs12_to_enlightened():
> > > nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
> > > vmx_prepare_switch_to_guest() so unless userspace decided to execute a
> > > not-fully-restored guest this should not happen. (I'm probably missing
> > > something in your scenario)
> >
> > You are right!
> > The evmcs write is delayed to the next vmentry.
> >
> > However, since we are now mapping the evmcs during nested vmexit,
> > this can fail, for example when the HV assist msr is not up to date.
> >
> > For example consider this:
> >
> > 1. Userspace first sets nested state.
> > 2. Userspace calls KVM_GET_MP_STATE.
> > 3. The nested vmexit that happens in (2) will end up not being able to map
> > the evmcs, since the HV_ASSIST msr is not yet loaded.
> >
> > Also the vmcb write (that is, for SVM) _is_ done right away on nested vmexit
> > and conceptually has the same issue
> > (if the memory map is not up to date, we might not be able to read/write the
> > vmcb12 on nested vmexit).
> >
>
> It seems we have one correct way to restore a guest and a number of
> incorrect ones :-) It may happen that this is not even a nested-only
> thing (think about trying to restore caps, regs, msrs, cpuids, in a
> random sequence). I'd vote for documenting the right one somewhere, even
> if we'll just be extracting it from QEMU.
>
> > > > To some extent, the entry to nested mode after a migration is only complete
> > > > when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shouldn't interrupt it.
> > > >
> > > > This would allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
> > > > the nested vmexit path at all.
> > >
> > > Remember, we have three possible states when nested state is
> > > transferred:
> > > 1) L2 was running
> > > 2) L1 was running
> > > 3) We're in between L2 and L1 (need_vmcs12_to_shadow_sync = true).
> >
> > I understand. This suggestion wasn't meant to fix case 3, but rather
> > case 1, where we are in L2, migrate, and then immediately decide to
> > do a nested vmexit before we have processed the KVM_REQ_GET_NESTED_STATE_PAGES
> > request, and also potentially before the guest state was fully uploaded
> > (see that KVM_GET_MP_STATE thing).
> >
> > In a nutshell, I vote for not allowing nested vmexits from the moment
> > when we set the nested state until the moment we enter the nested
> > guest once (maybe with a request for an immediate vmexit),
> > because during this time period the guest state is not fully consistent.
> >
>
> Using 'nested_run_pending=1' perhaps? Or, we can get back to the 'vm_bugged'
> idea and kill the guest immediately if something forces such an exit.

Exactly, this is my idea: always set nested_run_pending=1 after the migration.
It shouldn't cause any issues and it would avoid cases like this.
That variable could then also be renamed to something like
'nested_vmexit_not_allowed'.
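
Roughly, I mean something along these lines (a completely untested sketch,
just to illustrate the idea; the hunk would go into vmx_set_nested_state()
and its SVM counterpart, and reusing the KVM_STATE_NESTED_GUEST_MODE check
here is only for illustration):

	/*
	 * Sketch: if the restored state says L2 was active, behave as if a
	 * nested run is pending regardless of what userspace reported, so
	 * that no synthetic nested vmexit can be injected before the first
	 * KVM_RUN has processed KVM_REQ_GET_NESTED_STATE_PAGES and the rest
	 * of the guest state.
	 */
	if (kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)
		vmx->nested.nested_run_pending = 1;

With that, KVM_STATE_NESTED_RUN_PENDING in the migration stream would
effectively become redundant.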

Paolo, what do you think?

Best regards,
	Maxim Levitsky