On Thu, Apr 07, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 22:52, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > Another option for saving and restoring a VM would be to add it to
> > > KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
> > > control area?).
> >
> > Ooh.  What if we keep nested_run_pending=true until the injection completes?  Then
> > we don't even need an extra flag because nested_run_pending effectively says that
> > any and all injected events are for L1=>L2.  In KVM_GET_NESTED_STATE, shove the
> > to-be-injected event into the normal vmc*12 injection field, and ignore all
> > to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
> >
> > That should work even for migrating to an older KVM, as keeping nested_run_pending
> > will cause the target to reprocess the event injection as if it were from nested
> > VM-Enter, which it technically is.
>
> I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
> mean *moving* the to-be-injected L1 -> L2 event back from KVM-internal data
> structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
> VMCB12 EVENTINJ field (or its VMX equivalent).
>
> But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
> the moving), and only then KVM_GET_VCPU_EVENTS (which will then no longer show
> these events as pending).
> And their setters in the opposite order when restoring the VM.

I wasn't thinking of actually moving things in the source VM, only ignoring events
in KVM_GET_VCPU_EVENTS.  Getting state shouldn't be destructive, e.g. the source
VM should still be able to continue running.

Ahahahaha, and actually looking at the code, there's this gem in KVM_GET_VCPU_EVENTS:

	/*
	 * The API doesn't provide the instruction length for software
	 * exceptions, so don't report them. As long as the guest RIP
	 * isn't advanced, we should expect to encounter the exception
	 * again.
	 */
	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
		events->exception.injected = 0;
		events->exception.pending = 0;
	}

and again for soft interrupts:

	events->interrupt.injected =
		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;

so through KVM's own incompetency, it's already doing half the work.

This is roughly what I had in mind.  It will "require" moving nested_run_pending
to kvm_vcpu_arch, but I've been itching for an excuse to do that anyway.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb71727acecb..62c48f6a0815 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4846,6 +4846,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 					       struct kvm_vcpu_events *events)
 {
+	bool drop_injected_events = vcpu->arch.nested_run_pending;
+
 	process_nmi(vcpu);

 	if (kvm_check_request(KVM_REQ_SMI, vcpu))
@@ -4872,7 +4874,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	 * isn't advanced, we should expect to encounter the exception
 	 * again.
 	 */
-	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
+	if (drop_injected_events ||
+	    kvm_exception_is_soft(vcpu->arch.exception.nr)) {
 		events->exception.injected = 0;
 		events->exception.pending = 0;
 	} else {
@@ -4893,13 +4896,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	events->exception_has_payload = vcpu->arch.exception.has_payload;
 	events->exception_payload = vcpu->arch.exception.payload;

-	events->interrupt.injected =
-		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
+	events->interrupt.injected = vcpu->arch.interrupt.injected &&
+				     !vcpu->arch.interrupt.soft &&
+				     !drop_injected_events;
 	events->interrupt.nr = vcpu->arch.interrupt.nr;
 	events->interrupt.soft = 0;
 	events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);

-	events->nmi.injected = vcpu->arch.nmi_injected;
+	events->nmi.injected = vcpu->arch.nmi_injected && !drop_injected_events;
 	events->nmi.pending = vcpu->arch.nmi_pending != 0;
 	events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu);
 	events->nmi.pad = 0;
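
The KVM_GET_NESTED_STATE half would then synthesize the VMCB12 EVENTINJ value
from whatever event is queued for L1=>L2 while nested_run_pending is set, i.e.
the same trick nested_save_pending_event_to_vmcb12() already pulls for
EXITINTINFO on nested VM-Exit.  Completely untested sketch of what the SVM side
could look like; the helper name is made up, and a real version would also need
to stuff the error code into the separate event_inj_err field:

	/*
	 * Illustrative only, not actual KVM code: rebuild a VMCB12 EVENTINJ
	 * value from the event queued for L1=>L2 injection, so that the SVM
	 * implementation of KVM_GET_NESTED_STATE could report it in the
	 * saved VMCB12 control area while nested_run_pending is set.
	 */
	static u32 nested_svm_pending_event_to_eventinj(struct kvm_vcpu *vcpu)
	{
		u32 event_inj = 0;

		if (vcpu->arch.exception.injected || vcpu->arch.exception.pending) {
			event_inj = vcpu->arch.exception.nr |
				    SVM_EVTINJ_TYPE_EXEPT | SVM_EVTINJ_VALID;
			if (vcpu->arch.exception.has_error_code)
				event_inj |= SVM_EVTINJ_VALID_ERR;
		} else if (vcpu->arch.nmi_injected) {
			/* Hardware ignores the vector field for NMIs. */
			event_inj = SVM_EVTINJ_TYPE_NMI | SVM_EVTINJ_VALID;
		} else if (vcpu->arch.interrupt.injected) {
			event_inj = vcpu->arch.interrupt.nr | SVM_EVTINJ_VALID |
				    (vcpu->arch.interrupt.soft ?
				     SVM_EVTINJ_TYPE_SOFT : SVM_EVTINJ_TYPE_INTR);
		}

		return event_inj;
	}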