On 01/04/21 16:38, Maxim Levitsky wrote:
+static int kvm_do_deliver_pending_exception(struct kvm_vcpu *vcpu)
+{
+	int class1, class2, ret;
+
+	/* try to deliver current pending exception as VM exit */
+	if (is_guest_mode(vcpu)) {
+		ret = kvm_x86_ops.nested_ops->deliver_exception_as_vmexit(vcpu);
+		if (ret || !vcpu->arch.pending_exception.valid)
+			return ret;
+	}
+
+	/* No injected exception, so just deliver the payload and inject it */
+	if (!vcpu->arch.injected_exception.valid) {
+		trace_kvm_inj_exception(vcpu->arch.pending_exception.nr,
+					vcpu->arch.pending_exception.has_error_code,
+					vcpu->arch.pending_exception.error_code);
+queue:
If you move the queue label to the top of the function, you can "goto
queue" for #DF as well, and you don't need to call
kvm_do_deliver_pending_exception again.  In fact you can merge this
function and kvm_deliver_pending_exception completely:

static int kvm_deliver_pending_exception_as_vmexit(struct kvm_vcpu *vcpu)
{
	WARN_ON(!vcpu->arch.pending_exception.valid);

	if (is_guest_mode(vcpu))
		return kvm_x86_ops.nested_ops->deliver_exception_as_vmexit(vcpu);
	else
		return 0;
}

static int kvm_merge_injected_exception(struct kvm_vcpu *vcpu)
{
	/*
	 * First check if the pending exception takes precedence
	 * over the injected one, which will be reported in the
	 * vmexit info.
	 */
	ret = kvm_deliver_pending_exception_as_vmexit(vcpu);
	if (ret || !vcpu->arch.pending_exception.valid)
		return ret;

	if (vcpu->arch.injected_exception.nr == DF_VECTOR) {
		...
		return 0;
	}

	...
	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY) ||
	    (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
		...
	}

	vcpu->arch.injected_exception.valid = false;
}

static int kvm_deliver_pending_exception(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.pending_exception.valid)
		return 0;

	if (vcpu->arch.injected_exception.valid)
		kvm_merge_injected_exception(vcpu);

	ret = kvm_deliver_pending_exception_as_vmexit(vcpu);
	if (ret || !vcpu->arch.pending_exception.valid)
		return ret;

	trace_kvm_inj_exception(vcpu->arch.pending_exception.nr,
				vcpu->arch.pending_exception.has_error_code,
				vcpu->arch.pending_exception.error_code);
	...
}

Note that if the pending exception is a page fault, its payload must be
delivered to CR2 before converting it to a double fault.

Going forward to vmx.c:
 	if (mtf_pending) {
 		if (block_nested_events)
 			return -EBUSY;
+		nested_vmx_update_pending_dbg(vcpu);
Should this instead be "WARN_ON(vmx_pending_dbg_trap(vcpu));", since the pending-#DB-plus-MTF combination is handled in vmx_deliver_exception_as_vmexit?...
+
+	if (vmx->nested.mtf_pending && vmx_pending_dbg_trap(vcpu)) {
+		/*
+		 * A pending monitor trap takes precedence over pending
+		 * debug exception which is 'stashed' into
+		 * 'GUEST_PENDING_DBG_EXCEPTIONS'
+		 */
+
+		nested_vmx_update_pending_dbg(vcpu);
+		vmx->nested.mtf_pending = false;
+		nested_vmx_vmexit(vcpu, EXIT_REASON_MONITOR_TRAP_FLAG, 0, 0);
+		return 0;
+	}
... though this is quite ugly, even more so if you add the case of an INIT
with a pending #DB.  The problem is that INIT and MTF have higher priority
than debug exceptions.  The good thing is that an INIT or MTF plus a #DB
cannot happen with nested_run_pending == 1, so the #DB will always be
injected right away.

There is precedent for KVM_GET_* modifying the CPU state; in particular,
KVM_GET_MPSTATE can modify CS and RIP and even cause a vmexit via
kvm_apic_accept_events.  And in fact, because kvm_apic_accept_events calls
kvm_check_nested_events, calling it from KVM_GET_VCPU_EVENTS would fix the
problem: the injected exception would go into the IDT-vectored exit field,
while the pending exception would go into GUEST_PENDING_DBG_EXCEPTIONS and
disappear.

However, you cannot call kvm_apic_accept_events twice, because there would
be a race with INIT: a #DB exception could first be stored by
KVM_GET_VCPU_EVENTS, and then disappear when kvm_apic_accept_events is
called again from KVM_GET_MPSTATE.  Fortunately, the correct order for the
KVM_GET_* ioctls is first KVM_GET_VCPU_EVENTS and then KVM_GET_MPSTATE.
So perhaps, instead of calling kvm_deliver_pending_exception on vmexit,
KVM_GET_VCPU_EVENTS could call kvm_apic_accept_events (and thus
kvm_check_nested_events) instead of KVM_GET_MPSTATE.

In addition, nested_ops.check_events would be split in two: high-priority
events (triple fault, now in kvm_check_nested_events; INIT; MTF) would
remain in the first, while interrupts would move to the second,
tentatively named check_interrupts.

I'll take a look at cleaning up the current mess we have around
kvm_apic_accept_events, where most of the calls do not have to handle
guest mode at all.

Thanks for this work and for showing that it's possible to fix the
underlying mess with exception vmexits.  It may be a bit more complicated
than this, but it's a great start.

Paolo