On Sun, Mar 17, 2013 at 04:02:07PM +0100, Jan Kiszka wrote:
> On 2013-03-17 14:45, Gleb Natapov wrote:
> > On Sat, Mar 16, 2013 at 11:23:16AM +0100, Jan Kiszka wrote:
> >> From: Jan Kiszka <jan.kiszka@xxxxxxxxxxx>
> >>
> >> The basic idea is to always transfer the pending event injection on
> >> vmexit into the architectural state of the VCPU and then drop it from
> >> there if it turns out that we left L2 to enter L1.
> >>
> >> VMX and SVM are now identical in how they recover event injections from
> >> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >> still contains a valid event and, if yes, transfer the content into L1's
> >> idt_vectoring_info_field.
> >>
> > But how can this happen with the VMX code? VMX has this nested_run_pending
> > thing that prevents #vmexit emulation from happening without vmlaunch.
> > This means that VM_ENTRY_INTR_INFO_FIELD should never be valid during
> > #vmexit emulation since it is marked invalid during vmlaunch.
>
> Now that nmi/interrupt_allowed is strict wrt nested_run_pending again,
> it may indeed no longer happen. It was definitely a problem before, also
> with direct vmexit on pending INIT. Requires a second thought, maybe
> also a WARN_ON(vmx->nested.nested_run_pending) in nested_vmx_vmexit.
>
> >
> >> However, we differ on how to deal with events that L0 wanted to inject
> >> into L2. Likely, this case is still broken in SVM. For VMX, the function
> >> vmcs12_save_pending_events deals with transferring pending L0 events
> >> into the queue of L1. That is mandatory as L1 may decide to switch the
> >> guest state completely, invalidating or preserving the pending events
> >> for later injection (including on a different node, once we support
> >> migration).
> >>
> >> Note that we treat directly injected NMIs differently as they can hit
> >> both L1 and L2. In this case, we let L0 try the injection again also
> >> over L1 after leaving L2.
> >>
> > Hmm, where does the SDM say NMI behaves this way?
>
> NMIs are only blocked in root mode if we took an NMI-related vmexit (or,
> of course, an NMI is being processed). Thus, every arriving NMI can
> either hit the guest or the host - pure luck.
>
> However, I have missed the fact that an NMI may have been injected from
> L1 as well. If injection triggers a vmexit, that NMI could now leak into
> L1. So we have to process them as well in vmcs12_save_pending_events.
>
You mean "should not leak into L0", not L1?

> >
> >> To avoid incorrectly leaking an event that L1 wants to inject into the
> >> architectural VCPU state, we skip cancellation on nested run.
> >>
> > How can the leak happen?
>
> See above, this likely no longer applies.
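
Regarding the WARN_ON(vmx->nested.nested_run_pending) you suggest above:
just so we mean the same thing, I would picture it at the top of
nested_vmx_vmexit(), roughly like this (untested sketch, the rest of the
function elided):

	static void nested_vmx_vmexit(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		/*
		 * With nmi/interrupt_allowed strict wrt nested_run_pending,
		 * we should never emulate a vmexit while the vmlaunch/
		 * vmresume requested by L1 is still pending.
		 */
		WARN_ON(vmx->nested.nested_run_pending);
		...
	}
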
>
> >
> >> Signed-off-by: Jan Kiszka <jan.kiszka@xxxxxxxxxxx>
> >> ---
> >>  arch/x86/kvm/vmx.c |  118 ++++++++++++++++++++++++++++++++++++++--------------
> >>  1 files changed, 87 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 126d047..ca74358 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -6492,8 +6492,6 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> >>
> >>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >>  {
> >> -	if (is_guest_mode(&vmx->vcpu))
> >> -		return;
> >>  	__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
> >>  				  VM_EXIT_INSTRUCTION_LEN,
> >>  				  IDT_VECTORING_ERROR_CODE);
> >> @@ -6501,7 +6499,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >>
> >>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> >>  {
> >> -	if (is_guest_mode(vcpu))
> >> +	if (to_vmx(vcpu)->nested.nested_run_pending)
> >>  		return;
> >>  	__vmx_complete_interrupts(vcpu,
> >>  				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> >> @@ -6534,21 +6532,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>  	unsigned long debugctlmsr;
> >>
> >> -	if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> >> -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >> -		if (vmcs12->idt_vectoring_info_field &
> >> -				VECTORING_INFO_VALID_MASK) {
> >> -			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> >> -				vmcs12->idt_vectoring_info_field);
> >> -			vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> >> -				vmcs12->vm_exit_instruction_len);
> >> -			if (vmcs12->idt_vectoring_info_field &
> >> -					VECTORING_INFO_DELIVER_CODE_MASK)
> >> -				vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> >> -					vmcs12->idt_vectoring_error_code);
> >> -		}
> >> -	}
> >> -
> >>  	/* Record the guest's net vcpu time for enforced NMI injections. */
> >>  	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
> >>  		vmx->entry_time = ktime_get();
> >> @@ -6707,17 +6690,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >>
> >>  	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> >>
> >> -	if (is_guest_mode(vcpu)) {
> >> -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >> -		vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
> >> -		if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
> >> -			vmcs12->idt_vectoring_error_code =
> >> -				vmcs_read32(IDT_VECTORING_ERROR_CODE);
> >> -			vmcs12->vm_exit_instruction_len =
> >> -				vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> >> -		}
> >> -	}
> >> -
> >>  	vmx->loaded_vmcs->launched = 1;
> >>
> >>  	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> >> @@ -7324,6 +7296,52 @@ vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> >>  			vcpu->arch.cr4_guest_owned_bits));
> >>  }
> >>
> >> +static void vmcs12_save_pending_events(struct kvm_vcpu *vcpu,
> >> +				       struct vmcs12 *vmcs12)
> >> +{
> >> +	u32 idt_vectoring;
> >> +	unsigned int nr;
> >> +
> >> +	/*
> >> +	 * We only transfer exceptions and maskable interrupts. It is fine if
> >> +	 * L0 retries to inject a pending NMI over L1.
> >> +	 */
> >> +	if (vcpu->arch.exception.pending) {
> >> +		nr = vcpu->arch.exception.nr;
> >> +		idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
> >> +
> >> +		if (kvm_exception_is_soft(nr)) {
> >> +			vmcs12->vm_exit_instruction_len =
> >> +				vcpu->arch.event_exit_inst_len;
> >> +			idt_vectoring |= INTR_TYPE_SOFT_EXCEPTION;
> >> +		} else
> >> +			idt_vectoring |= INTR_TYPE_HARD_EXCEPTION;
> >> +
> >> +		if (vcpu->arch.exception.has_error_code) {
> >> +			idt_vectoring |= VECTORING_INFO_DELIVER_CODE_MASK;
> >> +			vmcs12->idt_vectoring_error_code =
> >> +				vcpu->arch.exception.error_code;
> >> +		}
> >> +
> >> +		vmcs12->idt_vectoring_info_field = idt_vectoring;
> >> +	} else if (vcpu->arch.interrupt.pending) {
> >> +		nr = vcpu->arch.interrupt.nr;
> >> +		idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
> >> +
> >> +		if (vcpu->arch.interrupt.soft) {
> >> +			idt_vectoring |= INTR_TYPE_SOFT_INTR;
> >> +			vmcs12->vm_entry_instruction_len =
> >> +				vcpu->arch.event_exit_inst_len;
> >> +		} else
> >> +			idt_vectoring |= INTR_TYPE_EXT_INTR;
> >> +
> >> +		vmcs12->idt_vectoring_info_field = idt_vectoring;
> >> +	}
> >> +
> >> +	kvm_clear_exception_queue(vcpu);
> >> +	kvm_clear_interrupt_queue(vcpu);
> >> +}
> >> +
> >>  /*
> >>   * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
> >>   * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
> >> @@ -7415,9 +7433,47 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> >>  	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> >>  	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> >>
> >> -	/* clear vm-entry fields which are to be cleared on exit */
> >> -	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> >> +	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)) {
> >> +		if ((vmcs12->vm_entry_intr_info_field &
> >> +		     INTR_INFO_VALID_MASK) &&
> >> +		    (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> >> +		     INTR_INFO_VALID_MASK)) {
> > Again I do not see how this condition can be true.
> >
> >> +			/*
> >> +			 * Preserve the event that was supposed to be injected
> >> +			 * by L1 via emulating that it would have been returned
> >> +			 * in IDT_VECTORING_INFO_FIELD.
> >> +			 */
> >> +			vmcs12->idt_vectoring_info_field =
> >> +				vmcs12->vm_entry_intr_info_field;
> >> +			vmcs12->idt_vectoring_error_code =
> >> +				vmcs12->vm_entry_exception_error_code;
> >> +			vmcs12->vm_exit_instruction_len =
> >> +				vmcs12->vm_entry_instruction_len;
> >> +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> >> +
> >> +			/*
> >> +			 * We do not drop NMIs that targeted L2 below as they
> >> +			 * can also be reinjected over L1. But if this event
> >> +			 * was an NMI, it was synthetic and came from L1.
> >> +			 */
> >> +			vcpu->arch.nmi_injected = false;
> >> +		} else
> >> +			/*
> >> +			 * Transfer the event L0 may have wanted to inject into
> >> +			 * L2 to IDT_VECTORING_INFO_FIELD.
> >> +			 */
> > I do not understand the comment. This transfers an event from the event queue into vmcs12.
> > Since vmx_complete_interrupts() transfers an event that L1 tried to inject
> > into the event queue too, here we handle not only L0->L2 but also L1->L2
> > events.
>
> I'm not sure if I fully understand your remark. Is it that the comment
> is only talking about L0 events? That is indeed not fully true, L1
> events should make it to the architectural queue as well. Will adjust this.
>
Yes, I was referring to the comment mentioning L0 only.

> > In fact I think only the "else" part of this if() is needed.
>
> Yes, probably.
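
And about the NMI that L1 injected and that must not be left for L0 to
re-inject after the switch back to L1: I assume the additional handling in
vmcs12_save_pending_events would look roughly like this (untested sketch,
exception/interrupt branches as in your patch; whether the transfer should
be limited to NMIs that really came from L1 is a separate question):

	if (vcpu->arch.exception.pending) {
		...
	} else if (vcpu->arch.nmi_injected) {
		/*
		 * Reflect the pending NMI into idt_vectoring_info_field so
		 * that L1 sees it, instead of L0 re-injecting it after the
		 * switch back to L1.
		 */
		vmcs12->idt_vectoring_info_field = INTR_TYPE_NMI_INTR |
			VECTORING_INFO_VALID_MASK | NMI_VECTOR;
		vcpu->arch.nmi_injected = false;
	} else if (vcpu->arch.interrupt.pending) {
		...
	}
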
>
> Thanks,
> Jan
>

--
			Gleb.