On Wed, May 11, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > On Mon, May 09, 2011 at 02:18:25PM +0300, Avi Kivity wrote: >.. > > We can probably merge the next iteration if there aren't significant > > comments from others. >.. > > The only worrying thing is the issue you raise in patch 8. Is there > > a simple fix you can push that addresses correctness? Hi Avi, I have fixed the few issues you raised in this round, including a rewritten patch 8 that doesn't leave anything as a TODO. I am ready to send another round of patches once we decide what to do about the issue raised by Gleb yesterday: > I still feel that interrupt injection path should be reworked to be > SVM like before merging the code. Both I and my colleague Abel Gordon reviewed the relevant code-path in nested VMX and nested SVM, and carefully (re)read the relevant parts of the VMX spec, and we have come to several conclusions. Avi and Gleb, I'd love to hear your comments on these issues: Our first conclusion is that my description of our interrupt injection patch was unduely negative. It wrongly suggested that our solution of using the interrupt window was inefficient, and was just an ad-interim plug until a cleaner solution could be written. This was wrong. In fact, the somewhat strange exit and interrupt window that happens when injecting an interrupt to L2 are *necessary* for accurate emulation. I explain why this is so in the new version of this patch, which is included below. Please take a look. Our second conclusion (and I hope that I'm not offending anyone here) is that the changes for L2 interrupt injection in both SVM and VMX are both ugly - they are just ugly in different ways. Both modified the non-nested code in strange places in strange and unexpected ways, and tried to circumvent the usual code path in x86.c without touching x86.c. They just did this in two slightly different ways, neither (I think) is inherently uglier than the other: For accurate emulation (as I explain in the patch below), both codes need to cause x86.c to change its normal behavior: It checks for interrupt_allowed() and then (discovering that it isn't) enable_irq_window(). We want it to instead exit to L1, and then enable the irq window on that. In the SVM code, interrupt_allowed() is modified to always return false if nested, and enable_irq_window() is modified to flag for an exit to L1 (which is performed later) and turn on the interrupt window. In VMX, we modify the same places but differently: In interrupt_allowed() we exit to L1 immediately (it's a short operation, we didn't mind to do it in atomic context), and enable_irq_window() doesn't need to be changed (it already runs in L1). Continuing to survey the difference between nested VMX and and SVM, there were other different choices made besides the ones mentioned above. nested SVM uses an additional trick, of skipping one round of running the guest, when it discovered the need for an exit in the "wrong" place, so it can get to the "right" place again. Nested VMX solved the same problems with other mechanisms, like a separate piece of code for handling IDT_VECTORING_INFO, and nested_run_pending. Some differences can also be explained by the different design of (non-nested) vmx.c vs svm.c - e.g., svm_complete_interrupts() is called during the handle_exit(), while vmx_complete_interrupts() is called after handle_exit() has completed (in atomic context) - this is one of the reasons the nested IDT_VECTORING_INFO path is different. I think that both solutions are far from being beautiful or easy to understand. Nested SVM is perhaps slightly less ugly but also has a small performance cost (with the extra vcpu_run iteration doing nothing) - and I think neither is inherently better than the other. So I guess my question is, and Avi and Gleb I'd love your comments about this question: Is it really beneficial that I rewrite the "ugly" nested-VMX injection code to be somewhat-ugly in exactly the same way that nested-SVM injection code? Won't it be more beneficial to rewrite *both* codes to be cleaner? This would probably mean changes to the common x86.c, that both will use. For example, x86.c's injection code could check the nested case itself, perhaps calling a special x86_op to handle the nested injection (exit, set interrupt window, etc.) instead of calling the regular interrupt_allowed/enable_irq_window and forcing those to be modified in mysterious ways. Now that there's a is_guest_mode(vcpu) function, more nested-related code can be moved to x86.c, to make both nested VMX and nested SVM code cleaner. Waiting to hear your opinions and suggestions, Thanks, Nadav. ============================================================= Subject: [PATCH 23/31] nVMX: Correct handling of interrupt injection The code in this patch correctly emulates external-interrupt injection while a nested guest L2 is running. Because of this code's relative oddity and un-obviousness, I include here a longer-than-usual justification for what it does - longer than the code itself ;-) To understand how to correctly emulate interrupt injection while L2 is running, let's look first at what we need to emulate: How would things look like if the extra L0 hypervisor layer is removed, and instead of L0 injecting an interrupt we have hardware delivering an interrupt? Now L1 runs on bare metal, with a guest L2 and the hardware generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what happens now is this: The processor exits from L2 to L1, with an external-interrupt exit reason but without an interrupt vector. L1 runs, with interrupts disabled, and it doesn't yet know what the interrupt was. Soon after, it enables interrupts and only at that moment, it gets the interrupt from the processor. when L1 is KVM, Linux handles this interrupt. Now we need exactly the same thing to happen when that L1->L2 system runs on top of L0, instead of real hardware. This is how we do this: When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with external-interrupt exit reason (without an interrupt vector), and run L1. Just like in the bare metal case, it likely can't deliver the interrupt to L1 now because it is running with interrupts disabled, in which case it turns on the interrupt window when running L1 after the exit. L1 will soon enable interrupts, and at that point L0 will gain control again and inject the interrupt to L1. Finally, there is an extra complication in the code: when nested_run_pending, we cannot return to L1 now, and must launch L2. We need to remember the interrupt we wanted to inject (and not clear it now), and do it on the next exit. The above explanation shows that the relative strangeness of the nested interrupt injection code in this patch, and the extra interrupt-window exit incurred, are in fact necessary for accurate emulation, and are not just an unoptimized implementation. Let's revisit now the two assumptions made above: If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know does, by the way), things are simple: L0 may inject the interrupt directly to the L2 guest - using the normal code path that injects to any guest. We support this case in the code below. If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know does), things look very different from the description above: L1 expects to see an exit from L2 with the interrupt vector already filled in the exit information, and does not expect to be interrupted again with this interrupt. The current code does not (yet) support this case, so we do not allow the VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1. Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx> --- arch/x86/kvm/vmx.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) --- .before/arch/x86/kvm/vmx.c 2011-05-12 17:39:33.000000000 +0300 +++ .after/arch/x86/kvm/vmx.c 2011-05-12 17:39:33.000000000 +0300 @@ -3703,9 +3703,25 @@ out: return ret; } +/* + * In nested virtualization, check if L1 asked to exit on external interrupts. + * For most existing hypervisors, this will always return true. + */ +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu) +{ + return get_vmcs12(vcpu)->pin_based_vm_exec_control & + PIN_BASED_EXT_INTR_MASK; +} + static void enable_irq_window(struct kvm_vcpu *vcpu) { u32 cpu_based_vm_exec_control; + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) + /* We can get here when nested_run_pending caused + * vmx_interrupt_allowed() to return false. In this case, do + * nothing - the interrupt will be injected later. + */ + return; cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING; @@ -3828,6 +3844,15 @@ static void vmx_set_nmi_mask(struct kvm_ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) { + if (to_vmx(vcpu)->nested.nested_run_pending) + return 0; + nested_vmx_vmexit(vcpu); + get_vmcs12(vcpu)->vm_exit_reason = + EXIT_REASON_EXTERNAL_INTERRUPT; + /* fall through to normal code, but now in L1, not L2 */ + } + return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)); @@ -5515,6 +5540,14 @@ static int vmx_handle_exit(struct kvm_vc if (vmx->emulation_required && emulate_invalid_guest_state) return handle_invalid_guest_state(vcpu); + /* + * the KVM_REQ_EVENT optimization bit is only on for one entry, and if + * we did not inject a still-pending event to L1 now because of + * nested_run_pending, we need to re-enable this bit. + */ + if (vmx->nested.nested_run_pending) + kvm_make_request(KVM_REQ_EVENT, vcpu); + if (exit_reason == EXIT_REASON_VMLAUNCH || exit_reason == EXIT_REASON_VMRESUME) vmx->nested.nested_run_pending = 1; @@ -5712,6 +5745,8 @@ static void __vmx_complete_interrupts(st static void vmx_complete_interrupts(struct vcpu_vmx *vmx) { + if (is_guest_mode(&vmx->vcpu)) + return; __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info, VM_EXIT_INSTRUCTION_LEN, IDT_VECTORING_ERROR_CODE); @@ -5719,6 +5754,8 @@ static void vmx_complete_interrupts(stru static void vmx_cancel_injection(struct kvm_vcpu *vcpu) { + if (is_guest_mode(vcpu)) + return; __vmx_complete_interrupts(to_vmx(vcpu), vmcs_read32(VM_ENTRY_INTR_INFO_FIELD), VM_ENTRY_INSTRUCTION_LEN, -- Nadav Har'El | Thursday, May 12 2011, 8 Iyyar 5771 nyh@xxxxxxxxxxxxxxxxxxx |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |Linux: Because rebooting is for adding http://nadav.harel.org.il |new hardware. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html