On 25/05/20 17:13, Maxim Levitsky wrote:
> With all this said, this is what is happening:
> 
> 1. The host sets an interrupt window. It needs the interrupt window
> because (I guess) the guest disabled interrupts and it waits till
> interrupts are enabled to inject the interrupt.
> 
> To be honest this is a VMX limitation, and on SVM (I think, according
> to the spec) you can inject interrupts whenever you want, and if the
> guest can't accept the interrupt, the vector written to the EVENTINJ
> field will be moved to V_INTR_VECTOR and the V_INTR flag will be set,

Not exactly, EVENTINJ ignores the interrupt flag and everything else.
But yes, you can inject the interrupt any time you want using V_IRQ +
V_INTR_VECTOR and it will be injected by the processor.  This is a
very interesting feature but it doesn't fit very well in the KVM
architecture, so we're not using it.  Hyper-V does (and it is also why
it broke mercilessly).

> 2. Since SVM doesn't really have a concept of an interrupt window
> intercept, this is faked by setting V_INTR, as if an injected (or, as
> they call it, virtual) interrupt were pending, together with an
> intercept of virtual interrupts,

Correct.

> 4. After we enter the nested guest, we eventually get a VMexit due
> to an unrelated reason and we sync the V_INTR that *we* set
> to the nested vmcb, since in theory that flag could have been set
> by the CPU itself, if the guest itself used EVENTINJ to inject
> an event into its nested guest while the nested guest didn't have
> interrupts enabled (KVM doesn't do this, but we can't assume that)

I suppose you mean in sync_nested_vmcb_control.  Here, in the latest
version, we have:

	mask = V_IRQ_MASK | V_TPR_MASK;
	if (!(svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK) &&
	    is_intercept(svm, INTERCEPT_VINTR)) {
		/*
		 * In order to request an interrupt window, L0 is usurping
		 * svm->vmcb->control.int_ctl and possibly setting V_IRQ
		 * even if it was clear in L1's VMCB.  Restoring it would be
		 * wrong.  However, in this case V_IRQ will remain true until
		 * interrupt_window_interception calls svm_clear_vintr and
		 * restores int_ctl.  We can just leave it aside.
		 */
		mask &= ~V_IRQ_MASK;
	}

and svm_clear_vintr will copy V_IRQ, V_INTR_VECTOR, etc. back from the
nested vmcb into svm->vmcb (the svm_set_vintr/svm_clear_vintr pair is
sketched below).  Is that enough to fix the bug?

> 5. At that point the bomb is ticking. Once the guest is done dealing
> with the nested guest's VMexit and executes VMRUN, we enter the
> nested guest with V_INTR enabled. The VINTR intercept is disabled,
> since we disabled our interrupt window long ago; the guest also
> currently doesn't enable any interrupt window, so we basically inject
> into the guest whatever is there in V_INTR_VECTOR in the nested
> guest's VMCB.

Yep, this sounds correct.  The question is whether it can still happen
in the latest version of the code, where I tried to think more about
who owns which int_ctl bits when.

> Now that I am thinking about this, the issue is deeper than I thought
> and it stems from the abuse of V_INTR on AMD. IMHO the best solution
> is to avoid interrupt windows on AMD unless really needed (like with
> level-triggered interrupts or so)

Yes, that is the root cause.  However I'm not sure it would be that
much simpler if we didn't abuse VINTR like that, because INT_CTL is a
very complicated field.
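To make the ownership question concrete, the svm_set_vintr/
svm_clear_vintr pair mentioned above looks roughly like this (a
from-memory sketch of what is on the branch, not necessarily the
exact code; V_IRQ_INJECTION_BITS_MASK stands here for the V_IRQ,
V_INTR_PRIO and V_INTR_VECTOR bits of int_ctl):

	static void svm_set_vintr(struct vcpu_svm *svm)
	{
		set_intercept(svm, INTERCEPT_VINTR);

		/*
		 * This is only a dummy V_IRQ so that we get a VINTR vmexit
		 * when the interrupt window opens; the interrupt itself is
		 * injected with EVENTINJ afterwards.
		 */
		svm->vmcb->control.int_vector = 0x0;
		svm->vmcb->control.int_ctl &= ~V_INTR_PRIO_MASK;
		svm->vmcb->control.int_ctl |= V_IRQ_MASK |
			(0xf << V_INTR_PRIO_SHIFT);
		mark_dirty(svm->vmcb, VMCB_INTR);
	}

	static void svm_clear_vintr(struct vcpu_svm *svm)
	{
		clr_intercept(svm, INTERCEPT_VINTR);

		/* Drop the int_ctl bits that L0 usurped...  */
		svm->vmcb->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;

		/* ...and, while L2 is running, put back L1's values.  */
		if (is_guest_mode(&svm->vcpu))
			svm->vmcb->control.int_ctl |=
				svm->nested.ctl.int_ctl &
				V_IRQ_INJECTION_BITS_MASK;

		mark_dirty(svm->vmcb, VMCB_INTR);
	}

So after interrupt_window_interception runs, L1's own view of V_IRQ
and V_INTR_VECTOR is back in svm->vmcb, and the sync code above can
leave those bits aside.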
> Now the problem is that it is next to impossible to know the source
> of the VINTR pending flag. Even if we remember that the host has
> currently set up an interrupt window, the guest afterwards could have
> used EVENTINJ + an interrupt-disabled nested guest to raise that flag
> as well, and might need to know about it.

Actually it is possible!  is_intercept tests L0's VINTR intercept (see
get_host_vmcb in svm.h, sketched in the P.S. below), and that will be
true if and only if we are abusing the V_IRQ/V_INTR_PRIO/V_INTR_VECTOR
fields.

Furthermore, L0 should use VINTR during nested VMRUN only if both of
the following conditions are true:

- L1 is not using V_INTR_MASKING

- L0 has a pending interrupt (of course)

This is because, when virtual interrupts are in use, you can inject
physical interrupts into L1 at any time by taking an EXIT_INTR vmexit.

My theory as to why the bug could happen involved a race between the
call to kvm_x86_ops.interrupt_allowed(vcpu, true) in
inject_pending_event and the call to kvm_cpu_has_injectable_intr in
vcpu_enter_guest.  Again, that one should be fixed in the latest
version on the branch, but there could be more than one bug!

> I have an idea on how to fix this, which is about 99% correct and
> will only fail if the guest attempts something that is undefined
> anyway.
> 
> Let's set the vector of the fake VINTR to some reserved exception
> value rather than 0 (which the guest is never supposed to inject into
> the nested guest), so then we will know if the VINTR is from our fake
> injection or from the guest itself.  If it is our VINTR then we will
> not sync it to the guest.  In theory it can even be 0, since
> exceptions should never be injected as interrupts anyway; that is
> also a reserved operation.

Unfortunately these are interrupts, not exceptions.  You _can_
configure the PIC (or probably the IOAPIC too) to inject vectors in
the 0-31 range.  Are you old enough to remember INT 8 as the timer
interrupt? :)

Thanks very much for the detective work though!  You made a good
walkthrough overall, so you definitely understood a good part of the
code.

Paolo
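P.S. Since I pointed you at it, the is_intercept/get_host_vmcb pair is
roughly the following (again only a sketch of svm.h as it is on the
branch, so take the details with a grain of salt):

	static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm)
	{
		/* While L2 runs, L0's own copy lives in the host save area. */
		if (is_guest_mode(&svm->vcpu))
			return svm->nested.hsave;
		else
			return svm->vmcb;
	}

	static inline bool is_intercept(struct vcpu_svm *svm, int bit)
	{
		/* Test L0's intercept bit, not the merged L0+L1 word.  */
		return (get_host_vmcb(svm)->control.intercept &
			(1ULL << bit)) != 0;
	}

That is why the test in sync_nested_vmcb_control is true if and only
if L0, not L1, is the one that requested the interrupt window.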