On 25/05/20 17:13, Maxim Levitsky wrote:
> With all this said, this is what is happening:
> 
> 1. The host sets an interrupt window. It needs the interrupt window
> because (I guess) the guest disabled interrupts and it waits till
> interrupts are enabled to inject the interrupt.
> 
> To be honest this is a VMX limitation, and on SVM (I think, according
> to the spec) you can inject interrupts whenever you want, and if the
> guest can't accept the interrupt, the vector written to the EVENTINJ
> field will be moved to V_INTR_VECTOR and the V_INTR flag will be set,

Not exactly, EVENTINJ ignores the interrupt flag and everything else.
But yes, you can inject the interrupt any time you want using V_IRQ +
V_INTR_VECTOR and it will be injected by the processor.  This is a
very interesting feature but it doesn't fit very well in the KVM
architecture, so we're not using it.  Hyper-V does (and it is also why
it broke mercilessly).

> 2. Since SVM doesn't really have a concept of an interrupt window
> intercept, this is faked by setting V_INTR, as if an injected (or, as
> they call it, virtual) interrupt were pending, together with an
> intercept of virtual interrupts,

Correct.

> 4. After we enter the nested guest, we eventually get a VMexit due
> to an unrelated reason and we sync the V_INTR that *we* set
> to the nested vmcb, since in theory that flag could have been set
> by the CPU itself, if the guest itself used EVENTINJ to inject
> an event into its nested guest while the nested guest didn't have
> interrupts enabled (KVM doesn't do this, but we can't assume that)

I suppose you mean in sync_nested_vmcb_control.  Here, in the latest
version, we have:

	mask = V_IRQ_MASK | V_TPR_MASK;
	if (!(svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK) &&
	    is_intercept(svm, INTERCEPT_VINTR)) {
		/*
		 * In order to request an interrupt window, L0 is usurping
		 * svm->vmcb->control.int_ctl and possibly setting V_IRQ
		 * even if it was clear in L1's VMCB.  Restoring it would be
		 * wrong.  However, in this case V_IRQ will remain true until
		 * interrupt_window_interception calls svm_clear_vintr and
		 * restores int_ctl.  We can just leave it aside.
		 */
		mask &= ~V_IRQ_MASK;
	}

and svm_clear_vintr will copy V_IRQ, V_INTR_VECTOR, etc. back from the
nested vmcb into svm->vmcb (the svm_set_vintr/svm_clear_vintr pair is
sketched below).  Is that enough to fix the bug?

> 5. At that point the bomb is ticking. Once the guest is done dealing
> with the nested guest's VMexit and executes VMRUN, we enter the
> nested guest with V_INTR enabled. The VINTR intercept is disabled,
> since we disabled our interrupt window long ago; the guest also
> currently doesn't enable any interrupt window, so we basically inject
> into the guest whatever is there in V_INTR_VECTOR in the nested
> guest's VMCB.

Yep, this sounds correct.  The question is whether it can still happen
in the latest version of the code, where I tried to think more about
who owns which int_ctl bits when.

> Now that I am thinking about this, the issue is deeper than I thought
> and it stems from the abuse of V_INTR on AMD. IMHO the best solution
> is to avoid interrupt windows on AMD unless really needed (like with
> level-triggered interrupts or so)

Yes, that is the root cause.  However I'm not sure it would be that
much simpler if we didn't abuse VINTR like that, because INT_CTL is a
very complicated field.
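To make the ownership question concrete, the svm_set_vintr/
svm_clear_vintr pair mentioned above looks roughly like this (a
from-memory sketch of what is on the branch, not necessarily the
exact code; V_IRQ_INJECTION_BITS_MASK stands here for the V_IRQ,
V_INTR_PRIO and V_INTR_VECTOR bits of int_ctl):

	static void svm_set_vintr(struct vcpu_svm *svm)
	{
		set_intercept(svm, INTERCEPT_VINTR);

		/*
		 * This is only a dummy V_IRQ so that we get a VINTR vmexit
		 * when the interrupt window opens; the interrupt itself is
		 * injected with EVENTINJ afterwards.
		 */
		svm->vmcb->control.int_vector = 0x0;
		svm->vmcb->control.int_ctl &= ~V_INTR_PRIO_MASK;
		svm->vmcb->control.int_ctl |= V_IRQ_MASK |
			(0xf << V_INTR_PRIO_SHIFT);
		mark_dirty(svm->vmcb, VMCB_INTR);
	}

	static void svm_clear_vintr(struct vcpu_svm *svm)
	{
		clr_intercept(svm, INTERCEPT_VINTR);

		/* Drop the int_ctl bits that L0 usurped...  */
		svm->vmcb->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;

		/* ...and, while L2 is running, put back L1's values.  */
		if (is_guest_mode(&svm->vcpu))
			svm->vmcb->control.int_ctl |=
				svm->nested.ctl.int_ctl &
				V_IRQ_INJECTION_BITS_MASK;

		mark_dirty(svm->vmcb, VMCB_INTR);
	}

So after interrupt_window_interception runs, L1's own view of V_IRQ
and V_INTR_VECTOR is back in svm->vmcb, and the sync code above can
leave those bits aside.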
> Now the problem is that it is next to impossible to know the source
> of the VINTR pending flag. Even if we remember that the host has
> currently set up an interrupt window, the guest afterwards could have
> used EVENTINJ + an interrupt-disabled nested guest to raise that flag
> as well, and might need to know about it.

Actually it is possible!  is_intercept tests L0's VINTR intercept (see
get_host_vmcb in svm.h, sketched in the P.S. below), and that will be
true if and only if we are abusing the V_IRQ/V_INTR_PRIO/V_INTR_VECTOR
fields.

Furthermore, L0 should use VINTR during nested VMRUN only if both of
the following conditions are true:

- L1 is not using V_INTR_MASKING

- L0 has a pending interrupt (of course)

This is because, when virtual interrupts are in use, you can inject
physical interrupts into L1 at any time by taking an EXIT_INTR vmexit.

My theory as to why the bug could happen involved a race between the
call to kvm_x86_ops.interrupt_allowed(vcpu, true) in
inject_pending_event and the call to kvm_cpu_has_injectable_intr in
vcpu_enter_guest.  Again, that one should be fixed in the latest
version on the branch, but there could be more than one bug!

> I have an idea on how to fix this, which is about 99% correct and
> will only fail if the guest attempts something that is undefined
> anyway.
> 
> Let's set the vector of the fake VINTR to some reserved exception
> value rather than 0 (which the guest is never supposed to inject into
> the nested guest), so then we will know if the VINTR is from our fake
> injection or from the guest itself.  If it is our VINTR then we will
> not sync it to the guest.  In theory it can even be 0, since
> exceptions should never be injected as interrupts anyway; that is
> also a reserved operation.

Unfortunately these are interrupts, not exceptions.  You _can_
configure the PIC (or probably the IOAPIC too) to inject vectors in
the 0-31 range.  Are you old enough to remember INT 8 as the timer
interrupt? :)

Thanks very much for the detective work though!  You made a good
walkthrough overall, so you definitely understood a good part of the
code.

Paolo
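P.S. Since I pointed you at it, the is_intercept/get_host_vmcb pair is
roughly the following (again only a sketch of svm.h as it is on the
branch, so take the details with a grain of salt):

	static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm)
	{
		/* While L2 runs, L0's own copy lives in the host save area. */
		if (is_guest_mode(&svm->vcpu))
			return svm->nested.hsave;
		else
			return svm->vmcb;
	}

	static inline bool is_intercept(struct vcpu_svm *svm, int bit)
	{
		/* Test L0's intercept bit, not the merged L0+L1 word.  */
		return (get_host_vmcb(svm)->control.intercept &
			(1ULL << bit)) != 0;
	}

That is why the test in sync_nested_vmcb_control is true if and only
if L0, not L1, is the one that requested the interrupt window.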