Re: [PATCH] KVM: nVMX: Rework event injection and recovery

On 2013-02-21 11:06, Gleb Natapov wrote:
> On Thu, Feb 21, 2013 at 10:43:57AM +0100, Jan Kiszka wrote:
>> On 2013-02-21 10:22, Gleb Natapov wrote:
>>> On Wed, Feb 20, 2013 at 06:50:50PM +0100, Jan Kiszka wrote:
>>>> On 2013-02-20 18:24, Jan Kiszka wrote:
>>>>> On 2013-02-20 18:01, Gleb Natapov wrote:
>>>>>> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
>>>>>>> On 2013-02-20 15:14, Nadav Har'El wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> By the way, if you haven't seen my description of why the current code
>>>>>>>> did what it did, take a look at
>>>>>>>> http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg54478.html
>>>>>>>> Another description might also come in handy:
>>>>>>>> http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg54476.html
>>>>>>>>
>>>>>>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
>>>>>>>>> This aligns VMX more with SVM regarding event injection and recovery for
>>>>>>>>> nested guests. The changes allow injecting interrupts directly from L0
>>>>>>>>> into L2.
>>>>>>>>>
>>>>>>>>> One difference to SVM is that we always transfer the pending event
>>>>>>>>> injection into the architectural state of the VCPU and then drop it from
>>>>>>>>> there if it turns out that we left L2 to enter L1.
>>>>>>>>
>>>>>>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
>>>>>>>> something a bit different: after the exit from L2 to L1 it still
>>>>>>>> unnecessarily queued the pending interrupt for injection, but then skipped
>>>>>>>> one entry into L1. As usual, the interrupt queue is cleared after that
>>>>>>>> entry, so next time around, when L1 is really entered, the wrong injection
>>>>>>>> is not attempted.
>>>>>>>>
>>>>>>>>> VMX and SVM are now identical in how they recover event injections from
>>>>>>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>>>>>>>> still contains a valid event and, if so, transfer its content into L1's
>>>>>>>>> idt_vectoring_info_field.
>>>>>>>>
>>>>>>>>> To avoid incorrectly leaking an event that L1 wants to inject into the
>>>>>>>>> architectural VCPU state, we skip cancellation on nested run.
>>>>>>>>
>>>>>>>> I didn't understand this last point.
>>>>>>>
>>>>>>> - prepare_vmcs02 sets event to be injected into L2
>>>>>>> - while trying to enter L2, a cancel condition is met
>>>>>>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>>>>>>>   into the arch event queues - it's kept in vmcs12
>>>>>>>
>>>>>> But what if we put it in the arch event queue? It will be reinjected during
>>>>>> the next entry attempt, so nothing bad happens and we have one less if() to
>>>>>> explain. Or am I missing something terrible that will happen?
>>>>>
>>>>> I started without that if but ran into trouble with KVM-on-KVM (L1
>>>>> locks up). Let me dig out the instrumentation and check the event flow
>>>>> again.
>>>>
>>>> OK, got it again: If we transfer an IRQ that L1 wants to send to L2 into
>>>> the architectural VCPU state, we will also trigger enable_irq_window. And
>>>> that raises KVM_REQ_IMMEDIATE_EXIT again as it thinks L0 wants to inject.
>>>> That will send us into an endless loop.
>>>>
>>> Why would we trigger enable_irq_window()? enable_irq_window() triggers
>>> only if an interrupt is pending in one of the irq chips, not in the
>>> architectural VCPU state.
>>
>> This is precisely the case if an IRQ for L1 arrived while we tried to
>> enter L2 and caused the cancellation above.
>>
> But during the next entry the cancelled interrupt is transferred
> from the architectural VCPU state to VM_ENTRY_INTR_INFO_FIELD by
> inject_pending_event()->vmx_inject_irq(), so at the point where
> enable_irq_window() is called the state is exactly the same no matter
> whether we cancelled the interrupt or not during the previous entry
> attempt. What am I missing?

Maybe the fact that we normally have an external IRQ pending either in
some IRQ chip or in the VCPU's architectural state, but not in both at
the same time? By transferring an event that does not come from a
virtual IRQ chip of L0 (but from the one in L1) into the architectural
state, we break this assumption.
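
To illustrate the loop (simplified and from memory, so the exact
conditions may differ slightly from what is in kvm.git): the L1-bound
IRQ sitting in L0's virtual APIC keeps kvm_cpu_has_interrupt() true, so
vcpu_enter_guest() calls enable_irq_window(), and while we are in guest
mode that path requests the immediate exit:

static void enable_irq_window(struct kvm_vcpu *vcpu)
{
	u32 cpu_based_vm_exec_control;

	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
		/*
		 * L2 is supposed to run and L1 wants to exit on external
		 * interrupts, so request an immediate exit right after
		 * the next entry. With the cancelled L1->L2 event now
		 * sitting in the arch state as well, this keeps firing
		 * and we never make progress.
		 */
		kvm_make_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
		return;
	}

	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
}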

> Oh, maybe I am missing that if we do not cancel the interrupt
> then inject_pending_event() will skip
>   if (vcpu->arch.interrupt.pending)
>     ....

If we do not cancel, we will not inject at all (due to missing
KVM_REQ_EVENT).
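
The whole injection block in vcpu_enter_guest() is only entered once
KVM_REQ_EVENT has been raised, and for an aborted entry it is the
cancellation path (__vmx_complete_interrupts) that raises it while
transferring the event. Roughly, again simplified from memory:

	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		inject_pending_event(vcpu);

		/* enable NMI/IRQ window open exits if needed */
		if (vcpu->arch.nmi_pending)
			kvm_x86_ops->enable_nmi_window(vcpu);
		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
			kvm_x86_ops->enable_irq_window(vcpu);
	}

If we skip cancellation, the event simply stays where prepare_vmcs02
put it and goes in unchanged with the next entry attempt.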

> and will inject the interrupt from the APIC that caused the cancellation
> of the previous entry, but then this is a bug since this new interrupt
> will overwrite the one that is still in VM_ENTRY_INTR_INFO_FIELD from the
> previous entry attempt, and there may be another pending interrupt in the
> APIC anyway that will cause enable_irq_window() too.

Maybe the issue is that we do not properly simulate a VMEXIT on an
external interrupt during the emulated vmlaunch/vmresume (like SVM does
for vmrun). Need to check this case again...
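
Something along these lines, purely as a sketch of the idea (I have not
checked where exactly in nested_vmx_run() this would have to live, nor
whether the existing exit-to-L1 path can simply be invoked from there):

	/*
	 * Sketch only: if L1 exits on external interrupts and one is
	 * already pending for it when we emulate vmlaunch/vmresume,
	 * reflect an EXTERNAL_INTERRUPT exit to L1 right away instead
	 * of entering L2 and cancelling the injection afterwards.
	 */
	if (nested_exit_on_intr(vcpu) && kvm_cpu_has_interrupt(vcpu)) {
		/*
		 * Build the vmcs12 exit state (exit reason
		 * EXIT_REASON_EXTERNAL_INTERRUPT) and switch back to
		 * vmcs01, i.e. what nested_vmx_vmexit() does for a real
		 * exit from L2; details elided here.
		 */
		return 1;
	}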

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux