Re: [PATCH] KVM: SVM: fix cr8 intercept window

On 12/03/2014 11:40, Radim Krčmář wrote:
> 2014-03-11 22:05-0300, Marcelo Tosatti:
> > On Tue, Mar 11, 2014 at 07:11:18PM +0100, Radim Krčmář wrote:
> > > We always disable cr8 intercept in its handler, but only re-enable it
> > > if handling KVM_REQ_EVENT, so there can be a window where we do not
> > > intercept cr8 writes, which allows an interrupt to disrupt a higher
> > > priority task.
> > >
> > > Fix this by disabling intercepts in the same function that re-enables
> > > them when needed. This fixes BSOD in Windows 2008.
> > >
> > > Cc: <stable@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> > > ---
> > >  arch/x86/kvm/svm.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > > index 64d9bb9..f676c18 100644
> > > --- a/arch/x86/kvm/svm.c
> > > +++ b/arch/x86/kvm/svm.c
> > > @@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
> > >  	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
> > >  	/* instruction emulation calls kvm_set_cr8() */
> > >  	r = cr_interception(svm);
> > > -	if (irqchip_in_kernel(svm->vcpu.kvm)) {
> > > -		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
> > > +	if (irqchip_in_kernel(svm->vcpu.kvm))
> > >  		return r;
> > > -	}

I think the old code here makes little sense, for two reasons:

1) There are other ways to change the TPR, and the condition for setting/clearing the CR8 intercept should be the same for all of them. CR8 is the only one that is really optimized, but there is no reason to treat it differently; doing so just complicates understanding the code.

So it is a good thing that your patch moves the clearing of the CR8 write intercept to a generic place (the setting of the intercept is already generic). This doesn't say much about the correctness of the patch; it would be just an optimization. But at least the old code is the "smelly" one.


2) Unconditionally disabling the CR8 intercept is definitely wrong. What matters is the change in the PPR; if the processor priority is the same as before, or higher, absolutely nothing has changed from the point of view of interrupt delivery; if we had an interrupt in the IRR, waiting to be delivered, we still have it, and we should keep the CR8 intercept enabled.

Your patch does the right thing by virtue of apic_update_ppr setting KVM_REQ_EVENT (which ultimately calls update_cr8_intercept) exactly if the PPR has been lowered. The call chain is kvm_set_cr8->kvm_lapic_set_tpr->apic_set_tpr->apic_update_ppr.
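
The "exactly if the PPR has been lowered" condition can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not the KVM code: priorities are 4-bit classes as written to CR8, the names model_apic, model_update_ppr and model_set_tpr are hypothetical, and the req_event flag merely stands in for KVM_REQ_EVENT.

```c
#include <stdbool.h>

/* Hypothetical model of the local APIC priority state (not a KVM struct). */
struct model_apic {
	int tpr;        /* task priority class, 0..15 (the CR8 value) */
	int isrv;       /* highest in-service priority class, -1 if none */
	int ppr;        /* processor priority class */
	bool req_event; /* stands in for KVM_REQ_EVENT */
};

/* PPR is the higher of the TPR and the highest in-service priority. */
void model_update_ppr(struct model_apic *apic)
{
	int old_ppr = apic->ppr;
	int ppr = apic->tpr;

	if (apic->isrv > ppr)
		ppr = apic->isrv;
	apic->ppr = ppr;

	/* Only a lowered PPR can unmask a pending interrupt. */
	if (ppr < old_ppr)
		apic->req_event = true;
}

/* Models the tail of the kvm_set_cr8 call chain: store TPR, recompute PPR. */
void model_set_tpr(struct model_apic *apic, int tpr)
{
	apic->tpr = tpr;
	model_update_ppr(apic);
}
```

In this model, rewriting the same TPR or raising it leaves req_event untouched; only a write that lowers the PPR raises the request.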

At the end of this email I'll show an example of why this actually is relatively common on Windows guests.


So, IMO there is no doubt that the change is semantically correct. The next question then is whether it undoes the V_TPR optimization. You can prove it by a sort of induction; consider a sequence of events that starts and ends with the same IRR, assume CR8 is not intercepted at the beginning, and prove that CR8 is still not intercepted afterwards.

We can assume that all changes to the TPR are balanced and properly nested (except you can go low->med->high->low).


The simple sequences are:

1) changes in TPR with no interrupts in the middle; remember that Windows doesn't really ever disable/enable preemption or interrupt flags like Linux does. It only modifies the TPR ("raise/lower the IRQL", they call it). We're assuming that the CR8 intercept is initially disabled, so a raised-IRQL section of the code that doesn't cause other vmexits will obviously run at full speed. Not much to see here.

2) delivery of an unmasked interrupt (with priority P) and subsequent EOI. Changes to TPR don't really matter until EOI, because they are always to priority >= P and they are balanced. So we ignore them.

To summarize: an interrupt with priority P is going to be delivered, the VCPU is running at TPR <= P, interrupts are allowed, and the CR8 intercept is disabled.

The interrupt is injected via apic_set_irr/kvm_make_request, and this causes a call to update_cr8_intercept. If TPR < P, the intercept will remain disabled. When the EOI is sent, we get another event and another call to update_cr8_intercept; again, the intercept stays cleared because IRR == -1.

If TPR == P, the intercept is set while the interrupt handler runs, but it is still disabled at the end of the interrupt. Looks like another bugfix; before your patch, it would remain enabled, which is useless. The TPR == P case is actually interesting for Windows, more below.
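
The rule that cases 1 and 2 rely on reduces to a single comparison. The following is a minimal sketch of that rule as the analysis uses it, not the function from svm.c; model_cr8_intercept_needed and MODEL_NO_IRR are hypothetical names, and both arguments are assumed to be 4-bit priority classes.

```c
#include <stdbool.h>

#define MODEL_NO_IRR (-1) /* hypothetical marker: nothing pending in the IRR */

/*
 * Should CR8 writes be intercepted?
 * Only when an interrupt is pending and the current TPR masks it,
 * because only then can a CR8 write make it deliverable.
 */
bool model_cr8_intercept_needed(int tpr, int max_irr)
{
	if (max_irr == MODEL_NO_IRR)
		return false;
	return tpr >= max_irr;
}
```

With TPR < P the function returns false (intercept stays disabled), with TPR == P it returns true while the interrupt is pending, and once the IRR is empty again it returns false regardless of the TPR.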


The complicated sequences are:

3) a change in the TPR, where an interrupt is masked while the high-priority task runs. The interrupt is what will cause the intercept to be set. As soon as the TPR is restored, the interrupt will be delivered and the intercept cleared (TPR < IRR). We fall back to case 2 above.

4) interrupts that are received while interrupts are not allowed. Here KVM does not inject the interrupt; it requests the interrupt window and clears the intercept. Clearing the intercept is okay, because no interrupt can be delivered anyway and TPR changes are moot until the interrupt window opens. Once it opens, sync_cr8_to_lapic will call kvm_set_cr8 and interrupt_window_interception will set KVM_REQ_EVENT. Any intervening change to the TPR is handled fine: if it got high again, update_cr8_intercept will set the CR8 intercept and the "loop" restarts; if it is still low, inject_pending_event will queue the interrupt as in case 3.


With all these covered, is the base case true? That is, will the CR8 intercept _ever_ be disabled? Yes :) because kvm_vcpu_reset sets KVM_REQ_EVENT and this calls update_cr8_intercept.

Do you agree with the above analysis? If so, let's look at how it can be used to reply to Marcelo's question.

> > Shouldnt IRR be injected if TPR < IRR ? (via KVM_REQ_EVENT).

This is "the complicated case" above.  Say initially TPR = 15.

> > 1) IRR has interrupt 10.

Interrupt 10 is received and the CR8 intercept is set. Because PPR = 15, the interrupt is not yet in the ISR, only in the IRR.

> > 2) TPR now 9 due to CR8 write.

When the TPR becomes 9, the PPR is lowered to 9 too, thus apic_update_ppr sets KVM_REQ_EVENT.

> > 3) 10 should be injected.

Event processing calls inject_pending_event, which either queues the interrupt or requests the interrupt window. In my writing above, this corresponds to cases 3 and 4 respectively.
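
The three steps of this scenario can be traced with a small decision function. This is a hedged sketch of the choice described here, not the body of inject_pending_event; model_action and model_process_event are hypothetical names, and irqs_allowed is an assumed flag for "the guest can currently take interrupts".

```c
#include <stdbool.h>

/* Hypothetical outcomes of event processing for one pending interrupt. */
enum model_action {
	MODEL_NOTHING,        /* interrupt is still masked by priority */
	MODEL_QUEUE_IRQ,      /* case 3: the interrupt is queued for injection */
	MODEL_REQUEST_WINDOW  /* case 4: wait for the interrupt window */
};

/*
 * ppr and max_irr are 4-bit priority classes; max_irr is -1 if the IRR
 * is empty. An interrupt is deliverable only if its priority is
 * strictly above the PPR.
 */
enum model_action model_process_event(int ppr, int max_irr, bool irqs_allowed)
{
	if (max_irr == -1 || max_irr <= ppr)
		return MODEL_NOTHING;
	return irqs_allowed ? MODEL_QUEUE_IRQ : MODEL_REQUEST_WINDOW;
}
```

With PPR = 15 and interrupt 10 pending, the result is MODEL_NOTHING. After the CR8 write lowers the PPR to 9, the same call returns MODEL_QUEUE_IRQ when interrupts are allowed and MODEL_REQUEST_WINDOW otherwise.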

As usual, correct me if I'm wrong.

> > Also not clearing the intercept can cause continuous CR8 writes to
> > exit until KVM_REQ_EVENT ?

> It is intended, I suppose this is because we run with V_INTR_MASKING, so
> writes to CR8 only affect V_TPR register; guest then raises it once more
> and APIC incorrectly gives us low priority interrupt.

Yes, but the extra exits should not be a problem.

Let's see what could happen with Windows. Windows uses "deferred procedure calls" (also known as DPCs; basically they're non-preemptible "bottom halves") extensively. Almost all ISRs will do the bulk of the work in a DPC, including the timer interrupt. DPC processing is requested with a priority-2 IPI, hence Windows raises the TPR to 2 in order to disable preemption.

This is the gory detail of why, as mentioned above, Windows writes to CR8 in order to disable preemption. Since spinlocks are not preemptible, taking a spinlock does the same. In both cases the old CR8 value is saved and then restored (when exiting the non-preemptible section, or when releasing the spinlock).

So, if you have N nested spinlocks and a timer interrupt fires, you will have up to 2N-1 CR8 write exits. Example with 2 nested spinlocks:

                                     IRR    TPR   PPR    intercept
initial                              {}     0     0      disabled
take spinlock #1                     {}     2     2      disabled
>>> timer interrupt injected         {13}   2     13     disabled
>>> timer ISR schedules DPCs         {2,13} 2     13     disabled
>>> timer interrupt EOI              {2}    2     2      enabled
take spinlock #2 (vmexit!)           {2}    2     2      enabled
release spinlock #2 (vmexit!)        {2}    2     2      enabled
release spinlock #1 (vmexit!)        {2}    0     0      enabled
>>> DPC interrupt injected           {2}    0     2      disabled
>>> set CR8 = 2                      {2}    2     2      enabled
>>> DPC interrupt EOI                {}     2     2      disabled
>>> DPC queue processed              {}     2     2      disabled
>>> set CR8 = 0                      {}     0     0      disabled

Here N=2 so you have 3 extra vmexits.  Note that:

* line "take spinlock #2" writes 2 to CR8, but is really a no-op. But without this patch, the CR8 intercept would have been disabled!

* line "release spinlock #1" writes 0 to CR8. Without this patch, the PPR would not have been updated. The DPC interrupt would have been delayed.

* line "DPC interrupt EOI" is the other bugfix/optimization I mentioned above for TPR==P. Without this patch, the CR8 intercept would have stayed enabled, and KVM would have taken an exit when setting CR8 to 0.

So in this scenario with N nested spinlocks there will indeed be 2N-2 more exits than before.
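
The 2N-1 and 2N-2 figures can be checked with a short counting sketch. It assumes, as in the table, that the timer interrupt arrives after the first spinlock is taken, so the intercept is enabled for the remaining N-1 acquisitions and all N releases. The function name model_cr8_exits and the sticky flag are hypothetical: sticky = true models the patched behaviour, sticky = false models the old handler that cleared the intercept on its first exit.

```c
#include <stdbool.h>

/* Count CR8-write vmexits for n >= 1 nested spinlocks in this scenario. */
int model_cr8_exits(int n, bool sticky)
{
	bool intercept = true;     /* enabled after the timer interrupt EOI */
	int writes = (n - 1) + n;  /* remaining acquisitions plus all releases */
	int exits = 0;

	for (int i = 0; i < writes; i++) {
		if (!intercept)
			continue;  /* write runs at full speed */
		exits++;
		if (!sticky)
			intercept = false; /* old code: cleared in the handler */
	}
	return exits;
}
```

For n = 2 this gives 3 exits with the patch and 1 without, a difference of 2N-2 = 2; for n = 1 both variants take a single exit.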

Windows has an extra pair of lock/unlock primitives that do not touch CR8, which you can use when you know you're already at TPR=2, so we can expect the common case to be N=1. And N=1 adds up to 0 extra vmexits.


I'm applying the patch to kvm/next.

As a follow-up, I would consider the following changes:

1) stop calling update_cr8_intercept directly. There is a call in kvm_vcpu_ioctl_set_lapic, which is useless because kvm_apic_post_state_restore sets KVM_REQ_EVENT. There is another in kvm_arch_vcpu_ioctl_set_sregs, which is how userspace modifies CR8; it is also useless, because the function calls kvm_set_cr8 and ultimately apic_update_ppr.

2) stop setting the CR8_WRITE intercept in init_vmcb (called by svm_vcpu_reset, called by kvm_vcpu_reset), since it's always undone immediately afterwards by the processing of KVM_REQ_EVENT.

Paolo