Il 12/03/2014 11:40, Radim Krčmář ha scritto:
2014-03-11 22:05-0300, Marcelo Tosatti:
On Tue, Mar 11, 2014 at 07:11:18PM +0100, Radim Krčmář wrote:
We always disable cr8 intercept in its handler, but only re-enable it
if handling KVM_REQ_EVENT, so there can be a window where we do not
intercept cr8 writes, which allows an interrupt to disrupt a higher
priority task.
Fix this by disabling intercepts in the same function that re-enables
them when needed. This fixes BSOD in Windows 2008.
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Radim Krčmář <rkrcmar@xxxxxxxxxx>
---
arch/x86/kvm/svm.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 64d9bb9..f676c18 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
/* instruction emulation calls kvm_set_cr8() */
r = cr_interception(svm);
- if (irqchip_in_kernel(svm->vcpu.kvm)) {
- clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+ if (irqchip_in_kernel(svm->vcpu.kvm))
return r;
- }
I think that the old code here makes little sense, and for two reasons:
1) There are other ways to change the TPR, and the condition for
setting/clearing the CR8 intercept should be the same for all of them.
Now CR8 is the really optimized one, but there is no reason to treat
them differently and it just complicates understanding the code.
So it is a good thing that your patch moves the clearing of the CR8
write intercept in a generic place (the setting of the intercept is
already generic). Doesn't say much about the correctness of the patch;
it would be just an optimization. But at least the old code is the
"smelly" one.
2) Unconditionally disabling the CR8 intercept is definitely wrong.
What matters is the change in the PPR; if the processor priority is the
same as before, or higher, absolutely nothing has changed from the point
of view of interrupt delivery; if we had an interrupt in the IRR,
waiting to be delivered, we still have it, and we should keep the CR8
intercept enabled.
Your patch does the right thing by virtue of apic_update_ppr setting
KVM_REQ_EVENT (which ultimately calls update_cr8_intercept) exactly if
the PPR has been lowered. The call chain is
kvm_set_cr8->kvm_lapic_set_tpr->apic_set_tpr->apic_update_ppr.
At the end of this email I'll show an example of why this actually is
relatively common on Windows guests.
So, IMO there is no doubt that the change is semantically correct. The
next question then is whether it undoes the V_TPR optimization. You can
prove it by a sort of induction; consider a sequences of events that
start and end with the same IRR, assume CR8 is not intercepted at the
beginning, and prove that CR8 is still not intercepted afterwards.
We can assume that all changes to the TPR are balanced and properly
nested (except you can go low->med->high->low).
The simple sequences are:
1) changes in TPR with no interrupts in the middle; remember that
Windows doesn't really ever disable/enable preemption or interrupt flags
like Linux does. It only modifies the TPR ("raise/lower the IRQL", they
call it). We're assuming that the CR8 intercept is initially disabled,
so a raised-IRQL section of the code that doesn't cause other vmexits
will obviously run at full speed. Not much to see here.
2) delivery of an unmasked interrupt (with priority P) and subsequent
EOI. Changes to TPR don't really matter until EOI, because they are
always to priority >= P and they are balanced. So we ignore them.
To summarize: an interrupt with priority P is going to be delivered, the
VCPU is running at TPR <= P, interrupts are allowed, and the CR8
intercept is disabled.
The interrupt is injected via apic_set_irr/kvm_make_request, and this
causes a call to update_cr8_intercept. If TPR < P, the intercept will
remain disabled. When the EOI is sent, we get another event and another
call to update_cr8_intercept; again, the intercept stays cleared because
IRR == -1.
If TPR == P, the intercept is set while the interrupt handler runs, but
it is still disabled at the end of the interrupt. Looks like another
bugfix; before your patch, it would remain enabled, which is useless.
The TPR == P case is actually interesting for Windows, more below.
The complicated sequences are:
3) a change in the TPR, where an interrupt is masked while the
high-priority task runs. The interrupt is what will cause the intercept
to be set. As soon as the TPR is restored, the interrupt will be
delivered and the intercept cleared (TPR < IRR). We fall back to case 2
above.
4) interrupts that are received while interrupts are not allowed. Here
KVM does not inject the interrupt; it requests the interrupt window and
clears the intercept. Clearing the intercept is okay, because no
interrupt can be delivered anyway and TPR changes are moot until the
interrupt window opens. Once it opens, sync_cr8_to_lapic will call
kvm_set_cr8 and interrupt_window_interception will set KVM_REQ_EVENT.
Any intervening change to the TPR is handled fine: if it got high again,
update_cr8_intercept will set the CR8 intercept and the "loop" restarts;
if it is still low, inject_pending_event will queue the interrupt as in
case 3.
With all these covered, is the base case true? That is, will the CR8
intercept _ever_ be disabled? Yes :) because kvm_vcpu_reset sets
KVM_REQ_EVENT and this calls update_cr8_intercept.
Do you agree with the above analysis? If so, let's look how it can be
used to reply to Marcelo's question.
Shouldnt IRR be injected if TPR < IRR ? (via KVM_REQ_EVENT).
This is "the complicated case" above. Say initially TPR = 15.
1) IRR has interrupt 10.
Interrupt 10 is received and the CR8 intercept is set. Because PPR =
15, the interrupt is not yet in the ISR, only in the IRR.
2) TPR now 9 due to CR8 write.
When the TPR becomes 9, the PPR is lowered to 9 too, thus
apic_update_ppr sets KVM_REQ_EVENT.
3) 10 should be injected.
Event processing calls inject_pending_event, which either queues the
interrupt or requests the interrupt window. In my writing above, this
corresponds to cases 3 and 4 respectively.
As usual, correct me if I'm wrong.
Also not clearing the intercept can cause continuous CR8 writes to
exit until KVM_REQ_EVENT ?
It is intended, I suppose this is because we run with V_INTR_MASKING, so
writes to CR8 only affect V_TPR register; guest then raises it once more
and APIC incorrectly gives us low priority interrupt.
Yes, but the extra exits should not be a problem.
Let's see what could happen with Windows. Windows uses "deferred
procedure calls" (also known as DPCs; basically they're non-preemptible
"bottom halves") extensively. Almost all ISRs will do the bulk of the
work in a DPC, including the timer interrupt. DPC processing is
requested with a priority-2 IPI, hence Windows raises the TPR to 2 in
order to disable preemption.
This is the gory detail of why, as mentioned above, Windows writes to
CR8 in order to disable preemption. Since spinlocks are not
preemptible, taking a spinlock does the same. In both cases the old CR8
value is saved and then restored (when exiting the non-preemptible
section, or when releasing the spinlock).
So, if you have N nested spinlocks and a timer interrupt fires, you will
have up to 2N-1 CR8 write exits. Example with 2 nested spinlocks:
IRR TPR PPR intercept
initial {} 0 0 disabled
take spinlock #1 {} 2 2 disabled
>>> timer interrupt injected {13} 2 13 disabled
>>> timer ISR schedules DPCs {2,13} 2 13 disabled
>>> timer interrupt EOI {2} 2 2 enabled
take spinlock #2 (vmexit!) {2} 2 2 enabled
release spinlock #2 (vmexit!) {2} 2 2 enabled
release spinlock #1 (vmexit!) {2} 0 0 enabled
>>> DPC interrupt injected {2} 0 2 disabled
>>> set CR8 = 2 {2} 2 2 enabled
>>> DPC interrupt EOI {} 2 2 disabled
>>> DPC queue processed {} 2 2 disabled
>>> set CR8 = 0 {} 0 0 disabled
Here N=2 so you have 3 extra vmexits. Note that:
* line "take spinlock #2" writes 2 to CR8, but is really a no-op. But
without this patch, the CR8 intercept would have been disabled!
* line "release spinlock #1" writes 0 to CR8. Without this patch, the
PPR would not have been updated. The DPC interrupt would have been delayed.
* line "DPC interrupt EOI" is the other bugfix/optimization I mentioned
above for TPR==P. Without this patch, the CR8 interrupt would have
stayed enabled, and KVM would have taken an exit when setting CR8 to 0.
So in this scenario with N nested spinlocks there will be indeed 2N-2
more exits than before.
Windows has an extra pair of lock/unlock primitives that do not touch
CR8, that you can use when you know you'r already at TPR=2, so we can
expect the common case to be N=1. And N=1 adds up to 0 extra vmexits.
I'm applying the patch to kvm/next.
As a follow up, I would consider the following changes:
1) stop calling update_cr8_intercept directly. There is a call in
kvm_vcpu_ioctl_set_lapic, which is useless because
kvm_apic_post_state_restore sets KVM_REQ_EVENT. Another in
kvm_arch_vcpu_ioctl_set_sregs, which is how userspace modifies CR8.
Also useless, because the function calls kvm_set_cr8 and ultimately
apic_update_ppr.
2) stop setting the CR8_WRITE intercept in init_vmcb (called by
svm_vcpu_reset, called by kvm_vcpu_reset), since it's always undone
immediately afterwards by the processing of KVM_REQ_EVENT.
Paolo
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html