On 23/05/16 15:30, Alan Jenkins wrote:
On 22/05/2016, Alan Jenkins <alan.christopher.jenkins@xxxxxxxxx> wrote:
On 22/05/2016, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@xxxxxxxxx>:
2016-05-20 19:47 GMT+08:00 Alan Jenkins
<alan.christopher.jenkins@xxxxxxxxx>:
Hi
I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
and it didn't seem to reproduce on a second setup. However, I found a
suspect in the code for TSC deadline timers. Maybe someone is
interested in my analysis.
If a guest has a TSC deadline timer set, the timer is not rescheduled
when the TSC is adjusted, and so it will fire at the wrong time. The
hrtimer used for emulation is not being recalculated. If the TSC was
set _backwards_, I think this could trigger a lockup in
wait_lapic_expire(). This function is a busy-loop optimization, which
could be tricked into busy-looping for far too long. The expected
busy-wait is at most `lapic_timer_advance_ns`, which defaults to 0 (I
haven't changed it).
I wrote an expanded test patch, which is posted with full results on
the Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=1337667
It confirmed the lockup can be avoided by sanity-checking the delay in
wait_lapic_expire(). With my test patch, I get a warning message but
no lockup.
My test patch also added checks for the guest TSC being set backwards.
The log confirms this is happening, though I haven't confirmed exactly
how the lockup is created. It's non-obvious whether there's a match
between the adjustments shown and the exact lockup duration.
(Conversion factor = reported guest TSC rate = 1.6 GHz = host TSC
rate.)
Note the kernel log shows a VM being resumed from a snapshot, twice.
I don't stop the VM before I ask virt-manager to resume from snapshot
the second time. Normally the lockup happens on that second resume.
While I'm ranting: adding the TSC checks made me think I understand the
code well enough to sketch the "real" fix. Not tested:
https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
Good news! The long busy-wait goes away, basically with that patch plus
one extra line.
Inline patch below for easier commenting; attached patch for any testing
(unmangled by the mailer). I haven't tried the unit tests yet.
More "good" news: the shorter busy-waits (50k-500k cycles) I noticed
while testing turned out not to be caused by my patch. I got a burst of
them at 06:46 after accidentally leaving a VM running overnight, on
v4.6 + printk debugging only. My guess is CPU migration, causing the
hrtimer to fire "early" relative to the new CPU's TSC: a) the hrtimer
is not recalculated on migration; b) I think migration avoids having
to bump the TSC up if the VCPU was idling for a bit. If so, a)
suggests another one-line fix.
(I came up with some code to handle drift, while I was thrashing towards
that "1 extra line". Unfortunately I'm not sure there's sufficient
justification, and it wouldn't close the above issue).
Warm regards
Alan
From b0ee649b75adf0445b9dcfa87125637d8d9c8350 Mon Sep 17 00:00:00 2001
From: Alan Jenkins <alan.christopher.jenkins@xxxxxxxxx>
Date: Wed, 8 Jun 2016 12:25:54 +0100
Subject: [PATCH] KVM: x86: Reschedule emulated TSC deadline timer when guest
TSC is written
This should also avoid an extended busy-wait in wait_lapic_expire()
(which runs with interrupts disabled). That lockup was previously
triggered by restoring a live snapshot of a VM.
To fix the lockup, we also need to clear expired_tscdeadline (used for the
busy-wait) when userspace adjusts the TSC.
Clearing an intended busy-wait has the potential to cause a wakeup
before the TSC deadline. However, I can't think of many use cases where
the host is likely to adjust the TSC of a running VCPU. In the case of
snapshot restore, we can avoid this problem by deferring the setup of
expired_tscdeadline until the timer interrupt is injected.
I think deferring the setup of expired_tscdeadline by itself narrows
the lockup-causing race to a smaller window, and might avoid the hit
from snapshot restore most of the time. Therefore I made sure to test
that lockups were avoided even without the deferred setup of
expired_tscdeadline.
---
arch/x86/kvm/lapic.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++------
arch/x86/kvm/lapic.h | 1 +
arch/x86/kvm/x86.c | 7 ++++-
3 files changed, 79 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1a2da0e..84abb1a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1273,7 +1273,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
{
struct kvm_vcpu *vcpu = apic->vcpu;
struct swait_queue_head *q = &vcpu->wq;
- struct kvm_timer *ktimer = &apic->lapic_timer;
if (atomic_read(&apic->lapic_timer.pending))
return;
@@ -1283,9 +1282,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
if (swait_active(q))
swake_up(q);
-
- if (apic_lvtt_tscdeadline(apic))
- ktimer->expired_tscdeadline = ktimer->tscdeadline;
}
/*
@@ -1922,12 +1918,15 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
+ struct kvm_timer *ktimer = &apic->lapic_timer;
- if (atomic_read(&apic->lapic_timer.pending) > 0) {
+ if (atomic_read(&ktimer->pending) > 0) {
kvm_apic_local_deliver(apic, APIC_LVTT);
- if (apic_lvtt_tscdeadline(apic))
- apic->lapic_timer.tscdeadline = 0;
- atomic_set(&apic->lapic_timer.pending, 0);
+ if (apic_lvtt_tscdeadline(apic)) {
+ ktimer->expired_tscdeadline = ktimer->tscdeadline;
+ ktimer->tscdeadline = 0;
+ }
+ atomic_set(&ktimer->pending, 0);
}
}
@@ -2007,6 +2006,71 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
}
/*
+ * kvm_apic_tsc_update - called when guest or host changes the TSC
+ *
+ * Update the emulated TSC deadline timer.
+ *
+ * It is also required if host changes TSC frequency scaling. It is not
+ * required when we "catch up" the TSC, because that is maintaining the
+ * relationship between TSC and real time.
+ */
+void kvm_apic_tsc_update(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+
+ if (!lapic_in_kernel(vcpu))
+ return;
+
+ if (!apic_lvtt_tscdeadline(apic))
+ return;
+
+ hrtimer_cancel(&ktimer->timer);
+
+ /*
+ * Handle the case where the guest TSC is adjusted backwards just after
+ * inject_apic_timer_irqs(). When we then hit wait_lapic_expire(), we
+ * risk an extended busy-wait, because ktimer->expired_tscdeadline will
+ * have receded further into the future.
+ *
+ * The guest does not get a chance to run in this window, but the host
+ * could modify the TSC at this point. This race could happen when
+ * restoring a live snapshot of a VM.
+ *
+ * Just clear the busy-wait. In theory this could cause a wakeup
+ * before the deadline, if the user configured the busy-wait
+ * feature (lapic_timer_advance_ns). We don't expect this to matter:
+ *
+ * The only use case for a TSC rewind is a full state restore.
+ * Restoring the TSC deadline as well will fix things up (before
+ * or after, it doesn't matter), given that you don't re-enter the
+ * guest in-between. (Doing so would allow another race with
+ * inject_apic_timer_irqs()).
+ *
+ * This implementation is tightly coupled with the current
+ * inject_apic_timer_irqs():
+ *
+ * Assumption: ktimer->tscdeadline is not cleared until the timer
+ * interrupt is injected. So if the interrupt has not been injected
+ * yet, start_apic_timer() can re-calculate the time until
+ * tscdeadline.
+ *
+ * Assumption: ktimer->expired_tscdeadline is set when the timer
+ * interrupt is injected. If the interrupt has not been injected yet,
+ * start_apic_timer() will clear the pending timer and reschedule it.
+ * The busy-wait will be set up once the timer is finally injected.
+ */
+ ktimer->expired_tscdeadline = 0;
+
+ /*
+ * Recalculate the timer.
+ * Note when the emulated timer fires, it clears the tsc deadline.
+ * So we won't cause the timer to fire a second time.
+ */
+ start_apic_timer(apic);
+}
+
+/*
* apic_sync_pv_eoi_from_guest - called on vmexit or cancel interrupt
*
* Detect whether guest triggered PV EOI since the
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index f71183e..b441793 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -77,6 +77,7 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
+void kvm_apic_tsc_update(struct kvm_vcpu *vcpu);
void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b7798c..87c36dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1531,6 +1531,8 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
kvm_track_tsc_matching(vcpu);
spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
+
+ kvm_apic_tsc_update(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -2094,6 +2096,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (!msr_info->host_initiated) {
s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
adjust_tsc_offset_guest(vcpu, adj);
+ kvm_apic_tsc_update(vcpu);
}
vcpu->arch.ia32_tsc_adjust_msr = data;
}
@@ -3474,8 +3477,10 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (user_tsc_khz == 0)
user_tsc_khz = tsc_khz;
- if (!kvm_set_tsc_khz(vcpu, user_tsc_khz))
+ if (!kvm_set_tsc_khz(vcpu, user_tsc_khz)) {
r = 0;
+ kvm_apic_tsc_update(vcpu);
+ }
goto out;
}
--
2.5.5