On 23/05/16 15:30, Alan Jenkins wrote:
On 22/05/2016, Alan Jenkins <alan.christopher.jenkins@xxxxxxxxx> wrote:
On 22/05/2016, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@xxxxxxxxx>:
2016-05-20 19:47 GMT+08:00 Alan Jenkins
<alan.christopher.jenkins@xxxxxxxxx>:
Hi
I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
and it didn't seem to reproduce on a second setup. However, I found a
suspect in the code for TSC deadline timers. Maybe someone is
interested in my analysis.
If a guest has a TSC deadline timer set, the timer is not rescheduled
when the TSC is adjusted, and so it will fire at the wrong time. The
hrtimer used for emulation is not being recalculated. If the TSC was
set _backwards_, I think this could trigger a lockup in
wait_lapic_expire(). This function is a busy-loop optimization, which
could be tricked into busy-looping for far too long. The expected
busy-wait is at most `lapic_timer_advance_ns`, which defaults to 0 (I
haven't changed it).
I wrote an expanded test patch, which is posted with full results on
the Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=1337667
It confirmed the lockup can be avoided by sanity-checking the delay in
wait_lapic_expire(). With my test patch, I get a warning message but
no lockup.
My test patch also added checks for the guest TSC being set backwards.
The log confirms this is happening, though I haven't confirmed exactly
how the lockup is created. It's non-obvious whether there's a match
between the adjustments shown and the exact lockup duration.
(Conversion factor = reported guest TSC rate = 1.6 GHz = host TSC
rate.)
Note the kernel log shows a VM being resumed from a snapshot, twice.
I don't stop the VM before I ask virt-manager to resume from snapshot
the second time. Normally the lockup happens on that second resume.
While I'm ranting: adding the TSC checks made me think I understand the
code well enough to sketch the "real" fix. Not tested:
https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
Good news! The long busy-wait goes away, basically with that patch plus
one extra line.
Inline patch below for easier commenting; attached patch for any testing
(unmangled by the mailer). I haven't tried the unit tests yet.
More "good" news: the shorter busy-waits (50k-500k cycles) I noticed
while testing turned out not to be caused by my patch. I got a burst of
them at 06:46 after accidentally leaving a VM running overnight, on
v4.6 + printk debugging only. My guess is CPU migration, causing the
hrtimer to fire "early" relative to the new CPU's TSC: a) the hrtimer
is not recalculated on migration; b) I think migration avoids having
to bump the TSC up if the VCPU was idling for a bit. If so, a)
suggests another one-line fix.
(I came up with some code to handle drift, while I was thrashing towards
that "1 extra line". Unfortunately I'm not sure there's sufficient
justification, and it wouldn't close the above issue).
Warm regards
Alan
From b0ee649b75adf0445b9dcfa87125637d8d9c8350 Mon Sep 17 00:00:00 2001
From: Alan Jenkins <alan.christopher.jenkins@xxxxxxxxx>
Date: Wed, 8 Jun 2016 12:25:54 +0100
Subject: [PATCH] KVM: x86: Reschedule emulated TSC deadline timer when guest
TSC is written
This should also avoid an extended busy-wait in wait_lapic_expire()
(which runs with interrupts disabled). That lockup was previously
triggered by restoring a live snapshot of a VM.
To fix the lockup, we also need to clear expired_tscdeadline (used for the
busy-wait) when userspace adjusts the TSC.
Clearing an intended busy-wait has the potential to cause a wakeup
before the TSC deadline. However, I can't think of many use cases where
the host is likely to adjust the TSC of a running VCPU. In the case of
snapshot restore, we can avoid this problem by deferring the setup of
expired_tscdeadline until the timer interrupt is injected.
I think deferring the setup of expired_tscdeadline by itself narrows
the lockup-causing race to a smaller window, and might avoid the hit
from snapshot restore most of the time. Therefore I made sure to test
that lockups were avoided even without the deferred setup of
expired_tscdeadline.
---
arch/x86/kvm/lapic.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++------
arch/x86/kvm/lapic.h | 1 +
arch/x86/kvm/x86.c | 7 ++++-
3 files changed, 79 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1a2da0e..84abb1a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1273,7 +1273,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
{
struct kvm_vcpu *vcpu = apic->vcpu;
struct swait_queue_head *q = &vcpu->wq;
- struct kvm_timer *ktimer = &apic->lapic_timer;
if (atomic_read(&apic->lapic_timer.pending))
return;
@@ -1283,9 +1282,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
if (swait_active(q))
swake_up(q);
-
- if (apic_lvtt_tscdeadline(apic))
- ktimer->expired_tscdeadline = ktimer->tscdeadline;
}
/*
@@ -1922,12 +1918,15 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
+ struct kvm_timer *ktimer = &apic->lapic_timer;
- if (atomic_read(&apic->lapic_timer.pending) > 0) {
+ if (atomic_read(&ktimer->pending) > 0) {
kvm_apic_local_deliver(apic, APIC_LVTT);
- if (apic_lvtt_tscdeadline(apic))
- apic->lapic_timer.tscdeadline = 0;
- atomic_set(&apic->lapic_timer.pending, 0);
+ if (apic_lvtt_tscdeadline(apic)) {
+ ktimer->expired_tscdeadline = ktimer->tscdeadline;
+ ktimer->tscdeadline = 0;
+ }
+ atomic_set(&ktimer->pending, 0);
}
}
@@ -2007,6 +2006,71 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
}
/*
+ * kvm_apic_tsc_update - called when guest or host changes the TSC
+ *
+ * Update the emulated TSC deadline timer.
+ *
+ * It is also required if host changes TSC frequency scaling. It is not
+ * required when we "catch up" the TSC, because that is maintaining the
+ * relationship between TSC and real time.
+ */
+void kvm_apic_tsc_update(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+
+ if (!lapic_in_kernel(vcpu))
+ return;
+
+ if (!apic_lvtt_tscdeadline(apic))
+ return;
+
+ hrtimer_cancel(&ktimer->timer);
+
+ /*
+ * Handle the case where the guest TSC is adjusted backwards just after
+ * inject_apic_timer_irqs(). When we then hit wait_lapic_expire(), we
+ * risk an extended busy-wait, because ktimer->expired_tscdeadline will
+ * have receded further into the future.
+ *
+ * The guest does not get a chance to run in this window, but the host
+ * could modify the TSC at this point. This race could happen when
+ * restoring a live snapshot of a VM.
+ *
+ * Just clear the busy-wait. In theory this could cause a wakeup
+ * before the deadline, if the user configured the busy-wait
+ * feature (lapic_timer_advance_ns). We don't expect this to matter:
+ *
+ * The only use case for a TSC rewind is a full state restore.
+ * Restoring the TSC deadline as well will fix things up (before
+ * or after, it doesn't matter), given that you don't re-enter the
+ * guest in-between. (Doing so would allow another race with
+ * inject_apic_timer_irqs()).
+ *
+ * This implementation is tightly coupled with the current
+ * inject_apic_timer_irqs():
+ *
+ * Assumption: ktimer->tscdeadline is not cleared until the timer
+ * interrupt is injected. So if the interrupt has not been injected
+ * yet, start_apic_timer() can re-calculate the time until
+ * tscdeadline.
+ *
+ * Assumption: ktimer->expired_tscdeadline is set when the timer
+ * interrupt is injected. If the interrupt has not been injected yet,
+ * start_apic_timer() will clear the pending timer and reschedule it.
+ * The busy-wait will be set up once the timer is finally injected.
+ */
+ ktimer->expired_tscdeadline = 0;
+
+ /*
+ * Recalculate the timer.
+ * Note when the emulated timer fires, it clears the tsc deadline.
+ * So we won't cause the timer to fire a second time.
+ */
+ start_apic_timer(apic);
+}
+
+/*
* apic_sync_pv_eoi_from_guest - called on vmexit or cancel interrupt
*
* Detect whether guest triggered PV EOI since the
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index f71183e..b441793 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -77,6 +77,7 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
+void kvm_apic_tsc_update(struct kvm_vcpu *vcpu);
void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b7798c..87c36dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1531,6 +1531,8 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
kvm_track_tsc_matching(vcpu);
spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
+
+ kvm_apic_tsc_update(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -2094,6 +2096,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (!msr_info->host_initiated) {
s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
adjust_tsc_offset_guest(vcpu, adj);
+ kvm_apic_tsc_update(vcpu);
}
vcpu->arch.ia32_tsc_adjust_msr = data;
}
@@ -3474,8 +3477,10 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (user_tsc_khz == 0)
user_tsc_khz = tsc_khz;
- if (!kvm_set_tsc_khz(vcpu, user_tsc_khz))
+ if (!kvm_set_tsc_khz(vcpu, user_tsc_khz)) {
r = 0;
+ kvm_apic_tsc_update(vcpu);
+ }
goto out;
}
--
2.5.5