[RFC PATCH] KVM: x86/xen: don't use broken get_kvmclock_ns() for Xen timers

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Fri, 27 Oct 2023 14:57:48 +0100

From: David Woodhouse <dwmw@xxxxxxxxxxxx>

The get_kvmclock_ns() function is broken. It isn't *just* the general
brokenness of the KVM clock which sometimes gets yanked back into line
with the host's CLOCK_MONOTONIC_RAW, potentially causing time to go
backwards for the guest. It's worse than that; even when the clocks
don't get resynced, get_kvmclock_ns() calculates time differently to
the way the guest does, causing a systemic error.

The guest sees a TSC value scaled from the host TSC by a mul/shift
conversion (hopefully done in hardware). It then converts that guest
TSC value into nanoseconds using the mul/shift conversion given to it
by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

From the guest, http://david.woodhou.se/timerlat.c demonstrates ever
increasing inaccuracies in the length of the actual delay observed when
asking for e.g. 100µs of sleep.

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

As a short-term fix, just calculate the kvmclock precisely the same
way the guest does when setting timers.

XX: does this even need to be conditional on constant TSC? Even if
use_master_clock isn't true, the guest will *still* have used the
*current* values in vcpu->arch.hv_clock with the *current* value of
the guest TSC, won't it?

Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
---
As well as fixing kvmclock properly, I also want to rework some of the
timer code to remove all concept of "now" from it. The guest tells us a
time in (its) kvmclock nanoseconds at which it wants the timer to go
off. We subtract its kvmclock "now" from that, call ktime_get() at a
slightly different "now", then set an hrtimer for that now + the delta.

It'd be much nicer to just calculate the host CLOCK_MONOTONIC_RAW time
at which the guest's kvmclock will be the value that it asked for,
without any of that sloppiness or ever using the word "now".


 arch/x86/kvm/xen.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 0ea6016ad132..0c69781c1d07 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@
 #include <xen/interface/sched.h>
 
 #include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>
 
 #include "cpuid.h"
 #include "trace.h"
@@ -144,6 +145,30 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
+/*
+ * get_kvmclock_ns() is broken when the guest TSC is scaled.
+ *
+ * The guest sees a *scaled* TSC (passed through one mul/shift conversion)
+ * and then performs another mul/shift conversion according to the pvclock
+ * information that KVM provided, to get its kvmclock in nanoseconds. That
+ * is the clock it uses for setting its timers.
+ *
+ * get_kvmclock_ns(), on the other hand, just uses a *single* mul/shift
+ * conversion to go straight from host TSC to nanoseconds. There is a
+ * systemic error due to doing it differently to the guest, eventually
+ * resulting in guest timers going off at the *wrong* times, according
+ * to the guest's idea of what its kvmclock (i.e. Xen clock) is.
+ *
+ * Fixing kvmclock is hard. In the very short term, until we can do that,
+ * just don't use it for the case where its brokenness matters most, the
+ * Xen timers.
+ */
+static uint64_t vcpu_get_kvmclock_ns(struct kvm_vcpu *vcpu)
+{
+	uint64_t guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	return __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
+}
+
 static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
@@ -924,7 +949,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		if (data->u.timer.port && data->u.timer.expires_ns)
 			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
 					    data->u.timer.expires_ns -
-					    get_kvmclock_ns(vcpu->kvm));
+					    vcpu_get_kvmclock_ns(vcpu));
 
 		r = 0;
 		break;
@@ -1374,7 +1399,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 			return true;
 		}
 
-		delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
+		delta = oneshot.timeout_abs_ns - vcpu_get_kvmclock_ns(vcpu);
 		if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) {
 			*r = -ETIME;
 			return true;
@@ -1404,7 +1429,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 		return false;
 
 	if (timeout) {
-		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
+		uint64_t guest_now = vcpu_get_kvmclock_ns(vcpu);
 		int64_t delta = timeout - guest_now;
 
 		/* Xen has a 'Linux workaround' in do_set_timer_op() which
-- 
2.41.0


Attachment:
smime.p7s

Description: S/MIME cryptographic signature