From: David Woodhouse <dwmw@xxxxxxxxxxxx> The get_kvmclock_ns() function is broken. It isn't *just* the general brokenness of the KVM clock which sometimes gets yanked back into line with the host's CLOCK_MONOTONIC_RAW, potentially causing time to go backwards for the guest. It's worse than that; even when the clocks don't get resynced, get_kvmclock_ns() calculates time differently to the way the guest does, causing a systemic error. The guest sees a TSC value scaled from the host TSC by a mul/shift conversion (hopefully done in hardware). It then converts that guest TSC value into nanoseconds using the mul/shift conversion given to it by the KVM pvclock information. But get_kvmclock_ns() performs only a single conversion directly from host TSC to nanoseconds, giving a different result. A test program at http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over a day. From the guest, http://david.woodhou.se/timerlat.c demonstrates ever increasing inaccuracies in the length of the actual delay observed when asking for e.g. 100µs of sleep. root@ip-10-0-193-21:~# ./timerlat -c -n 5 00000000 latency 103243/100000 (3.2430%) 00000001 latency 103243/100000 (3.2430%) 00000002 latency 103242/100000 (3.2420%) 00000003 latency 103245/100000 (3.2450%) 00000004 latency 103245/100000 (3.2450%) As a short-term fix, just calculate the kvmclock precisely the same way the guest does when setting timers. XX: does this even need to be conditional on constant TSC? Even if use_master_clock isn't true, the guest will *still* have used the *current* values in vcpu->arch.hv_clock with the *current* value of the guest TSC, won't it? Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx> --- As well as fixing kvmclock properly, I also want to rework some of the timer code to remove all concept of "now" from it. The guest tells us a time in (its) kvmclock nanoseconds at which it wants the timer to go off. We subtract its kvmclock "now" from that, call ktime_get() at a slightly different "now", then set an hrtimer for that now + the delta. It'd be much nicer to just calculate the host CLOCK_MONOTONIC_RAW time at which the guest's kvmclock will be the value that it asked for, without any of that sloppiness or ever using the word "now". arch/x86/kvm/xen.c | 31 ++++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 0ea6016ad132..0c69781c1d07 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -24,6 +24,7 @@ #include <xen/interface/sched.h> #include <asm/xen/cpuid.h> +#include <asm/pvclock.h> #include "cpuid.h" #include "trace.h" @@ -144,6 +145,30 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer) return HRTIMER_NORESTART; } +/* + * get_kvmclock_ns() is broken when the guest TSC is scaled. + * + * The guest sees a *scaled* TSC (passed through one mul/shift conversion) + * and then performs another mul/shift conversion according to the pvclock + * information that KVM provided, to get its kvmclock in nanoseconds. That + * is the clock it uses for setting its timers. + * + * get_kvmclock_ns(), on the other hand, just uses a *single* mul/shift + * conversion to go straight from host TSC to nanoseconds. There is a + * systemic error due to doing it differently to the guest, eventually + * resulting in guest timers going off at the *wrong* times, according + * to the guest's idea of what its kvmclock (i.e. Xen clock) is. + * + * Fixing kvmclock is hard. In the very short term, until we can do that, + * just don't use it for the case where its brokenness matters most, the + * Xen timers. + */ +static uint64_t vcpu_get_kvmclock_ns(struct kvm_vcpu *vcpu) +{ + uint64_t guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); + return __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc); +} + static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns) { atomic_set(&vcpu->arch.xen.timer_pending, 0); @@ -924,7 +949,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data) if (data->u.timer.port && data->u.timer.expires_ns) kvm_xen_start_timer(vcpu, data->u.timer.expires_ns, data->u.timer.expires_ns - - get_kvmclock_ns(vcpu->kvm)); + vcpu_get_kvmclock_ns(vcpu)); r = 0; break; @@ -1374,7 +1399,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd, return true; } - delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm); + delta = oneshot.timeout_abs_ns - vcpu_get_kvmclock_ns(vcpu); if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) { *r = -ETIME; return true; @@ -1404,7 +1429,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout, return false; if (timeout) { - uint64_t guest_now = get_kvmclock_ns(vcpu->kvm); + uint64_t guest_now = vcpu_get_kvmclock_ns(vcpu); int64_t delta = timeout - guest_now; /* Xen has a 'Linux workaround' in do_set_timer_op() which -- 2.41.0
Attachment:
smime.p7s
Description: S/MIME cryptographic signature