On Fri, 2023-10-27 at 14:57 +0100, David Woodhouse wrote:
> As well as fixing kvmclock properly, I also want to rework some of the
> timer code to remove all concept of "now" from it. The guest tells us a
> time in (its) kvmclock nanoseconds at which it wants the timer to go
> off. We subtract its kvmclock "now" from that, call ktime_get() at a
> slightly different "now", then set an hrtimer for that now + the delta.
>
> It'd be much nicer to just calculate the host CLOCK_MONOTONIC_RAW time
> at which the guest's kvmclock will be the value that it asked for,
> without any of that sloppiness or ever using the word "now".

Actually, with the slight wrinkle that we can't use CLOCK_MONOTONIC_RAW
with hrtimers, that works out fairly simply and even makes the "don't
use get_kvmclock_ns()" part a little less blatant.

From: David Woodhouse <dwmw@xxxxxxxxxxxx>
Subject: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). It then
converts that guest TSC value into nanoseconds using the mul/shift
conversion given to it by the KVM pvclock information. But
get_kvmclock_ns() performs only a single conversion directly from host
TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency.

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also
read the guest's kvmclock. Then it would proceed under the false
assumption that those two happened at the *same* time. Any time which
*actually* elapsed between reading the two clocks was introduced as an
error in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads
the host TSC just *once*, then use the returned TSC value to calculate
the kvmclock (making sure to do that the way the guest would, instead
of making the same mistake get_kvmclock_ns() does).

Sadly, the kernel doesn't let us have hrtimers on CLOCK_MONOTONIC_RAW,
so use CLOCK_MONOTONIC. In practice the slight difference between the
two won't matter over the timescales involved, as we don't care about
the *absolute* values, just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed,
called kvm_get_monotonic_and_clockread() because that's what it does.
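For reference, the guest-side conversion described above is the
standard pvclock algorithm. A rough standalone sketch of it follows;
the helper name and the cut-down struct are illustrative only, not part
of the patch (the real fields live in struct pvclock_vcpu_time_info):

#include <stdint.h>

/* The relevant fields of the pvclock ABI, trimmed for illustration. */
struct pvclock_fields {
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* How the guest turns its (already scaled) TSC into kvmclock nanoseconds. */
static uint64_t guest_kvmclock_ns(const struct pvclock_fields *pv,
				  uint64_t guest_tsc)
{
	uint64_t delta = guest_tsc - pv->tsc_timestamp;

	/* Shift first, then a 32.32 fixed-point multiply; both truncate. */
	if (pv->tsc_shift >= 0)
		delta <<= pv->tsc_shift;
	else
		delta >>= -pv->tsc_shift;

	return pv->system_time +
	       (uint64_t)(((unsigned __int128)delta * pv->tsc_to_system_mul) >> 32);
}

Each mul/shift stage truncates, so chaining host TSC -> guest TSC ->
nanoseconds does not in general give the same answer as a single
host TSC -> nanoseconds conversion, which is the divergence that
tsdrift.c measures.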
Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
---
 arch/x86/kvm/x86.c | 30 ++++++++++++++
 arch/x86/kvm/x86.h |  1 +
 arch/x86/kvm/xen.c | 99 ++++++++++++++++++++++++++++++----------------
 3 files changed, 97 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..5b706b05f64c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2863,6 +2863,25 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 	return mode;
 }
 
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	unsigned long seq;
+	int mode;
+	u64 ns;
+
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		ns = gtod->clock.base_cycles;
+		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+		ns >>= gtod->clock.shift;
+		ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	*t = ns;
+
+	return mode;
+}
+
 static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2895,6 +2914,17 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 						      tsc_timestamp));
 }
 
+/* returns true if host is using TSC based clocksource */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+	/* checked again under seqlock below */
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+		return false;
+
+	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+						 tsc_timestamp));
+}
+
 /* returns true if host is using TSC based clocksource */
 static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 					   u64 *tsc_timestamp)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1e7be1f6ab29..c08c6f729965 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
 
 int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
 	gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 0ea6016ad132..46d68a30ec84 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@
 #include <xen/interface/sched.h>
 
 #include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>
 
 #include "cpuid.h"
 #include "trace.h"
@@ -144,17 +145,75 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+				bool linux_wa)
 {
+	uint64_t guest_now, host_tsc, guest_tsc;
+	int64_t kernel_now, delta;
+
+	/*
+	 * The guest provides the requested timeout in absolute nanoseconds
+	 * of the KVM clock - as *it* sees it, based on the scaled TSC and
+	 * the pvclock information provided by KVM.
+	 *
+	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
+	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+	 * difference won't matter much as there is no cumulative effect.
+	 *
+	 * Calculate the time for some arbitrary point in time around "now"
+	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+	 * delta between the kvmclock "now" value and the guest's requested
+	 * timeout, apply the "Linux workaround" described below, and add
+	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+	 * the absolute CLOCK_MONOTONIC time at which the timer should
+	 * fire.
+	 */
+	if (!kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+		/*
+		 * Even in this case, don't fall back to get_kvmclock_ns()
+		 * because it's broken; it has a systemic error in its
+		 * results because it scales directly from host TSC to
+		 * nanoseconds, and doesn't scale first to guest TSC and
+		 * *then* to nanoseconds as the guest does.
+		 *
+		 * There is a small error introduced here because time
+		 * continues to elapse between the ktime_get() and the
+		 * subsequent rdtsc().
+		 */
+		kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+		host_tsc = rdtsc();
+	}
+
+	/* Calculate the guest kvmclock as the guest would do it. */
+	guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+	guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
+	delta = guest_abs - guest_now;
+
+	/* Xen has a 'Linux workaround' in do_set_timer_op() which
+	 * checks for negative absolute timeout values (caused by
+	 * integer overflow), and for values about 13 days in the
+	 * future (2^50ns) which would be caused by jiffies
+	 * overflow. For those cases, it sets the timeout 100ms in
+	 * the future (not *too* soon, since if a guest really did
+	 * set a long timeout on purpose we don't want to keep
+	 * churning CPU time by waking it up).
+	 */
+	if (linux_wa) {
+		if (unlikely((int64_t)guest_abs < 0 ||
			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
			delta = 100 * NSEC_PER_MSEC;
			guest_abs = guest_now + delta;
		}
	}
+
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	if (delta_ns <= 0) {
+	if (delta <= 0) {
 		xen_timer_callback(&vcpu->arch.xen.timer);
 	} else {
-		ktime_t ktime_now = ktime_get();
 		hrtimer_start(&vcpu->arch.xen.timer,
-			      ktime_add_ns(ktime_now, delta_ns),
+			      ktime_add_ns(kernel_now, delta),
 			      HRTIMER_MODE_ABS_HARD);
 	}
 }
@@ -923,8 +982,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		/* Start the timer if the new value has a valid vector+expiry. */
 		if (data->u.timer.port && data->u.timer.expires_ns)
 			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
-					    data->u.timer.expires_ns -
-					    get_kvmclock_ns(vcpu->kvm));
+					    false);
 
 		r = 0;
 		break;
@@ -1340,7 +1398,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 {
 	struct vcpu_set_singleshot_timer oneshot;
 	struct x86_exception e;
-	s64 delta;
 
 	if (!kvm_xen_timer_enabled(vcpu))
 		return false;
@@ -1374,13 +1431,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 			return true;
 		}
 
-		delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
-		if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) {
-			*r = -ETIME;
-			return true;
-		}
-
-		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
 		*r = 0;
 
 		return true;
@@ -1404,25 +1455,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 		return false;
 
 	if (timeout) {
-		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-		int64_t delta = timeout - guest_now;
-
-		/* Xen has a 'Linux workaround' in do_set_timer_op() which
-		 * checks for negative absolute timeout values (caused by
-		 * integer overflow), and for values about 13 days in the
-		 * future (2^50ns) which would be caused by jiffies
-		 * overflow. For those cases, it sets the timeout 100ms in
-		 * the future (not *too* soon, since if a guest really did
-		 * set a long timeout on purpose we don't want to keep
-		 * churning CPU time by waking it up).
-		 */
-		if (unlikely((int64_t)timeout < 0 ||
-			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
-			delta = 100 * NSEC_PER_MSEC;
-			timeout = guest_now + delta;
-		}
-
-		kvm_xen_start_timer(vcpu, timeout, delta);
+		kvm_xen_start_timer(vcpu, timeout, true);
 	} else {
 		kvm_xen_stop_timer(vcpu);
 	}
-- 
2.41.0
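(Not part of the patch: for anyone who doesn't want to fetch timerlat.c
from the URL above, a rough stand-in for what it measures, i.e. the
requested versus observed length of a short sleep in the guest, could
look like the sketch below. The names and output format are only an
approximation of the real test program.)

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Ask for req_ns of sleep and report how long it actually took. */
static int64_t measure_once(long req_ns)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = req_ns };
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
	const long req_ns = 100000;	/* 100µs, as in the output quoted above */

	for (int i = 0; i < 5; i++) {
		int64_t got = measure_once(req_ns);

		printf("%08d latency %lld/%ld (%.4f%%)\n", i,
		       (long long)got, req_ns,
		       100.0 * (got - req_ns) / req_ns);
	}
	return 0;
}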