Re: [PATCH v3] KVM: x86/xen: improve accuracy of Xen timers

"Durrant, Paul" <xadimgnik@xxxxxxxxx> · Fri, 15 Dec 2023 09:07:25 +0000

On 14/12/2023 16:54, David Woodhouse wrote:
From: David Woodhouse <dwmw@xxxxxxxxxxxx>

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
---
v3:
   • Rebase and repost.

v2:
   • Fall back to get_kvmclock_ns() if vcpu-arch.hv_clock isn't set up
     yet, with a big comment explaining why that's actually OK.
   • Fix do_monotonic() *not* to add the boot time offset.
   • Rename do_monotonic_raw() → do_kvmclock_base() and add a comment
     to make it clear that it *does* add the boot time offset. That
     was just left as a bear trap for the unwary developer, wasn't it?

  arch/x86/kvm/x86.c |  61 +++++++++++++++++++++--
  arch/x86/kvm/x86.h |   1 +
  arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
  3 files changed, 149 insertions(+), 34 deletions(-)

Reviewed-by: Paul Durrant <paul@xxxxxxx>