On Fri, 2023-12-15 at 09:07 +0000, Durrant, Paul wrote: > On 14/12/2023 16:54, David Woodhouse wrote: > > From: David Woodhouse <dwmw@xxxxxxxxxxxx> > > > > A test program such as http://david.woodhou.se/timerlat.c confirms user > > reports that timers are increasingly inaccurate as the lifetime of a > > guest increases. Reporting the actual delay observed when asking for > > 100µs of sleep, it starts off OK on a newly-launched guest but gets > > worse over time, giving incorrect sleep times: > > > > root@ip-10-0-193-21:~# ./timerlat -c -n 5 > > 00000000 latency 103243/100000 (3.2430%) > > 00000001 latency 103243/100000 (3.2430%) > > 00000002 latency 103242/100000 (3.2420%) > > 00000003 latency 103245/100000 (3.2450%) > > 00000004 latency 103245/100000 (3.2450%) > > > > The biggest problem is that get_kvmclock_ns() returns inaccurate values > > when the guest TSC is scaled. The guest sees a TSC value scaled from the > > host TSC by a mul/shift conversion (hopefully done in hardware). The > > guest then converts that guest TSC value into nanoseconds using the > > mul/shift conversion given to it by the KVM pvclock information. > > > > But get_kvmclock_ns() performs only a single conversion directly from > > host TSC to nanoseconds, giving a different result. A test program at > > http://david.woodhou.se/tsdrift.c demonstrates the cumulative error > > over a day. > > > > It's non-trivial to fix get_kvmclock_ns(), although I'll come back to > > that. The actual guest hv_clock is per-CPU, and *theoretically* each > > vCPU could be running at a *different* frequency. But this patch is > > needed anyway because... > > > > The other issue with Xen timers was that the code would snapshot the > > host CLOCK_MONOTONIC at some point in time, and then... after a few > > interrupts may have occurred, some preemption perhaps... would also read > > the guest's kvmclock. Then it would proceed under the false assumption > > that those two happened at the *same* time. Any time which *actually* > > elapsed between reading the two clocks was introduced as inaccuracies > > in the time at which the timer fired. > > > > Fix it to use a variant of kvm_get_time_and_clockread(), which reads the > > host TSC just *once*, then use the returned TSC value to calculate the > > kvmclock (making sure to do that the way the guest would instead of > > making the same mistake get_kvmclock_ns() does). > > > > Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen > > timers still have to use CLOCK_MONOTONIC. In practice the difference > > between the two won't matter over the timescales involved, as the > > *absolute* values don't matter; just the delta. > > > > This does mean a new variant of kvm_get_time_and_clockread() is needed; > > called kvm_get_monotonic_and_clockread() because that's what it does. > > > > Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") > > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx> > > --- > > v3: > > • Rebase and repost. > > > > v2: > > • Fall back to get_kvmclock_ns() if vcpu-arch.hv_clock isn't set up > > yet, with a big comment explaining why that's actually OK. > > • Fix do_monotonic() *not* to add the boot time offset. > > • Rename do_monotonic_raw() → do_kvmclock_base() and add a comment > > to make it clear that it *does* add the boot time offset. That > > was just left as a bear trap for the unwary developer, wasn't it? > > > > arch/x86/kvm/x86.c | 61 +++++++++++++++++++++-- > > arch/x86/kvm/x86.h | 1 + > > arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++----------- > > 3 files changed, 149 insertions(+), 34 deletions(-) > > > > Reviewed-by: Paul Durrant <paul@xxxxxxx> Ping?
Attachment:
smime.p7s
Description: S/MIME cryptographic signature