Re: [PATCH] KVM: x86: Refine calculation of guest wall clock to use a single TSC read

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Mon, 02 Oct 2023 08:24:03 +0100

On Sun, 2023-10-01 at 22:14 -0700, Dongli Zhang wrote:
> > They look sane enough but there's still skew over time (in both of
> > them) as the KVM values get adjusted in presumably similarly sloppy
> > ways. But we'll get to this. This patch is just the first low hanging
> 
> About the "skew over time", would you mean the results of
> kvm_get_wall_clock_epoch() keeps changing over time?
> 
> Although without testing, I suspect it is because of two reasons:
> 
> 1. Would you mind explaining what does "as the KVM values get adjusted" mean?

I hadn't quite worked it out yet. You'll note that part was below the
'---' of the patch and wasn't quite as coherent. I suspect you already
know, though.

> The kvm_get_walltime_and_clockread() call is based host monotonic clock, which
> may be adjusted (unlike raw monotonic).

Well, walltime has other problems too, which is why it's actually the
wrong thing to use for the KVM_CLOCK_REALTIME feature of
get_kvmclock(). When a leap second occurs, certain values of
CLOCK_REALTIME are *ambiguous* — they occur twice, just like certain
times between 1am and 2am on the morning when the clocks go backwards
in the winter. We should have been using CLOCK_TAI for that one. But
that's a different issue...

> 2. The host monotonic clock and kvmclock may use different mult/shift.
> 
> The equation is A-B.
> 
> A is the current host wall clock, while B is for how long the VM has boot.
> 
> A-B will be the wallclock when VM is boot.
> 
> 
> A: ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec       --> monotonic clock
> B: __pvclock_read_cycles(&hv_clock, host_tsc); --> raw monotonic and kvmclock
> 
> 
> The A is from kvm_get_walltime_and_clockread() to get a pair of ns and tsc. It
> is based on monotonic clock, e.g., gtod->clock.shift and gtod->clock.mult.
> 
> BTW, the master clock is derived from raw monotonic, which uses
> gtod->raw_clock.shift and gtod->raw_clock.mult.
> 
> However, the incremental between host_tsc and master clock will be based on the
> mult/shift from kvmclock (indeed kvm_get_time_scale()).
> 
> Ideally, we may expect A and B increase in the same speed. Due to that they may
> use different mult/shift/equation, A and B may increase in the different speed.

Agreed. There'll be a small drift but it seemed larger than I expected.
I attempted to post an RFC a few days ago but it didn't seem to make it
to the list. Quoting from it...

But... it's *drifting*. If I run the xen_shinfo_selftest, which keeps
moving the shared_info around and causing the wallclock information to
be rewritten, I see it start with...

 $ sudo dmesg -w | head -10
[  611.980957] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068875 (1695916533777068623)
[  611.980995] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068874 (1695916533777068743)
[  611.981007] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068787)
[  611.981017] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068784)
[  611.981027] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068874 (1695916533777068786)
[  611.981036] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981046] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981055] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981065] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068786)
[  611.981075] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777068873 (1695916533777068782)
 $ dmesg | tail
[  615.679572] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037455 (1695916533777037423)
[  615.679575] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037455 (1695916533777037419)
[  615.679580] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679583] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679587] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679590] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037419)
[  615.679594] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037417)
[  615.679598] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037418)
[  615.679605] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037423)
[  615.679611] KVM 000000004ab4eef1: Calculated wall epoch 1695916533777037454 (1695916533777037420)

Now, I suppose it should drift a little bit, because it includes the
cumulative delta between CLOCK_REALTIME which is adjusted by NTP (and
even enjoys leap seconds), and CLOCK_MONOTONIC_RAW which has neither of
those things. But should it move by about 31µs in the space of 4
seconds?

 $ ntpfrob  -d
time = 1695917599.000628357
verbose = "Clock synchronized, no leap second adjustment pending."
time offset (ns) = 0
TAI offset = 37
frequency offset = -515992
error max (us) = 20056
error est (us) = 1083
clock cmd status = 0
pll constant = 2
clock precision (us) = 1
clock tolerance = 32768000
tick (us) = 10000

+		printk("KVM %p: Calculated wall epoch %lld (%lld)\n", kvm,
+		       ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec - __pvclock_read_cycles(&hv_clock, host_tsc),
+		       ktime_get_real_ns() - get_kvmclock_ns(kvm));

> About the 2nd reason, I have a patch in progress to refresh the master clock
> periodically, for the clock drift during CPU hotplug.
> 
> https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@xxxxxxxxxx/

Yeah, that's probably related to what I'm seeing. I'm testing by using
the xen_shinfo_test which keeps relocating the Xen shared_info page,
and *does* keep triggering a master clock update. Although now I look
harder at it, I'm not sure *why* it does so.

But *why* does the kvmclock use a different mult/shift to the
monotonic_raw clock? Surely it should be the same? I thought the
*point* in using CLOCK_MONOTONIC_RAW was that it was basically just a
direct usage of the host TSC without NTP adjustments...?

We're also seeing the clocksource watchdog trigger in Xen guests,
complaining that the Xen vs. tsc clocksources are drifting w.r.t. one
another — when again they're both supposed to be derived from the
*same* guest TSC without interference. I assume that's the same
problem? And your trick of adjusting what we advertise to the guest
over time won't help there, because that won't stop it actually
drifting w.r.t. the raw TSC that the guest is comparing against.

[ 7200.708099] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 7200.708134] clocksource:                       'xen' wd_nsec: 508905888 wd_now: 68dcc1c556c wd_last: 68dadc70bcc mask: ffffffffffffffff
[ 7200.708139] clocksource:                       'tsc' cs_nsec: 511584233 cs_now: f12d5ec8ad3 cs_last: f128fca662b mask: ffffffffffffffff
[ 7200.708142] clocksource:                       'xen' (not 'tsc') is current clocksource.
[ 7200.708145] tsc: Marking TSC unstable due to clocksource watchdog
Attachment:
smime.p7s

Description: S/MIME cryptographic signature