On Mon, 2024-04-22 at 15:02 -0700, Chen, Zide wrote:
> the selftest works for me, and I ran the test for 1000+ iterations,
> w/ or w/o TSC scaling, the TEST_ASSERT(delta_corrected <= ±1) never
> got hit.

This is awesome! I think that with further care we can get even better
than that, so let's look at where that ±1 ns tolerance comes from.

Consider a 3GHz TSC. That gives us three ticks per nanosecond. Each
TSC value can be seen as (3n), (3n+1) or (3n+2) for a given nanosecond
n. If we take a new reference point at a (3n+2) TSC value and
calculate the KVM clock from that, we *know* we're going to round down
and lose two-thirds of a nanosecond.

So then we set the new KVM clock parameters to use that new reference
point, and that's why we have to allow a disruption of up to a single
nanosecond. In fact, I don't think it's ±1 ns, is it? It'll only ever
be in the same direction (rounding down)?

But if we're careful about *which* TSC value we use as the reference
point, we can reduce that error. The TSC value we use should be
*around* the current time, but what if we were to evaluate, say, the
previous 100 TSC values, pass *each* of them through the conversion to
nanoseconds, and use the one that comes *closest* to a precise
nanosecond (nnnnnnnn.000)?

It's even fairly easy to calculate those, because of the way the KVM
clock ABI has us multiply and then shift right by 32 bits. We just
need to look at those low 32 bits (the fractional nanosecond) *before*
shifting them out of existence. Something like...

	uint64_t tsc_candidate, tsc_candidate_last, best_tsc;
	uint32_t frac_ns_min = 0xffffffff;
	uint64_t frac_ns;

	best_tsc = tsc_candidate = rdtsc();
	tsc_candidate_last = tsc_candidate - 100;

	while (tsc_candidate-- > tsc_candidate_last) {
		uint64_t guest_tsc = kvm_scale_tsc(tsc_candidate, ...);

		frac_ns = guest_tsc * hvclock->tsc_to_system_mul;

		/* Shift *after* the multiplication, not before as
		 * pvclock_scale_delta() does. */
		if (hvclock->tsc_shift < 0)
			frac_ns >>= -hvclock->tsc_shift;
		else
			frac_ns <<= hvclock->tsc_shift;

		/* The low 32 bits are the fractional nanosecond. */
		if ((uint32_t)frac_ns <= frac_ns_min) {
			frac_ns_min = (uint32_t)frac_ns;
			best_tsc = tsc_candidate;
		}
	}

	printk("Best TSC to use for reference point is %llu\n", best_tsc);

And then you calculate your CLOCK_MONOTONIC_RAW and guest KVM clock
from *that* host TSC value, and thus minimise the discrepancies due to
rounding down? (A self-contained user-space toy of the same search is
appended at the end of this mail.)

Aside from the fact that I literally just typed that into email and
it's barely even thought through, let alone tested... I'm just not
sure it's even worth the runtime cost, for that ±1 ns on a rare case.

A slop of ±1 ns is probably sufficient, because over the past few
years we've already shifted the definition of the KVM clock to *not*
be NTP-corrected, and we leave guests to do fine-grained
synchronization through other means anyway.

But I see talk of people offering a PPS signal to *hundreds* of guests
on the same host simultaneously, just for them all to use it to
calibrate the same underlying oscillator. Which is a little bit
insane. We *should* have a way for the host to do that once and then
expose the precise time to its guests, in a much saner way than the
KVM clock does. I'll look at adding something along those lines to
this series too, which can be driven from the host's adjtimex()
adjustments (which KVM already consumes), and fed into each guest's
timekeeping as a PTP/PPS device or something.
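For anyone who wants to play with the rounding behaviour, here's a
self-contained user-space toy of the same search. The scenario is
assumed rather than taken from a real hvclock: the multiplier is
hand-derived for a 3GHz TSC (mul = round(2^32 / 3), tsc_shift = 0),
and there's no kvm_scale_tsc() step.

	#include <stdint.h>
	#include <stdio.h>

	/* Assumed pvclock-style parameters for a 3GHz TSC:
	 * ns = (tsc * tsc_to_system_mul) >> 32, with tsc_shift == 0.
	 * 1431655766 is round(2^32 / 3). */
	static const uint32_t tsc_to_system_mul = 1431655766u;
	static const int tsc_shift = 0;

	int main(void)
	{
		/* Pretend rdtsc() returned this; any value works. */
		uint64_t now = 3000000002ull;
		uint64_t best_tsc = now;
		uint32_t frac_ns_min = 0xffffffff;

		for (uint64_t tsc = now; tsc > now - 10; tsc--) {
			uint64_t prod = tsc * (uint64_t)tsc_to_system_mul;

			if (tsc_shift < 0)
				prod >>= -tsc_shift;
			else
				prod <<= tsc_shift;

			/* High bits: whole ns; low 32: fractional ns. */
			uint32_t frac_ns = (uint32_t)prod;

			printf("tsc %llu -> %llu ns + 0x%08x/2^32 ns\n",
			       (unsigned long long)tsc,
			       (unsigned long long)(prod >> 32), frac_ns);

			if (frac_ns <= frac_ns_min) {
				frac_ns_min = frac_ns;
				best_tsc = tsc;
			}
		}

		printf("best reference TSC: %llu (frac 0x%08x)\n",
		       (unsigned long long)best_tsc, frac_ns_min);
		return 0;
	}

Note the winning candidate need not be a multiple of 3 here: the
multiplier's own rounding error accumulates into the phase at large
TSC values. The search keys off the *computed* conversion, though,
which is exactly what the reference point actually uses, so minimising
its fractional part is still the right thing to do.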