Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Thu, 30 Jun 2022 12:58:52 +0100

On Tue, 2022-03-29 at 09:02 -0700, Oliver Upton wrote:
> 
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

I don't admit to pointing that out in that form because I don't accept
the use of the term "advance".

Clocks advance *themselves*. That's what clocks do.

When we perform a live update or live migration we might *adjust* those
clocks, calibrate, synchronise or restore them. But I'll eat my
keyboard before using the term "advance" for that. Even if our
adjustment is in a forward direction.

Let's consider the case of a live update — where we stop scheduling the
guest for a moment, kexec into the new kernel, then resume scheduling
the guest.

I assert strongly that from the guest point of view this is *no*
different to any other brief period of not being scheduled.

Yes, in practice we have a whole new kernel, a whole new KVM and set of
kvm_vcpus, and we've *restored* the state. And we have restored the
TSCs/clocks in those new kvm objects to precisely match what they were
before. Note: *match* not advance.

Before the kexec, there were a bunch of relationships between clocks,
mostly based on the host TSC Tₕ (assuming the case where that's stable
and reliable):

 • The kernel's idea of wallclock time was based on Tₕ, plus some
   offset and divided by some frequency. NTP tweaks those values over
   time but at any given instant there is a current value for them
   which is used to derive the wallclock time.

 • The kernel's idea of the guest kvmclock epoch (nanoseconds since the
   KVM started) was based on Tₕ and some other offset and hopefully the
   same frequency. 

 • The TSC of each vCPU was based on Tₕ, some offset and a TSC scaling
   factor.

After a live update, the host TSC Tₕ is just the same as it always was.
Not the same *value* of course; that was never the case from one tick
to the next anyway. It's the same, in that it continues to advance
*itself* at a consistent frequency as real time progresses, which is
what clocks do.

In the new kernel we just want all those other derivative clocks to
*also* be the same as before. That is, the offset and multipliers are
the *same* value. We're not "advancing" those clocks. We're
*preserving* them.
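
To make the "preserve, not advance" point concrete, here is a minimal
sketch. It assumes each derived clock can be modelled as a linear
function of the host TSC; the DerivedClock class, its field names and
the numbers are all hypothetical and are not KVM's actual data
structures or arithmetic.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class DerivedClock:
    """Hypothetical model: a clock read as mult * T_h + offset."""
    mult: Fraction    # scaling applied to the host TSC (assumed form)
    offset: Fraction  # constant offset in the clock's own units

    def read(self, host_tsc: int) -> Fraction:
        return self.mult * host_tsc + self.offset


def restore_after_live_update(saved: DerivedClock) -> DerivedClock:
    # The new kernel re-creates the clock with the *same* parameters.
    # Nothing here depends on how long the guest was not scheduled.
    return DerivedClock(mult=saved.mult, offset=saved.offset)


# Illustrative numbers only: a 3 GHz host TSC feeding a nanosecond clock.
kvmclock = DerivedClock(mult=Fraction(1, 3), offset=Fraction(-1_000))
restored = restore_after_live_update(kvmclock)

tsc_at_resume = 9_000_000  # the host TSC kept counting across the kexec
assert restored.read(tsc_at_resume) == kvmclock.read(tsc_at_resume)
```

The restore step never looks at the length of the pause. The restored
clock reads whatever the original would have read at the same host TSC
value, which is all the guest can observe.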

For live migration it's slightly harder because we don't have a
consistent host TSC to use as the basis. The best we can do is NTP-
synchronised wallclock time between the two hosts. And thus I think we
want *these* constants to be preserved across the migration:

 The KVM's kvmclock was <K> at a given wallclock time <W>

 The TSC of each vCPU#n was <Tₙ> at a given value of kvmclock <Kₙ>

In *this* case we are running on different hardware, and relying on
NTP wallclock time as the basis for preserving the guest clocks may
have introduced an error, on top of whatever error the change of
hardware brings. So in this case we should indeed inform the guest
that it should consider itself out of NTP sync and start over, in
*addition* to making a best effort to preserve those clocks.
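
As a rough illustration of how the two preserved constants could be
used on the destination, the sketch below extrapolates kvmclock from
wallclock time and then each vCPU's TSC from kvmclock. The function
names, the guest_tsc_hz parameter and the sample values are
assumptions made for the example; this is not the KVM API.

```python
NSEC_PER_SEC = 1_000_000_000


def kvmclock_at(wallclock_ns: int, k_ns: int, w_ns: int) -> int:
    """kvmclock (ns) now, given it was k_ns at wallclock time w_ns."""
    return k_ns + (wallclock_ns - w_ns)


def guest_tsc_at(kvmclock_ns: int, t_n: int, k_n_ns: int,
                 guest_tsc_hz: int) -> int:
    """vCPU TSC now, given it was t_n when kvmclock read k_n_ns.

    guest_tsc_hz is the (assumed known) guest TSC frequency; integer
    division truncates any sub-tick remainder.
    """
    elapsed_ns = kvmclock_ns - k_n_ns
    return t_n + elapsed_ns * guest_tsc_hz // NSEC_PER_SEC


# Constants recorded on the source host (illustrative values only).
W = 1_656_590_332_000_000_000   # wallclock, ns since the Unix epoch
K = 5_000_000_000               # kvmclock at W
T_n, K_n = 15_000_000_000, K    # vCPU TSC at kvmclock K_n

# On the destination, 250 ms later by its NTP-synchronised wallclock.
k_dest = kvmclock_at(W + 250_000_000, K, W)
tsc_dest = guest_tsc_at(k_dest, T_n, K_n, guest_tsc_hz=3_000_000_000)
```

Any disagreement between the two hosts' wallclocks lands directly in
k_dest and from there in tsc_dest, which is why the guest should be
told to treat itself as out of NTP sync.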

But there is no scope for the word "advance" to be used anywhere there
either.
