Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

On Sun, Mar 20, 2022 at 09:10:15AM +0100, Paolo Bonzini wrote:
> On 3/19/22 14:13, David Woodhouse wrote:
> > 
> > 
> > > On 3/19/22 12:54, Paolo Bonzini wrote:
> > > > On 3/19/22 09:08, David Woodhouse wrote:
> > > > > If a basic API requires this much documentation, my instinct is to
> > > > > *fix* it with fire first, then document what's left.
> > > > I agree, but you're missing all the improvements that went in together
> > > > with the offset API in order to enable the ugly algorithm.
> > > > 
> > > > > A userspace-friendly API for migration would be more like KVM on the
> > > > > source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
> > > > > the destination host (after providing the TSC frequency) is give it
> > > > > precisely the same data.
> > > 
> > > I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
> > > {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
> > > and in that case it wouldn't even be KVM returning that host information.
> > 
> > I would have said nobody cares about the host TSC value and frequency.
> > That is for KVM to know and deal with internally.
> 
> There are two schools as to how to do migration.  The QEMU school is to just
> load back the guest TOD and TSC and let NTP resync.  They had better be
> synced, but a difference of a few microseconds might not matter.
> 
> This has the advantage of not showing the guest that there was a pause.
> QEMU is doing it this way due to not having postcopy live migration for a
> long time; precopy is subject to longer brownout between source and
> destination, which might result in soft lockups.  Apart from this it really
> only has disadvantages.
> 
> The Google school has the destination come up with the guest TOD and TSC
> that takes into account the length of the brownout phase.  This is where the
> algorithm in Documentation/ comes into play, and why you need the host pair
> as well.  Actually Google does not use it because they already have precise
> time available to userspace as part of Spanner.  Maybe so does Amazon (?),
> but for the rest of the world the host {TOD, TSC} pair is required to
> compute what the guest TSC "should look like" on the destination.
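The "Google school" computation Paolo describes can be sketched as follows. This is a simplified illustration, not KVM code; all names are invented, and it assumes a constant guest TSC frequency and TOD values already expressed in nanoseconds:

```python
# Hypothetical sketch: the destination derives a guest TSC that accounts
# for the migration brownout, using wall-clock time elapsed between the
# source snapshot and the destination resume.

def guest_tsc_on_destination(src_guest_tsc, src_tod_ns, dst_tod_ns,
                             guest_tsc_khz):
    """Advance the guest TSC by the wall-clock time the guest was away.

    The guest TSC ticks at guest_tsc_khz regardless of the host TSC
    rate, which is why the host pair alone is not enough.
    """
    elapsed_ns = dst_tod_ns - src_tod_ns
    return src_guest_tsc + (elapsed_ns * guest_tsc_khz) // 1_000_000

# Example: a 2 GHz guest paused for 100 ms gains 200 million ticks.
print(guest_tsc_on_destination(0, 0, 100_000_000, 2_000_000))
```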

Hey, beat me to the punch :) Paolo is pretty much spot on, but there are
a few additional details here that I believe are relevant.

I really don't think we want to effectively step the guest's monotonic
clock if at all possible. It hurts when you do this for large windows,
and leads to soft lockups as you've noted above. On top of that, it's a
kludgy way to advance the guest's realtime clock without informing the
guest that it is about to experience time travel.

Given all of this, there is a limit to how much we advance the TSC in
the Google school. If this limit is exceeded we refuse to step the TSC
further and inform the guest it has experienced time travel [1]. It is
an attempt to bridge the gap and avoid completely laying waste to guest
clocks while hiding the migration if we're confident it was smooth
enough. Beyond that, guest userspace wants to be apprised of time
travel as well (TFD_TIMER_CANCEL_ON_SET). Having the guest clean up a
messy migration ensures that this all 'just works'.
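The capped-advancement policy described above might look like the toy sketch below. The threshold value and function names are made up; the real limit and notification mechanism (via the linked guest agent) are implementation details:

```python
# Illustrative policy: hide the migration by advancing the TSC only if
# the pause was short enough; otherwise leave the clocks alone and tell
# the guest it time traveled so it can resync (e.g. via an agent).

MAX_ADVANCE_NS = 5_000_000_000  # invented threshold: 5 seconds

def plan_tsc_step(elapsed_ns):
    if elapsed_ns <= MAX_ADVANCE_NS:
        return ("advance_tsc", elapsed_ns)  # migration stays invisible
    return ("notify_guest", 0)              # guest cleans up its clocks
```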

The offset interface completely punts the decision around guest clocks
to userspace. We (KVM) have absolutely no idea what userspace is about
to do with the guest. The guest could be paused for 5 seconds or 5
years. Encouraging host userspace to just read/write a { TOD, TSC } pair
and let KVM do the heavy lifting could completely wreck the guest's
monotonic clock.

Additionally, it is impossible for userspace to enforce policy/limits on
how much to time travel a guest with a value-based interface. Any event
could sneak in between the time userspace checks the value and KVM sets
the L1 offset. Offsets are idempotent and will still uphold userspace's
intentions even if an inordinate amount of time elapses until KVM
processes it.
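A toy model of why an offset is race-free where a value is not (all numbers and names are illustrative): the guest TSC is derived as host TSC plus offset, so a fixed offset yields the same guest clock behavior no matter when KVM applies it, while a captured value goes stale the moment the host TSC keeps ticking.

```python
# Offset-based: the relationship guest = host + offset holds whenever
# the write lands, so delays between userspace's decision and KVM's
# application cannot skew the guest clock.
def guest_tsc_with_offset(host_tsc, offset):
    return host_tsc + offset

# Value-based: the captured value reflects the moment it was computed;
# any time that elapses before KVM applies it is silently lost.
def guest_tsc_with_value(value):
    return value

offset = 0
# Host TSC advances from 100 to 150 while the write sits in flight:
assert guest_tsc_with_offset(150, offset) == 150  # still correct
assert guest_tsc_with_value(100) == 100           # guest lost 50 ticks
```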

Apologies for grandstanding, but clocks have been a real source of pain
during migration. I do agree that the documented algorithm is a mess at
the moment, given that there's no good way for userspace to transform
host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
nice to have, but probably not an ABI to hang this whole thing off of.

What do you folks think about having a new R/O vCPU attribute that
returns a { TOD, guest_tsc } pair? I believe that would immediately
satisfy the needs of upstream to implement clock-advancing live
migration.
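If such an attribute existed, the end-to-end flow might look like this sketch. The attribute itself, its plumbing, and every name here are hypothetical; the point is that the destination programs an offset, not a value, as argued above:

```python
# Hypothetical migration flow built on a R/O { TOD, guest_tsc } pair.
# snapshot: the pair read on the source; now_ns: destination TOD at
# resume; host_tsc_now: destination host TSC at the same instant.

def tsc_offset_for_destination(snapshot, now_ns, guest_tsc_khz,
                               host_tsc_now):
    src_tod_ns, src_guest_tsc = snapshot
    elapsed_ns = now_ns - src_tod_ns
    # Target guest TSC that accounts for the brownout length.
    target = src_guest_tsc + (elapsed_ns * guest_tsc_khz) // 1_000_000
    # Return the offset to program; applying it late cannot skew the
    # guest clock, unlike writing the raw value.
    return target - host_tsc_now
```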

[1]: https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google_guest_agent/clock.go
--
Thanks,
Oliver


