Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

"David Woodhouse" <dwmw2@xxxxxxxxxxxxx> · Sun, 20 Mar 2022 09:46:35 -0000

> The offset interface completely punts the decision around guest clocks
> to userspace. We (KVM) have absolutely no idea what userspace is about
> to do with the guest. The guest could be paused for 5 seconds or 5
> years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> and let KVM do the heavy lifting could completely wreck the guest's
> monotonic clock.
>
> Additionally, it is impossible for userspace to enforce policy/limits on
> how much to time travel a guest with a value-based interface. Any event
> could sneak in between the time userspace checks the value and KVM sets
> the L1 offset. Offsets are idempotent and will still uphold userspace's
> intentions even if an inordinate amount of time elapses until KVM
> processes it.

Thanks for the detailed explanation. One part confuses me, though: why
can't userspace impose those same limits using a (TOD, value) tuple?
Userspace can still look at the TOD from before the brownout period
started, and declare that it is too far in the past.

If the event happens *after* userspace has decided that the migration was
quick enough, but before the vCPUs are actually running again, even the
offset based interface doesn't protect against that.
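
To make the question concrete, here is a rough sketch of the kind of check
I mean. The names (ClockSnapshot, MAX_BROWNOUT_NS, guest_tsc_on_resume) and
the 5-second limit are made up for illustration and are not an existing KVM
or VMM interface. It also assumes the source and destination TOD clocks
agree with each other.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000
# Hypothetical policy limit: refuse to advance the guest by more than this.
MAX_BROWNOUT_NS = 5 * NSEC_PER_SEC


@dataclass(frozen=True)
class ClockSnapshot:
    """A (TOD, guest TSC) pair captured on the source before the brownout."""
    tod_ns: int      # host time of day, in nanoseconds
    guest_tsc: int   # guest TSC value read at that same instant


def guest_tsc_on_resume(snap, now_tod_ns, guest_tsc_hz, max_ns=MAX_BROWNOUT_NS):
    """Return the guest TSC to resume with, or raise if the limit is exceeded."""
    elapsed_ns = now_tod_ns - snap.tod_ns
    if elapsed_ns < 0:
        raise ValueError("destination TOD is behind the snapshot")
    if elapsed_ns > max_ns:
        raise ValueError("snapshot is too far in the past")
    # Advance the guest TSC by the ticks that elapsed at the guest frequency.
    return snap.guest_tsc + elapsed_ns * guest_tsc_hz // NSEC_PER_SEC
```

The check only covers the time up to the moment userspace runs it; it says
nothing about what happens between the check and the vCPUs actually running.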

> Apologies for grandstanding, but clocks have been a real source of pain
> during migration. I do agree that the documented algorithm is a mess at
> the moment, given that there's no good way for userspace to transform
> host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> nice to have, but probably not ABI to hang this whole thing off of.
>
> What do you folks think about having a new R/O vCPU attribute that
> returns a { TOD, guest_tsc } pair? I believe that would immediately
> satisfy the needs of upstream to implement clock-advancing live
> migration.

Hm, I need to do some more thinking here. I poked at this because, with
TSC scaling, it was just utterly hosed even before we think about clock
jumps: userspace naively creates a bunch of vCPUs and sets their TSC
frequency + value, and they all end up with unsynced TSC values.
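
A toy model of that failure, assuming the usual relationship
guest_tsc = scale(host_tsc) + offset with the ratio held as a fixed-point
multiplier. The 48 fractional bits and all of the function names are
assumptions for illustration, not lifted from the KVM source, and the model
ignores any synchronisation KVM itself attempts on TSC writes.

```python
FRAC_BITS = 48  # assumed fixed-point precision of the scaling ratio


def scaling_ratio(guest_hz, host_hz):
    """Fixed-point guest/host frequency ratio."""
    return (guest_hz << FRAC_BITS) // host_hz


def scale(host_tsc, ratio):
    """Convert a host TSC reading into guest-frequency ticks."""
    return (host_tsc * ratio) >> FRAC_BITS


def offset_for(guest_tsc, host_tsc, ratio):
    """Offset that makes the guest read guest_tsc at host time host_tsc."""
    return guest_tsc - scale(host_tsc, ratio)


def read_guest_tsc(host_tsc, ratio, offset):
    """What the guest sees at host time host_tsc."""
    return scale(host_tsc, ratio) + offset


def naive_offsets(value, write_times, ratio):
    """Write the same TSC value on each vCPU, each at a different host TSC."""
    return [offset_for(value, t, ratio) for t in write_times]


def synced_offsets(ref_guest_tsc, ref_host_tsc, n_vcpus, ratio):
    """Derive one offset from a single reference pair and use it everywhere."""
    return [offset_for(ref_guest_tsc, ref_host_tsc, ratio)] * n_vcpus
```

In this model the naive path gives every vCPU a different offset, because
each write lands at a different host TSC, while a single reference pair
gives them all the same one.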

But coincidentally, since then I have started having conversations with
people who really want the guest to know the adjtimex maxerror etc. on
the new host immediately after the migration. Maybe the "if the migration
isn't fast enough then let the guest know it's now unsynced" approach is
OK, but I'll need to work out what "immediately" means when we have a
guest userspace component involved in it.

I'll need to do that with a real screen and keyboard though, and fingers
that aren't freezing as I sit by a 9-year-old's hockey training...

-- 
dwmw2