Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Wed, 23 Mar 2022 12:35:17 +0000

On Tue, 2022-03-22 at 21:53 +0000, Oliver Upton wrote:
> But what happens to CLOCK_MONOTONIC in this case? We are still accepting
> the fact that live migrations destroy CLOCK_MONOTONIC if we directly
> advance the guest TSCs to account for elapsed time. The definition of
> CLOCK_MONOTONIC is that the clock does not count while the system is
> suspended. From the viewpoint of the guest, a live migration appears to
> be a forced suspend operation at an arbitrary instruction boundary.
> There is no realistic way for the guest to give the illusion that
> MONOTONIC has stopped without help from the hypervisor.

I'm a little lost there. CLOCK_MONOTONIC is *permitted* to stop when
the guest is suspended, but it's not *mandatory*, surely?

I can buy your assertion that the brownout period of a live migration
(or the time for the host kernel to kexec in the case of live update) 
can be considered a suspend... but regardless of whether that makes it
mandatory to stop the clocks, I prefer to see it a different way. 

In normal operation — especially with CPU overcommit and/or throttling
— there are times when none of the guest's vCPUs will be scheduled for
short periods of time. We don't *have* to stop CLOCK_MONOTONIC when
that happens, do we?

If we want live migration to be guest-transparent, shouldn't we treat
is as much as possible as one of those times when the vCPUs just happen
not to be running for a moment?

On a live update, where the host does a kexec and then resumes the
guest state, the host TSC reference is precisely the same as before the
upate. We basically don't want to change *anything* that the guest sees
in its pvclock information. In fact, we have a local patch to
'KVM_SET_CLOCK_FROM_GUEST' for the live update case, which ensures
exactly that. We then add a delta to the guest TSC as we create each
vCPU in the new KVM; the *offset* interface would be beneficial to us
here (since that offset doesn't change) but we're not using it yet.

For live migration, the same applies — we can just add a delta to the
clock and the guest TSC values, commensurate with the amount of
wallclock time that elapsed from serialisation on the source host, to
deserialisation on the destination.

And it all looks *just* like it would if the vCPUs happened not to be
scheduled for a little while, because the host was busy.

> > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> 
> I do not disagree that NTP needs to throw the book out after a live
> migration.
> 
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.

Ah, but that's the point. A flag in shared memory can be checked
whenever the guest needs to know that it's operating on valid state.
Linux will check it *every* time from pvclock_clocksource_read().

As opposed to a separate interrupt which eventually gets processed some
indefinite amount of time in the future.

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.
> 
I dont understand why we can't "step the monotonic clock". Any time
merely refrain from scheduling the vCPUs for any period of time, that
is surely indistinguishable from a "step" in the monotonic clock,
surely?

> Sorry to revisit this conversation yet again. Virtualization isn't going
> away any time soon and the illusion that migrations are invisible to the
> guest is simply not true.

I'll give you the assertion that migrations aren't completely
invisible, but I still think they should be *equivalent* to the vCPU
just not being scheduled for a moment.

Attachment:
smime.p7s

Description: S/MIME cryptographic signature