Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

On Wed, 2022-03-23 at 16:21 +0000, Oliver Upton wrote:
> On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> > And it all looks *just* like it would if the vCPUs happened not to be
> > scheduled for a little while, because the host was busy.
> 
> We could continue to get away with TSC advancement, but the critical
> part IMO is the upper bound. And what happens when we exceed it.
> 
> There is no authoritative documentation around what time looks like as a
> guest of KVM, and furthermore what happens when a guest experiences time
> travel. Now we're in a particularly undesirable situation where there
> are at least three known definitions for time during a migration
> (upstream QEMU, Google, Amazon) and it is ~impossible to program guest
> software to anticipate our shenanigans.
> 
> If we are to do this right we probably need to agree on documented
> behavior. If we decide that advancing TSCs is acceptable up to 'X'
> seconds, guest kernels could take a change to relax expectations at
> least up to this value.

I'm not sure I even buy this line of argument at all. I don't think of
it as 'TSC advancement'. Or maybe I should put that the other way
round: TSCs advance *all* the time; that's the whole point in them. 

All we do when we live-migrate is move the state from one host to
another, completely intact. From the guest point of view we don't
"advance" the TSC; we just set the guest TSC on the destination host to
precisely¹ the same value that the guest TSC on the source host would
have at that moment if we were to abort the migration and do a
rollback. Either way (rollback or not), the vCPUs weren't *running* for
a period of time, but time and the TSCs are exactly the same in either case.

In fact, on live update we literally *don't* change anything that the
guest sees. The pvclock information that the guest gets is *precisely*
the same as before, and the TSC has 'advanced' purely by nature of
being precisely the same CPU just a short period of time in the
future... because that's what TSCs do when time passes :)
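
To put the live migration case in concrete terms, the arithmetic on the
destination amounts to something like the sketch below. This is purely
illustrative, not the actual KVM code, and the names are made up:

#include <stdint.h>

/*
 * Illustrative sketch only, not KVM code: the destination sets the guest
 * TSC to the value the source guest's TSC would have reached after the
 * elapsed wall-clock time, so from the guest's point of view the TSC just
 * kept counting while the vCPUs weren't scheduled.
 */
static uint64_t dest_guest_tsc(uint64_t src_tsc, uint64_t elapsed_ns,
                               uint64_t guest_tsc_khz)
{
        /*
         * ticks = elapsed_ns * (guest_tsc_khz * 1000) / 1e9
         *       = elapsed_ns * guest_tsc_khz / 1e6,
         * done in 128 bits to avoid overflow for long gaps.
         */
        return src_tsc +
               (uint64_t)(((unsigned __int128)elapsed_ns * guest_tsc_khz) /
                          1000000);
}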

Really, we aren't talking about a "maximum advancement of the TSC".
We're talking about a "maximum period for which the vCPUs aren't
scheduled".

And I don't see why there need to be documented hard limits on that. I
see it as a quality of implementation issue. If the host is in swap
death and doesn't manage to run the guest vCPUs for seconds at a time,
that's fairly crappy, but it's not strictly a *timekeeping* issue. And
it's just the same if the final transition period of live migration (or
the kexec time for live update) is excessive.

> > > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > > 
> > > I do not disagree that NTP needs to throw the book out after a live
> > > migration.
> > > 

Right. To recap, this is because where I highlighted 'precisely¹'
above, our accuracy is actually limited to the synchronisation of the
wallclock time between the two hosts. If the guest thinks it has a more
accurate NTP sync than either of the hosts do, we may have introduced
an error which is more than the error bounds the guest thinks it has,
and that may ultimately lead to data corruption in some circumstances.

This part is somewhat orthogonal to the above discussion, isn't it?
Regardless of whether we 'step' the TSC or not, we need to signal the
guest to know that it needs to consider itself unsynchronized (or, if
we want to get really fancy, let it know the upper bounds of the error
we just introduced. But let's not).
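
For reference, what "consider itself unsynchronized" looks like from inside
the guest is the adjtimex(2) state you quoted above. A minimal read-only
query of it, assuming nothing beyond the standard userspace API, would be:

#include <stdio.h>
#include <sys/timex.h>

int main(void)
{
        struct timex tx = { .modes = 0 };       /* read-only query */
        int state = adjtimex(&tx);

        /*
         * Once the sync status has been dropped, STA_UNSYNC is set in
         * tx.status, maxerror/esterror go back to their worst-case
         * values, and adjtimex() returns TIME_ERROR.
         */
        printf("state=%d unsync=%d maxerror=%ld esterror=%ld\n",
               state, !!(tx.status & STA_UNSYNC), tx.maxerror, tx.esterror);
        return 0;
}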

> > > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > > relies on the guest polling a shared structure, and who knows when the
> > > guest is going to check the structure again? If we inject an interrupt
> > > the guest is likely to check this state in a reasonable amount of time.
> > 
> > Ah, but that's the point. A flag in shared memory can be checked
> > whenever the guest needs to know that it's operating on valid state.
> > Linux will check it *every* time from pvclock_clocksource_read().
> > 
> > As opposed to a separate interrupt which eventually gets processed some
> > indefinite amount of time in the future.
> 
> There are a few annoying things with pvclock, though. It is a per-vCPU
> structure, so special care must be taken to act exactly once on a
> migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
> clocksource if invariant TSC is exposed") the guest kernel could pick
> the TSC over the pvclock by default, so it's hard to say when the pvclock
> structure is checked again.

That commit appears to be assuming that the TSC *will* be "stepped", or
as I call it "continue to advance normally at its normal frequency over
elapsed time".

>  This is what I had in mind when suggesting a doorbell is needed, as
> there is no good way to know what clocksource the guest is using.

Yes, perhaps that might be necessary.

The concern is that by the time that doorbell interrupt has finally
been delivered and processed, an inaccurate timestamp could *already*
have been used on the wire in some application's coherency
protocol, and the guest's database could already be hosed.

But I don't see why we can't have both.  I think it makes sense for the
guest kernel to ditch its NTP sync when it sees PVCLOCK_GUEST_STOPPED,
but I'm not entirely averse to the existence of a doorbell mechanism
such as you describe.
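
Roughly, the shape of the check I have in mind on the pvclock read path is
below. This is a sketch, not a patch; the types are stubbed out and
ntp_clear_sync_status() is a hypothetical helper standing in for whatever
actually resets the NTP state:

#include <stdint.h>

/* Flag defined by the pvclock ABI; everything else here is made up. */
#define PVCLOCK_GUEST_STOPPED   (1 << 1)

struct pvclock_info {
        uint8_t flags;
        /* ... version, tsc_timestamp, system_time, etc. ... */
};

/* Hypothetical hook: set STA_UNSYNC and reset time_esterror/maxerror. */
static void ntp_clear_sync_status(void)
{
}

/*
 * Called from the clock read path, so the guest reacts the first time it
 * reads the clock after a migration rather than at some later interrupt.
 */
static void check_guest_stopped(struct pvclock_info *src)
{
        if (src->flags & PVCLOCK_GUEST_STOPPED) {
                src->flags &= (uint8_t)~PVCLOCK_GUEST_STOPPED;
                ntp_clear_sync_status();
        }
}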

> > I'll give you the assertion that migrations aren't completely
> > invisible, but I still think they should be *equivalent* to the vCPU
> > just not being scheduled for a moment.
> 
> I sure hope that migrations are fast enough that they are indistinguishable
> from scheduler pressure. I think the situations where that is not the
> case are particularly interesting. Defining a limit and having a
> mechanism for remedial action could make things more predictable for
> guest software.
> 
> But agree, and shame on us for the broken virtual hardware when that
> isn't the case :-)

Right. The "how to tell the guest that it needs to ditch its NTP sync
status" question needs some more thought but in the short term I think
it makes sense to add get/set TSC mechanisms which work like
KVM_[SG]ET_CLOCK with the KVM_CLOCK_REALTIME flag.

Instead of realtime though, it should use the KVM clock. So reading a
TSC returns a { kvm_ns, TSC value } tuple, and setting it will advance
the value by the appropriate amount just as we advance the KVM clock
for the KVM_CLOCK_REALTIME delta.
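
As a sketch of the shape I mean (purely illustrative; neither the structure
nor the helper below exists in KVM today, and the names are invented):

#include <stdint.h>

/* Invented names for illustration; not an existing KVM ABI. */
struct kvm_vcpu_tsc_pair {
        uint64_t kvm_ns;        /* KVM clock when the TSC was sampled */
        uint64_t tsc;           /* guest TSC value at that instant */
};

/*
 * On restore, advance the saved TSC by however much the KVM clock has
 * moved on, exactly analogous to what KVM_SET_CLOCK does with the
 * KVM_CLOCK_REALTIME delta.
 */
static uint64_t tsc_value_to_restore(const struct kvm_vcpu_tsc_pair *saved,
                                     uint64_t kvm_ns_now,
                                     uint64_t guest_tsc_khz)
{
        uint64_t delta_ns = kvm_ns_now - saved->kvm_ns;

        return saved->tsc +
               (uint64_t)(((unsigned __int128)delta_ns * guest_tsc_khz) /
                          1000000);
}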

