Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

Oliver Upton <oupton@xxxxxxxxxx> · Tue, 22 Mar 2022 21:53:03 +0000

Hi Daniel,

As we're venturing into the realm of timekeeping and he has noted
the evils of virtualization several times, I've CC'ed Thomas on this
mail. Some comments inline on the ongoing discussion.

On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
> On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@xxxxxxxxxx> wrote:
>  >  Right, but I'd argue that interface has some problems too. It
>  >   depends on the guest polling instead of an interrupt from the
>  >   hypervisor. It also has no way of informing the kernel exactly how much
>  >   time has elapsed.
> 
>  >   The whole point of all these hacks that we've done internally is that we,
>  >   the hypervisor, know full well how much real time hasv advanced during the
>  >   VM blackout. If we can at least let the guest know how much to fudge real
>  >   time, it can then poke NTP for better refinement. I worry about using NTP
>  >   as the sole source of truth for such a mechanism, since you'll need to go
>  >   out to the network and any reads until the response comes back are hosed.
> 
> (I'm a kernel newbie, so please excuse any ignorance with respect to kernel
> Internals or kernel/hypervisor interfaces.)

Welcome :-)

> We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
> so as to keep the guest kernel's idea of real time as accurate as possible 
> without any awareness required on the guest's side. *Also* give the guest kernel
> a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
> and let the kernel propagate this notification to userspace so that the NTP
> daemon can recombobulate itself as quickly as possible, treating whatever TSC
> adjustment was received as best-effort only.

But what happens to CLOCK_MONOTONIC in this case? We are still accepting
the fact that live migrations destroy CLOCK_MONOTONIC if we directly
advance the guest TSCs to account for elapsed time. The definition of
CLOCK_MONOTONIC is that the clock does not count while the system is
suspended. From the viewpoint of the guest, a live migration appears to
be a forced suspend operation at an arbitrary instruction boundary.
There is no realistic way for the guest to give the illusion that
MONOTONIC has stopped without help from the hypervisor.

> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> globals kept by kernel/time/ntp.c (which are visible to userspace through
> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.

I do not disagree that NTP needs to throw the book out after a live
migration.

But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
relies on the guest polling a shared structure, and who knows when the
guest is going to check the structure again? If we inject an interrupt
the guest is likely to check this state in a reasonable amount of time.

Thomas, we're talking about how to not wreck time (as bad) under
virtualization. I know this has been an area of interest to you for a
while ;-) The idea is that the hypervisor should let the guest know
about time travel.

Let's just assume for now that the hypervisor will *not* quiesce
the guest into an S2IDLE state before migration. I think quiesced
migrations are a good thing to have in the toolbelt, but there will
always be host-side problems that require us to migrate a VM off a host
immediately with no time to inform the guest.

Given that, we're deciding which clock is going to get wrecked during
a migration and what the guest can do afterwards to clean it up.
Whichever clock gets wrecked is going to have a window where reads race
with the eventual fix, and could be completely wrong. My thoughts:

We do not advance the TSC during a migration and notify the
guest (interrupt, shared structure) about how much it has
time traveled (delta_REALTIME). REALTIME is wrong until the interrupt
is handled in the guest, but should fire off all of the existing
mechanisms for a clock step. Userspace gets notified with
TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
as a way to make live migration less sinister from the guest
perspective.

It seems possible to block racing reads of REALTIME if we protect it with
a migration sequence counter. Host raises the sequence after a migration
when control is yielded back to the guest. The sequence is odd if an
update is pending. Guest increments the sequence again after the
interrupt handler accounts for time travel. That has the side effect of
blocking all realtime clock reads until the interrupt is handled. But
what are the value of those reads if we know they're wrong? There is
also the implication that said shared memory interface gets mapped
through to userspace for vDSO, haven't thought at all about those
implications yet.

Doing this the other way around (advance the TSC, tell the guest to fix
MONOTONIC) is fundamentally wrong, as it violates two invariants of the
monotonic clock. Monotonic counts during a migration, which really is a
forced suspend. Additionally, you cannot step the monotonic clock.

Thoughts?

Sorry to revisit this conversation yet again. Virtualization isn't going
away any time soon and the illusion that migrations are invisible to the
guest is simply not true.

--
Thanks,
Oliver