Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Tue, 29 Mar 2022 16:19:19 +0200

On Tue, Mar 22 2022 at 21:53, Oliver Upton wrote:
> On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
>> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
>> globals kept by kernel/time/ntp.c (which are visible to userspace through
>> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
>> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
>
> I do not disagree that NTP needs to throw the book out after a live
> migration.
>
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.
>
> Thomas, we're talking about how to not wreck time (as bad) under
> virtualization. I know this has been an area of interest to you for a
> while ;-) The idea is that the hypervisor should let the guest know
> about time travel.

Finally. It only took 10+ years of ranting :)

> Let's just assume for now that the hypervisor will *not* quiesce
> the guest into an S2IDLE state before migration. I think quiesced
> migrations are a good thing to have in the toolbelt, but there will
> always be host-side problems that require us to migrate a VM off a host
> immediately with no time to inform the guest.

Quiesced migrations are the right thing on the principle of least
surprise.

> Given that, we're deciding which clock is going to get wrecked during
> a migration and what the guest can do afterwards to clean it up.
> Whichever clock gets wrecked is going to have a window where reads race
> with the eventual fix, and could be completely wrong. My thoughts:
>
> We do not advance the TSC during a migration and notify the
> guest (interrupt, shared structure) about how much it has
> time traveled (delta_REALTIME). REALTIME is wrong until the interrupt
> is handled in the guest, but should fire off all of the existing
> mechanisms for a clock step. Userspace gets notified with
> TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
> as a way to make live migration less sinister from the guest
> perspective.

That still is an information _after_ the fact and not well defined.

> It seems possible to block racing reads of REALTIME if we protect it with
> a migration sequence counter. Host raises the sequence after a migration
> when control is yielded back to the guest. The sequence is odd if an
> update is pending. Guest increments the sequence again after the
> interrupt handler accounts for time travel. That has the side effect of
> blocking all realtime clock reads until the interrupt is handled. But
> what are the value of those reads if we know they're wrong? There is
> also the implication that said shared memory interface gets mapped
> through to userspace for vDSO, haven't thought at all about those
> implications yet.

So you need two sequence counters to be checked similar to the
pv_clock() case, which prevents the usage of TSC without the pv_clock()
overhead.

That also means that CLOCK_REALTIME, CLOCK_TAI and CLOCK_REALTIME_COARSE
will need to become special cased in the VDSO and all realtime based
syscall interfaces.

I'm sure that will be well received from a performance perspective.

Aside of that where do you put the limit? Just user space visible read
outs? What about the kernel? Ignore it in the kernel and hope that
nothing will break? That's going to create really hard to diagnose
issues, which I'm absolutely not interested in.

You can't expand this scheme to the kernel in general because the CPU
which should handle the update might be in an interrupt disabled region
reading clock REALTIME....

Also this would just make the race window smaller. It does not solve the
problem in a consistent way. Don't even think about it. It's just
another layer of duct tape over the existing ones.

As I explained before, this does not solve the following:

   T = now();
   -----------> interruption of some sort (Tout)
   use(T);

Any use case which cares about Tout being less than a certain threshold
has to be carefully designed to handle this. See also below.

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.

A migration _should_ have suspend semantics, but the forced suspend
which is done by migration today does not have proper defined semantics
at all.

Also clock monotonic can be stepped forward under certain circumstances
and the kernel is very well able to handle it within well defined
limits. Think about scheduled out vCPUs. From their perspective clock
monotonic is stepping forwards.

The problem starts when the 'force suspended' time becomes excessive, as
that causes the mass expiry of clock monotonic timers with all the nasty
side effects described in a3ed0e4393d6. In the worst case it's going to
exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
@2GHz) which in fact breaks the world and some more.

So let me go back to the use cases:

 1) Regular freeze of a VM to disk and resume at some arbitrary point in
    the future.

 2) Live migration

In both cases the proper solution is to make the guest go into a well
defined state (suspend) and resume it on restore. Everything just works
because it is well defined.

Though you say that there is a special case (not that I believe it):

> but there will always be host-side problems that require us to migrate
> a VM off a host immediately with no time to inform the guest.

which is basically what we do today for the very wrong reasons. There you
have two situations:

  1) Trestart - Tstop < TOLERABLE_THRESHOLD

     That's the easy case as you just can adjust TSC on restore by that
     amount on all vCPUs and be done with it. Just works like scheduling
     out all vCPUs for some time.

  2) Trestart - Tstop >= TOLERABLE_THRESHOLD

     Avoid adjusting TSC for the reasons above. That leaves you with the
     options of

     A) Make NTP unhappy as of today

     B) Provide information about the fact that the vCPUs got scheduled
        out for a long time and teach NTP to be smart about it.

        Right now NTP decides to declare itself unstable when it
        observes the time jump on CLOCK_REALTIME from the NTP server.

        That's exactly the situation described above:

        T = now();
        -----------> interruption of some sort (Tout)
        use(T);

        NTP observes that T is inconsistent because it does not know
        about Tout due to the fact that neither clock MONOTONIC nor
        clock BOOTTIME advanced.

        So here is where you can bring PV information into play by
        providing a migration sequence number in PV shared memory.

        On source host:
           Stop the guest dead in it's tracks.
           Record metadata:
             - migration sequence number
             - clock TAI as observed on the host
           Transfer the image along with metadata

        On destination host:
           Restore memory image
           Expose metadata in PV:
             - migration sequence number + 1
             - Tout (dest/source host delta of clock TAI)
           Run guest

        Guest kernel:

           - Keep track of the PV migration sequence number.

             If it changed act accordingly by injecting the TAI delta,
             which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
             etc...

             We have to do that in the kernel because we cannot rely on
             NTP being running.

             That can be an explicit IPI injected from the host or just
             polling from the tick. It doesn't matter much.

           - Expose information to user space:
               - Migration sequence counter

           - Add some smarts to adjtimex() to prevent user space racing
             against a pending migration update in the kernel.

        NTP:
           - utilize the sequence counter information 

             seq = get_migration_sequence();

             do stuff();

             if (seq != get_migration_sequence())
             	do_something_smart();
             else
                proceed_as_usual();

        That still will leave everything else exposed to
        CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
        do about that and any application which cares about this has to
        be able to deal with it anyway.

Hmm?

Thanks,

        tglx