Re: [QEMU PATCH] kvmclock: advance clock by time window between vm_stop and pre_save

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Fri, 4 Nov 2016 16:29:48 -0200

On Fri, Nov 04, 2016 at 06:34:20PM +0100, Radim Krčmář wrote:
> 2016-11-04 14:24-0200, Marcelo Tosatti:
> > On Fri, Nov 04, 2016 at 04:48:28PM +0100, Radim Krčmář wrote:
> >> 2016-11-04 16:33+0100, Paolo Bonzini:
> >> > On 04/11/2016 16:25, Radim Krčmář wrote:
> >> >>> >  
> >> >>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
> >> >>> > +            s->clock += s->advance_clock;
> >> >>> > +            s->advance_clock = 0;
> >> >>> > +        }
> >> >> Can't the advance_clock added to the migrated KVMClockState instead of
> >> >> passing it as another parameter?
> >> >> 
> >> >> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
> >> >>  because of the Linux bug.)
> >> > 
> >> > What Linux bug?  The one that makes us use kvmclock_current_nsec?
> >> 
> >> No, the one that forced Marcelo to add the 10 minute limit to the
> >> advance_clock. 
> > 
> > Overflow when reading clocksource, in the timer interrupt.
> 
> Is it still there?  I wonder why 10 minutes is the limit. :)

Second question: I don't know, i guess it started as a 
"lets clamp this to values that make sense to catch
any potential offenders" but now i think it makes 
sense to say:

"because it does not make sense to search for the smaller
overflow of Linux guests and use that".

So 10minutes works for me using the sentence:
"migration should not take longer than 10 minutes".

You prefer to drop that 10 minutes check?

First question:

I suppose so: disable CLOCK_MONOTONIC counting on pause, 
pause your VM for two days, and resume.

timekeeping_resume

        cycle_now = tk->tkr_mono.read(clock);
        if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
                cycle_now > tk->tkr_mono.cycle_last) {
                u64 num, max = ULLONG_MAX;
                u32 mult = clock->mult;
                u32 shift = clock->shift;
                s64 nsec = 0;

                cycle_delta = clocksource_delta(cycle_now,
tk->tkr_mono.cycle_last,
                                                tk->tkr_mono.mask);

                /*
                 * "cycle_delta * mutl" may cause 64 bits overflow, if
                 * the
                 * suspended time is too long. In that case we need do
                 * the
                 * 64 bits math carefully
                 */
                do_div(max, mult);
                if (cycle_delta > max) {
                        num = div64_u64(cycle_delta, max);
                        nsec = (((u64) max * mult) >> shift) * num;
                        cycle_delta -= num * max;
                }
                nsec += ((u64) cycle_delta * mult) >> shift;

timekeeping_forward_now
/**
 * timekeeping_forward_now - update clock to the current time
 *
 * Forward the current clock to update its state since the last call to
 * update_wall_time(). This is useful before significant clock changes,
 * as it avoids having to deal with this time offset explicitly.
 */
static void timekeeping_forward_now(struct timekeeper *tk)
{
        struct clocksource *clock = tk->tkr_mono.clock;
        cycle_t cycle_now, delta;
        s64 nsec;

        cycle_now = tk->tkr_mono.read(clock);
        delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last,
tk->tkr_mono.mask);
        tk->tkr_mono.cycle_last = cycle_now;
        tk->tkr_raw.cycle_last  = cycle_now;

        tk->tkr_mono.xtime_nsec += delta * tk->tkr_mono.mult;

For example.

> >>  We wouldn't need this advance_clock hack if we could
> >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> >> clock should count only if vm is running").
> > 
> > People have requested that CLOCK_MONOTONIC stops when 
> > vm suspends.
> 
> Ugh, we could have done that on the OS level, 

Done it how at the OS level ?

> but stopping kvmclock
> means that we sabotaged CLOCK_BOOTTIME as it doesn't return
> CLOCK_MONOTONIC + time spent in suspend anymore.

It does:
So this patchset introduces CLOCK_BOOTTIME, which is identical
to CLOCK_MONOTONIC, but includes any time spent in suspend.

"suspend" refers to suspend-to-RAM, not "virtual machine stopped".

CLOCK_BOOTTIME fails to work when you extend "suspend" 
to "suspend AND virtual machine pause or restoration".

(if you want that, can be fixed easily i supposed, but 
nobody ever asked for it).

> And kvmclock is defined to count nanosecond (slightly different in
> practice) since some point in time, so pausing it is not really valid.

Sorry, where is that defined? Nobody makes that assumption
(that kvmclock should be correlated to some physical time
and that it can't stop counting).

> >> > It should work with 4.9-rc (well, once Linus applies my pull request).
> >> > 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
> >> > the raw value from the kernel timekeeper.
> >> > 
> >> > I'm thinking that we should add a KVM capability for this, and skip
> >> > kvmclock_current_nsec if the capability is present.  The first part is
> >> > trivial, so we can do it even during Linux rc period.
> >> 
> >> I agree, that sounds like a nice improvement.
> > 
> > I fail to see how this (KVM_GET_CLOCK using raw value (vs NTP corrected)
> > relates to clocksource overflow and CLOCK_MONOTONIC stop requests.
> 
> If the guest uses master clock, then kvmclock drifts away from BOOTTIME,
> so returning the value of kvmclock using kernel_ns() would introduce a
> drift when setting the clock -- we must read the same way that the guest
> does.

Does not work:

    (periodic incremental reads from TSC, using the offset from
     previous TSC read, storing to scaling to a memory location)
    and reads interpolating with TSC (what kernel does)

        != (different values than)

    (TSC scaled to nanoseconds)

Its not obvious to me this should be the case, perhaps if done 
carefully can work.

(See the thread with Roman, this was determined empirically).

> > -----
> > 
> > Todo list (collective???):
> > 
> > 1) Implement suggestion to Roman: 
> > "Subject: Re: [PATCH] x86/kvm: fix condition to update kvm master
> > clocks" to handle the time backwards when updating masterclock 
> > from kernel clock.
> > 
> > 2) Review TSC scaling support (make sure scaled TSC 
> > is propagated properly everywhere).
> > 
> > 3) Enable migration with invariant TSC.
> > 
> > 4) Fullvirt fix for local APIC deadline timer bug
> 
> The list looks good.
> 
> I'll resume work on (4) now that rework of APIC timers is merged.
> It's going to be ugly on the guest side, because linux could be using it
> even when not using kvmclock for sched clock ...

>From previous discussion on the thread Eduardo started:

>> > Actually, a nicer fix would be to check the different
>> > frequencies and scale the deadline relative to the difference.
>>
>> You cannot know what exactly the guest was thinking when it set the
>> TSC
>> deadline.  Perhaps it wanted an interrupt when the TSC was exactly
>> 9876543210.

Yep, the spec just says that the timer fires when TSC >= deadline.

> You can't expect the correlation between TSC value and timer interrupt
> execution to be precise, because of the delay between HW timer
> expiration and interrupt execution.

Yes, this is valid.

> So you have to live with the fact that the TSC deadline timer can be
> late (which is the same thing as with your paravirt solution, in case
> of migration to host with faster TSC freq) (which to me renders the
> argument above invalid).
>
> Solution for old guests and new guests:
> Just save how far ahead in the future the TSC deadline timer is
> supposed
> to expire on the source (in ns), in the destination save all registers
> (but disable TSC deadline timer injection), arm a timer in QEMU
> for expiration time, reenable TSC deadline timer reinjection.

The interrupt can also fire early after migrating to a TSC with lower
frequency, which violates the definition of TSC deadline timer when an
OS could even RDTSC a value lower than the deadline in the interrupt
handler.  (An OS doesn't need to expect that.)

We already have vcpu->arch.virtual_tsc_khz that ought to remain constant
for a lifetime of the guest and KVM uses it to convert TSC delta into
correct nanoseconds.

The main problem is that QEMU changes virtual_tsc_khz when migrating
without hardware scaling, so KVM is forced to get nanoseconds wrong ...

If QEMU doesn't want to keep the TSC frequency constant, then it would
be better if it didn't expose TSC in CPUID -- guest would just use
kvmclock without being tempted by direct TSC accesses.

> > (BTW Paolo Windows is also vulnerable right).
> 
> Ugh, we can't do much more than disabling TSC deadline there until QEMU
> can migrate it.

So the idea was to convert to nanoseconds, count a timer
in nanoseconds, and when it expires inject immediatelly.
(Yes its ugly).

(btw i suppose you can talk to destination via the command interface 
on the postcopy patches now), say to get the destination tsc frequency.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html