Re: [PATCH] x86/kvm: fix condition to update kvm master clocks

On Thu, Jun 09, 2016 at 03:09:46PM +0300, Roman Kagan wrote:
> On Thu, Jun 09, 2016 at 12:27:10AM -0300, Marcelo Tosatti wrote:
> > Ok, it's not feasible to keep both REF_CLOCK (MSR) and shared memory
> > (REF_PAGE) clocks in sync. Even if the frequency correction is the same
> > for both, the kernel updates monotonic clock differently than the
> > stated frequency that is:
> > 
> >     monotonic clock (advertised via vsyscall notifier to use mult/freq pair) != tsc*freq 
> > 
> > So the best solution IMO is to: 
> > 
> >     reads of guest clock = max(shared memory clock, get_kernel_ns() +
> >                                kvmclock_offset)
> > 
> > Where reads of guest clock include: 1) kvm_get_time_and_clockread
> > (updates to kvmclock areas), 2) rdmsr(REF_CLOCK).
> > 
> > Unless someone has a better idea, Roman, can you update your patch to
> > include such a solution? For kvm_get_time_and_clockread, we can fast
> > forward kvmclock_offset so that
> > 
> > kvmclock_offset + get_kernel_ns() = shared memory clock
> 
> I'm not sure I understand what you mean.
> 
> ->system_time in pvclock *is* assigned kernel_ns + kvmclock_offset, i.e.
> at the time kvm_get_time_and_clockread() runs they are in sync by
> definition.  They can diverge later due to different whole number math
> applied.

Sync so that kvmclock_offset + get_kernel_ns() equals the shared memory
clock (what the guest reads).


Add a wrapper around get_kernel_ns() + kvmclock_offset reads that
records the last value returned (last_returned_guestclock):

u64 read_guest_clock(struct kvm *kvm)
{
        u64 ret;

        mutex_lock(&kvm->arch.guest_clock_mutex);
        ret = get_kernel_ns() + kvm->arch.kvmclock_offset;
        kvm->arch.last_returned_guestclock = ret;
        mutex_unlock(&kvm->arch.guest_clock_mutex);

        return ret;
}
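
(Both readers of the guest clock listed above, kvm_get_time_and_clockread()
and rdmsr(REF_CLOCK), would then go through this wrapper rather than
computing get_kernel_ns() + kvmclock_offset directly.)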


Sync code (to be executed at masterclock updates and rdmsr(REF_CLOCK)):

1. read guest shared memory = smval.
2. read guest clock = get_kernel_ns() + kvmclock_offset = kclock
3. if (kclock < smval) 
        kvmclock_offset += smval - kclock
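
Something like this, as a rough sketch (read_pvclock() is a placeholder
for "read guest shared memory"; the rest follows the existing KVM code):

static void kvmclock_fast_forward(struct kvm *kvm)
{
        /* 1. read guest shared memory */
        u64 smval = read_pvclock(kvm);
        /* 2. read guest clock */
        u64 kclock = get_kernel_ns() + kvm->arch.kvmclock_offset;

        /* 3. fast forward kvmclock_offset if the kernel clock is behind */
        if (kclock < smval)
                kvm->arch.kvmclock_offset += smval - kclock;
}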


Two possibilities for the clock state:

P1. shared memory clock > get_kernel_ns() + kvmclock_offset
P2. shared memory clock < get_kernel_ns() + kvmclock_offset

Two possibilities for guest behaviour:
G1. a = shared memory clock read;
    b = get_kernel_ns() + kvmclock_offset

G1/P1:

    a = shared memory clock read (previous read, stored in memory)
    b = get_kernel_ns() + kvmclock_offset

After the sync code above, b >= smval, and smval >= a (a is an earlier
read of the same clock), so b >= a, fine.

G1/P2:

    a = shared memory clock read (previous read, stored in memory)
    b = get_kernel_ns() + kvmclock_offset

a < b, fine.

G2 (second possibility for guest behaviour):
    a = get_kernel_ns() + kvmclock_offset
    b = shared memory clock read

G2/P1: fine, b > a.

G2/P2: 
    a = get_kernel_ns() + kvmclock_offset >
    b = shared memory clock read

So we have to either reduce get_kernel_ns() + kvmclock_offset so that
b is larger, or 'stop' get_kernel_ns() + kvmclock_offset until the
shared memory clock catches up.

We can make get_kernel_ns() + kvmclock_offset go back at most to
last_returned_guestclock (otherwise users of get_kernel_ns() +
kvmclock_offset could see time go backwards).

    1. mutex_lock(&guest_clock_mutex);
    2. getkernelns = get_kernel_ns();
    3. read guest shared memory = smval
    4. kclock = getkernelns + kvmclock_offset
    5. if (kclock > smval)
        kvmclock_offset -= min(kclock - last_returned_guestclock,
                               kclock - smval)
    6. kclock = getkernelns + kvmclock_offset
    7. if (kclock > smval) {
        schedule_timeout(kclock - smval);
        kvmclock_offset -= min(kclock - last_returned_guestclock,
                               kclock - smval)
       }
    8. mutex_unlock(&guest_clock_mutex);
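
Rough C sketch of the above (same placeholder read_pvclock(); readers
of the guest clock block on guest_clock_mutex while we stall):

static void kvmclock_sync(struct kvm *kvm)
{
        u64 smval, kclock, delta;

        mutex_lock(&kvm->arch.guest_clock_mutex);

        smval  = read_pvclock(kvm);
        kclock = get_kernel_ns() + kvm->arch.kvmclock_offset;

        if (kclock > smval) {
                /* roll back, but never past the last value handed out */
                delta = min(kclock - kvm->arch.last_returned_guestclock,
                            kclock - smval);
                kvm->arch.kvmclock_offset -= delta;
        }

        kclock = get_kernel_ns() + kvm->arch.kvmclock_offset;
        if (kclock > smval) {
                /* still ahead: stall (readers block on the mutex) until
                 * real time catches up, then retry the rollback */
                schedule_timeout_uninterruptible(nsecs_to_jiffies(kclock - smval));
                kclock = get_kernel_ns() + kvm->arch.kvmclock_offset;
                delta = min(kclock - kvm->arch.last_returned_guestclock,
                            kclock - smval);
                kvm->arch.kvmclock_offset -= delta;
        }

        mutex_unlock(&kvm->arch.guest_clock_mutex);
}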

That works, right?

I'll code that patch next week if you don't beat me to it.


> There's also a problem that there can be arbitrary amount of time
> between assigning the return value for guest's rdmsr and actually
> entering the guest to deliver that value.

Don't think that matters, see the 4 cases above.

> In general I'm starting to feel the shared memory clock is trying to
> provide stronger guarantees than really useful.  E.g. there's no such
> thing as synchronous TSC between vCPUs in a virtual machine, so every
> guest assuming it is broken; 

There is: the TSC is monotonic across pCPUs:

    pCPU1          |       pCPU2

1.  a = read tsc
2.                        b = read tsc

If 2. happens after 1., then b >= a.
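
Illustration in userspace C11 (not kernel code), ordering the two
reads with an atomic handoff plus a fence:

#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h>

static _Atomic uint64_t first_read;

/* runs on pCPU1 */
static void read_on_cpu1(void)
{
        uint64_t a = __rdtsc();
        atomic_store(&first_read, a);   /* publish "1. happened" */
}

/* runs on pCPU2 */
static void read_on_cpu2(void)
{
        uint64_t a, b;

        while ((a = atomic_load(&first_read)) == 0)
                ;                       /* wait until 1. is visible */
        _mm_lfence();                   /* keep rdtsc from executing early */
        b = __rdtsc();

        /* with synchronized (invariant) host TSCs, this holds */
        assert(b >= a);
}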

> in reality that means that every sane guest
> must tolerate certain violations of monotonicity when multiple CPUs are
> used for timekeeping.  I wonder if this consideration can allow for some
> simplification of the paravirtual clock code...

I think applications can fail if they observe time going backwards.
