Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

Radim Krčmář <rkrcmar@xxxxxxxxxx> · Mon, 21 Sep 2015 22:00:27 +0200

2015-09-21 12:52-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
>> 2015-09-20 19:57-0300, Marcelo Tosatti:
>>> Is it counting from zero that breaks SLES10?
>> 
>> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> while still keeping system time;  we used to allow that, which means an
>> ABI breakage.  (And we can't even say that guest's behaviour is against
>> the spec ...)
> 
> Because this behaviour was not defined.

It is defined by implementation.

> Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> boot_vcpu_runs_old_kvmclock == false? 
> The patch would be much simpler.

If you mean the hunk in cover letter, I don't like it because we presume
that no other guests were broken.

I really don't like it so I thought about other problems with
PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
It doesn't work well ;)

We don't want to guess what the guest wants so I'd go for the opt-in
paravirt feature unless counting from zero can be done in guest alone.

> The problem is, "selecting one read as the initial point" is inherently
> racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> but must be applied to all vcpus.

I don't think that is a problem.

Kvmclock has a notion of a global system_time in nanoseconds (one value
that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
to propagate system_time into guest VCPUs as precisely as possible with
the help of TSC.

sched_clock uses kvmclock to get nanoseconds since the system was
brought up and [1/2] only works with this abstracted ns count.
A poorly synchronized initial read is irrelevant because all VCPUs will
be using the same constant offset.
(We can never know the precise time anyway.)

> Besides:
> 
> 	1) Stable sched clock in guest does not depend on
> 	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.

Yes, thanks, I will remove that requirement in v1.  (We'd need to
improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)

Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
there is now one possible unsigned overflow: in case the clock was very
imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
time of initialization.  It's very unlikely that we'll ever reach legal
overflow so I can add a condition there.

> 	2) You rely on monotonicity across vcpus to perform 
> 	   the 'minus delta that was read on vcpu0' calculation, but 
> 	   monotonicity across vcpus can fail during runtime
>            (say host clocksource goes tsc->hpet due to tsc instability).

That could be a problem, but I'm adding a VCPU independent constant to
all reads -- does the new code rely on monoticity in places where the
old one didn't?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html