2015-09-21 12:52-0300, Marcelo Tosatti: > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote: >> 2015-09-20 19:57-0300, Marcelo Tosatti: >>> Is it counting from zero that breaks SLES10? >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did. >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to, >> while still keeping system time; we used to allow that, which means an >> ABI breakage. (And we can't even say that guest's behaviour is against >> the spec ...) > > Because this behaviour was not defined. It is defined by implementation. > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on > boot_vcpu_runs_old_kvmclock == false? > The patch would be much simpler. If you mean the hunk in cover letter, I don't like it because we presume that no other guests were broken. I really don't like it so I thought about other problems with PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0? It doesn't work well ;) We don't want to guess what the guest wants so I'd go for the opt-in paravirt feature unless counting from zero can be done in guest alone. > The problem is, "selecting one read as the initial point" is inherently > racy: that delta is relative to one moment (kvmclock read) at one vcpu, > but must be applied to all vcpus. I don't think that is a problem. Kvmclock has a notion of a global system_time in nanoseconds (one value that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries to propagate system_time into guest VCPUs as precisely as possible with the help of TSC. sched_clock uses kvmclock to get nanoseconds since the system was brought up and [1/2] only works with this abstracted ns count. A poorly synchronized initial read is irrelevant because all VCPUs will be using the same constant offset. (We can never know the precise time anyway.) > Besides: > > 1) Stable sched clock in guest does not depend on > KVM_FEATURE_CLOCKSOURCE_STABLE_BIT. Yes, thanks, I will remove that requirement in v1. (We'd need to improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.) Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away, there is now one possible unsigned overflow: in case the clock was very imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the time of initialization. It's very unlikely that we'll ever reach legal overflow so I can add a condition there. > 2) You rely on monotonicity across vcpus to perform > the 'minus delta that was read on vcpu0' calculation, but > monotonicity across vcpus can fail during runtime > (say host clocksource goes tsc->hpet due to tsc instability). That could be a problem, but I'm adding a VCPU independent constant to all reads -- does the new code rely on monoticity in places where the old one didn't? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html