On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote: > 2015-09-21 17:53-0300, Marcelo Tosatti: > > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote: > >> 2015-09-21 12:52-0300, Marcelo Tosatti: > >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote: > >> >> 2015-09-20 19:57-0300, Marcelo Tosatti: > >> >>> Is it counting from zero that breaks SLES10? > >> >> > >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did. > >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to, > >> >> while still keeping system time; we used to allow that, which means an > >> >> ABI breakage. (And we can't even say that guest's behaviour is against > >> >> the spec ...) > >> > > >> > Because this behaviour was not defined. > >> > >> It is defined by implementation. > >> > >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on > >> > boot_vcpu_runs_old_kvmclock == false? > >> > The patch would be much simpler. > >> > >> If you mean the hunk in cover letter, I don't like it because we presume > >> that no other guests were broken. > > > > Yes patch 1. > > (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and > patch 2 = [2/2].) > > > Do you have an example of another guest that is broken? > > Yes, the hot-plug in any relatively recent Linux guest. > > > Really, assuming things about the value of the msr read when you > > write to the MSR is very awkward. That SuSE code is incorrect, it fails > > on other occasions as well (see > > 54750f2cf042c42b4223d67b1bd20138464bde0e). > > Yeah, I even read the SUSE implementation, sad times. > > >> I really don't like it > > > > Because it changes behaviour that was previously unspecified? > > No, because it adds another exemption to already ugly code. > (Talking about the hunk in [0/2].) > > >> so I thought about other problems with > >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0? > > > > Can't unplug VCPU 0 i think. > > You can. > > >> It doesn't work well ;) > > > > Can you hot-unplug vcpu 0? > > I can, I did, hot-plug bugged. Due to PVCLOCK_COUNTS_FROM_ZERO patch? > >> > The problem is, "selecting one read as the initial point" is inherently > >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu, > >> > but must be applied to all vcpus. > >> > >> I don't think that is a problem. > >> > >> Kvmclock has a notion of a global system_time in nanoseconds (one value > >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries > >> to propagate system_time into guest VCPUs as precisely as possible with > >> the help of TSC. > > > >Different pairs of values (system_time, tsc reads) are fundamentally > >broken. This is why > > > >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 > > x86, paravirt: Add a global synchronization point for pvclock > > > >Exists. > > > >Then to guarantee monotonicity you need to stop those updates > >(or perform the pair update in lock step): > > > >commit d828199e84447795c6669ff0e6c6d55eb9beeff6 > > KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag > > (Thanks for the commits, split values are the core design that avoids > modifying rdtsc, Which is broken (thats the point). > so I'll be grad to read its details.) > > >> > 2) You rely on monotonicity across vcpus to perform > >> > the 'minus delta that was read on vcpu0' calculation, but > >> > monotonicity across vcpus can fail during runtime > >> > (say host clocksource goes tsc->hpet due to tsc instability). > >> > >> That could be a problem, but I'm adding a VCPU independent constant to > >> all reads -- does the new code rely on monoticity in places where the > >> old one didn't? > > > > The problem is overflow: > > kvmclock non-monotonic across vcpus AND delta subtraction: > > > > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock > > 1 1 1 > > 2 2 2 1 > > KVM sched clock not used before this point (I even moved the code to > make sure), so there is no problem with monotonicity. > > > 3 3 0 2 -1 > ^^^ > -1 is the overflow. Very unlikely to ever happen in Linux, as we now > boot other VCPUs much later than the KVM clock configuration and it can > only happen if VCPU1 is running as the reconfiguration takes place, but > a simple > > if (vcpu[x].sched_clock <= global_sched_clock_offset) > return 0; > > would take care of it. The time would stand still for a while, which is > not a huge problem for boot-only scenario. Look at commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 x86, paravirt: Add a global synchronization point for pvclock And think of what an overflow does. Note: i tried your solution before (to add offset) and saw this issue in practice. > Other VCPUs couldn't be > reading KVM sched before it was configured, so only operations that > happen before vcpu1sched_clock goes positive are affected. > (We have other problems if the window can be long.) The point where vcpu1sched_clock is negative is after kvmclock is initialized. > > > 4 4 1 3 0 > > 5 5 2 4 1 > > 6 6 3 5 2 > > 7 7 4 6 3 > > > > ptime is "physical time". > > > > I prefer that the host counts from zero (there are less problems to > > handle). > > The main advantage of a hypervisor solution for me is one saved > subtraction and comparison on every sched_clock(). To me are such complications as handling overflows. > > Again, that SuSE patch is incorrect and a huge hack. > > I'm not disputing that. (Which doesn't justify the breakage.) > > > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest > > requests the feature. > > Yes, and the alternative needs new paravirt interface. So either: Proceed with guest solution: -> Make sure the overflow can't happen (and write down why not in the code). Don't assume a small delta between kvmclock values of vcpus. -> Handle stable -> non-stable kvmclock transition. -> kvmclock counts from zero should not depend on stable kvmclock (because nohz_full should work on hpet host systems). Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW: -> Figure out whats wrong with different kvmclock values on hotplug, and fix it. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html