On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> >> RFC because I haven't explored many potential problems or tested it.
> >
> > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> > haven't explored potential problems or tested it? Sorry can't parse it.
> >
> >>
> >> [1/2] uses a different algorithm in the guest to start counting from 0.
> >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> >>
> >> A viable alternative would be to implement opt-in features in kvm clock.
> >>
> >> And because we probably only broke one old user (the infamous SLES 10), a
> >> workaround like this is also possible: (but I'd rather not do that)
> >
> > Please describe why SLES 10 breaks in detail: the state of the guest and
> > the host before the patch, the state of the guest and host after the
> > patch.
>
> 1) The guest periodically receives an interrupt that is handled by
>    main_timer_handler():
>    a) get time using the kvm clock:
>       1) write the address to MSR_KVM_SYSTEM_TIME
>       2) read tsc and pvclock (tsc_offset, system_time)
>       3) time = tsc - tsc_offset + system_time
>    b) compute time since the last main_timer_handler()
>    c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
>    a) read tsc
>    b) loop till jiffies increase
>    c) compute lpj
>
> Because (1a1) always resets the system_time to 0, we read the same value
> over and over so the condition for (1c) is never true and jiffies remain
> constant.  This is the problem.  A hang happens in (2b) as it is the
> first place that depends on jiffies.
>
> > What does SLES10 expect?
>
> That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.
>
> > Is it counting from zero that breaks SLES10?
>
> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> while still keeping system time; we used to allow that, which means an
> ABI breakage.  (And we can't even say that guest's behaviour is against
> the spec ...)

Because this behaviour was not defined.

Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
boot_vcpu_runs_old_kvmclock == false?
The patch would be much simpler.

The problem is, "selecting one read as the initial point" is inherently
racy: that delta is relative to one moment (kvmclock read) at one vcpu,
but must be applied to all vcpus.

Besides:

1) Stable sched clock in guest does not depend on
   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
2) You rely on monotonicity across vcpus to perform
   the 'minus delta that was read on vcpu0' calculation, but
   monotonicity across vcpus can fail during runtime
   (say host clocksource goes tsc->hpet due to tsc instability).

>
>
> ---
> 1: I also did disassembly, but the reproducer is easier to paste
>    (couldn't find debuginfo)
>    # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
>       -serial stdio -monitor /dev/null -append 'console=ttyS0'
>
>    and you can get a bit further when setting loops per jiffy manually,
>       -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'
>
>    The dmesg for failing run is
>      Initializing CPU#0
>      PID hash table entries: 512 (order: 9, 16384 bytes)
>      kvm-clock: cpu 0, msr 0:3f6041, boot clock
>      kvm_get_tsc_khz: cpu 0, msr 0:e001
>      time.c: Using tsc for timekeeping HZ 250
>      time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
>      time.c: Detected 2591.580 MHz processor.
>      Console: colour VGA+ 80x25
>      Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
>      Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
>      Checking aperture...
>      Nosave address range: 000000000009f000 - 00000000000a0000
>      Nosave address range: 00000000000a0000 - 00000000000f0000
>      Nosave address range: 00000000000f0000 - 0000000000100000
>      Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
>      [Infinitely querying kvm clock here ...]
>
>    With '-cpu kvm64,-kvmclock', the next line is
>      Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)
>
>    With 'lpj=10399519',
>      Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
>      [Manages to get stuck later, in default_idle.]
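
For illustration only, the guest-side sequence (1a)-(1c) quoted above can be
modelled with a small Python toy. This is a sketch under stated assumptions,
not kernel or hypervisor code: the Host class, run_guest(), the 1 tick per ns
TSC rate and the 100-tick read latency are all hypothetical names and values
chosen for the example. Only HZ=250 is taken from the quoted dmesg.

```python
# Toy model of steps (1a)-(1c); every name and constant below is an
# illustrative assumption, not an actual KVM or SLES 10 interface.

NS_PER_JIFFY = 4_000_000  # HZ=250, as reported in the quoted dmesg
TSC_PER_NS = 1            # assumed TSC rate, chosen to keep the arithmetic simple


class Host:
    """Hypothetical host side: publishes (tsc_offset, system_time) on MSR write."""

    def __init__(self, reset_on_write):
        self.reset_on_write = reset_on_write  # True models the one-shot-initializer behaviour
        self.tsc = 0
        self.clock_start = None  # tsc value that corresponds to system_time == 0
        self.tsc_offset = 0
        self.system_time = 0

    def advance(self, ticks):
        self.tsc += ticks

    def write_msr_kvm_system_time(self):
        # With reset_on_write, every write restarts the clock from zero.
        if self.clock_start is None or self.reset_on_write:
            self.clock_start = self.tsc
        self.tsc_offset = self.tsc
        self.system_time = (self.tsc - self.clock_start) // TSC_PER_NS


def run_guest(host, interrupts):
    """Run the simplified main_timer_handler() logic; return the final jiffies."""
    jiffies = 0
    last_time = 0
    for _ in range(interrupts):
        host.advance(NS_PER_JIFFY * TSC_PER_NS)  # wait for the next timer interrupt
        host.write_msr_kvm_system_time()          # (1a1)
        host.advance(100)                         # assumed latency before the read
        # (1a2) + (1a3): time = tsc - tsc_offset + system_time
        time = (host.tsc - host.tsc_offset) // TSC_PER_NS + host.system_time
        elapsed = time - last_time                # (1b)
        while elapsed >= NS_PER_JIFFY:            # (1c)
            jiffies += 1
            last_time += NS_PER_JIFFY
            elapsed -= NS_PER_JIFFY
    return jiffies
```

In this model, run_guest(Host(reset_on_write=True), 10) leaves jiffies at 0,
because every read returns the same small value, so a loop like (2b) would
never terminate. run_guest(Host(reset_on_write=False), 10) returns 9.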