2015-09-20 19:57-0300, Marcelo Tosatti:
> On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
>> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
>> RFC because I haven't explored many potential problems or tested it.
>
> The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> haven't explored potential problems or tested it? Sorry can't parse it.
>
>>
>> [1/2] uses a different algorithm in the guest to start counting from 0.
>> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
>>
>> A viable alternative would be to implement opt-in features in kvm clock.
>>
>> And because we probably only broke one old user (the infamous SLES 10), a
>> workaround like this is also possible: (but I'd rather not do that)
>
> Please describe why SLES 10 breaks in detail: the state of the guest and
> the host before the patch, the state of the guest and host after the
> patch.

1) The guest periodically receives an interrupt that is handled by
   main_timer_handler():
   a) get time using the kvm clock:
      1) write the address to MSR_KVM_SYSTEM_TIME
      2) read tsc and pvclock (tsc_offset, system_time)
      3) time = tsc - tsc_offset + system_time
   b) compute time since the last main_timer_handler()
   c) bump jiffies if enough time has elapsed
2) the guest wants to calibrate loops per jiffy [1]:
   a) read tsc
   b) loop till jiffies increase
   c) compute lpj

Because (1a1) always resets the system_time to 0, we read the same value
over and over so the condition for (1c) is never true and jiffies remain
constant.  This is the problem.  A hang happens in (2b) as it is the
first place that depends on jiffies.

> What does SLES10 expect?

That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.

> Is it counting from zero that breaks SLES10?

Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
while still keeping system time; we used to allow that, which means an
ABI breakage.  (And we can't even say that guest's behaviour is against
the spec ...)

---
1: I also did disassembly, but the reproducer is easier to paste
   (couldn't find debuginfo)

   # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
      -serial stdio -monitor /dev/null -append 'console=ttyS0'

   and you can get a bit further when setting loops per jiffy manually,

      -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'

   The dmesg for the failing run is

     Initializing CPU#0
     PID hash table entries: 512 (order: 9, 16384 bytes)
     kvm-clock: cpu 0, msr 0:3f6041, boot clock
     kvm_get_tsc_khz: cpu 0, msr 0:e001
     time.c: Using tsc for timekeeping HZ 250
     time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
     time.c: Detected 2591.580 MHz processor.
     Console: colour VGA+ 80x25
     Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
     Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
     Checking aperture...
     Nosave address range: 000000000009f000 - 00000000000a0000
     Nosave address range: 00000000000a0000 - 00000000000f0000
     Nosave address range: 00000000000f0000 - 0000000000100000
     Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
     [Infinitely querying kvm clock here ...]

   With '-cpu kvm64,-kvmclock', the next line is

     Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)

   With 'lpj=10399519',

     Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
     [Manages to get stuck later, in default_idle.]
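
For illustration only, the failure in steps (1) and (2) can be sketched as a toy model. This is not kernel or KVM code: `Host`, `run_guest` and `resets_on_msr_write` are hypothetical names. The model folds (1a2) and (1a3) into a single clock read, and assumes the guest reads the clock immediately after the MSR write, so a resetting host always reports 0. The tick length follows the HZ 250 shown in the dmesg.

```python
from dataclasses import dataclass

NSEC_PER_JIFFY = 4_000_000  # HZ=250 -> 4 ms per jiffy


@dataclass
class Host:
    """Toy hypervisor clock (hypothetical, for illustration only)."""
    resets_on_msr_write: bool
    now_ns: int = 0   # host monotonic time
    base_ns: int = 0  # host time at which the guest's system_time was 0

    def advance(self, ns: int) -> None:
        self.now_ns += ns

    def write_msr_system_time(self) -> None:
        # Step (1a1). A host that treats the write as a one-shot
        # initializer restarts the guest clock from zero every time.
        if self.resets_on_msr_write:
            self.base_ns = self.now_ns

    def read_clock(self) -> int:
        # Steps (1a2) and (1a3), collapsed into one value.
        return self.now_ns - self.base_ns


def run_guest(host: Host, ticks: int, tick_ns: int = NSEC_PER_JIFFY) -> int:
    """Run `ticks` timer interrupts and return the resulting jiffies."""
    jiffies = 0
    last_ns = 0
    for _ in range(ticks):
        host.advance(tick_ns)          # time passes until the next interrupt
        host.write_msr_system_time()   # (1a1)
        now = host.read_clock()        # (1a2), (1a3)
        elapsed = now - last_ns        # (1b)
        while elapsed >= tick_ns:      # (1c)
            jiffies += 1
            last_ns += tick_ns
            elapsed -= tick_ns
    return jiffies


if __name__ == "__main__":
    print("resetting host:    ", run_guest(Host(resets_on_msr_write=True), 10))
    print("non-resetting host:", run_guest(Host(resets_on_msr_write=False), 10))
```

In this model the resetting host never lets jiffies advance, so a loop like (2b) that waits for jiffies to increase would spin forever, while the non-resetting host bumps jiffies once per tick.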