Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

Radim Krčmář <rkrcmar@xxxxxxxxxx> · Mon, 21 Sep 2015 17:12:10 +0200

2015-09-20 19:57-0300, Marcelo Tosatti:
> On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
>> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
>> RFC because I haven't explored many potential problems or tested it.
> 
> The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> haven't explored potential problems or tested it? Sorry can't parse it.
> 
>> 
>> [1/2] uses a different algorithm in the guest to start counting from 0.
>> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
>> 
>> A viable alternative would be to implement opt-in features in kvm clock.
>> 
>> And because we probably only broke one old user (the infamous SLES 10), a
>> workaround like this is also possible: (but I'd rather not do that)
> 
> Please describe why SLES 10 breaks in detail: the state of the guest and
> the host before the patch, the state of the guest and host after the
> patch.

1) The guest periodically receives an interrupt that is handled by
   main_timer_handler():
   a) get time using the kvm clock:
      1) write the address to MSR_KVM_SYSTEM_TIME
      2) read tsc and pvclock (tsc_offset, system_time)
      3) time = tsc - tsc_offset + system_time
   b) compute time since the last main_timer_handler()
   c) bump jiffies if enough time has elapsed
2) The guest wants to calibrate loops per jiffy [1]:
   a) read tsc
   b) loop till jiffies increase
   c) compute lpj

Because (1a1) resets system_time to 0 on every call, we read the same
value over and over, so the elapsed time computed in (1b) is always 0,
the condition for (1c) is never true, and jiffies remain constant.  This
is the problem.  The hang happens in (2b) because it is the first place
that depends on jiffies.
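
The sequence above can be modelled with a small user-space sketch.  This
is not SLES 10 or KVM code: every toy_* name is hypothetical, one TSC
tick is assumed to equal 1 ns, and the "host" is reduced to a counter
plus a flag that selects whether a write to the MSR restarts system time.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NS_PER_JIFFY 4000000ULL  /* HZ 250, as in the dmesg: 4 ms per jiffy */

/* Toy pvclock page shared with the guest (hypothetical layout). */
struct toy_pvclock {
    uint64_t tsc_offset;   /* TSC value when the page was last filled */
    uint64_t system_time;  /* guest system time (ns) at that moment */
};

/* Toy host state; one TSC tick is assumed to equal 1 ns. */
struct toy_host {
    uint64_t tsc;          /* free-running counter */
    uint64_t time_base;    /* TSC value that corresponds to system time 0 */
    bool reset_on_write;   /* true models the one-shot-initializer behaviour */
    struct toy_pvclock pv;
};

/* (1a1): the guest writes the pvclock address to the MSR. */
static void toy_write_msr(struct toy_host *h)
{
    if (h->reset_on_write)
        h->time_base = h->tsc;             /* system time restarts from 0 */
    h->pv.tsc_offset = h->tsc;
    h->pv.system_time = h->tsc - h->time_base;
}

/* (1a): write the MSR, then compute time from tsc and pvclock. */
static uint64_t toy_get_time(struct toy_host *h)
{
    toy_write_msr(h);
    return h->tsc - h->pv.tsc_offset + h->pv.system_time;  /* (1a3) */
}

/* Deliver `interrupts` timer interrupts 4 ms apart; return final jiffies. */
static uint64_t toy_run(bool reset_on_write, int interrupts)
{
    struct toy_host h = { .reset_on_write = reset_on_write };
    uint64_t last = 0, jiffies = 0;

    for (int i = 0; i < interrupts; i++) {
        h.tsc += TOY_NS_PER_JIFFY;             /* time passes on the host */
        uint64_t now = toy_get_time(&h);       /* (1a) */
        uint64_t delta = now - last;           /* (1b) */
        while (delta >= TOY_NS_PER_JIFFY) {    /* (1c) */
            jiffies++;
            last += TOY_NS_PER_JIFFY;
            delta -= TOY_NS_PER_JIFFY;
        }
    }
    return jiffies;
}
```

In this model toy_run(false, 10) returns 10 and toy_run(true, 10)
returns 0: with the reset, jiffies never move, so a loop like (2b) would
spin forever.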

> What does SLES10 expect?

That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.

> Is it counting from zero that breaks SLES10?

Not by itself; treating MSR_KVM_SYSTEM_TIME as a one-shot initializer did.
The guest wants to write to MSR_KVM_SYSTEM_TIME as often as it likes
while still keeping the system time;  we used to allow that, so this is
an ABI breakage.  (And we can't even say that the guest's behaviour is
against the spec ...)

---
1: I also did a disassembly, but the reproducer is easier to paste
   (couldn't find debuginfo)
   # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
    -serial stdio -monitor /dev/null -append 'console=ttyS0'

  and you can get a bit further by setting loops per jiffy manually,
    -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'

  The dmesg for the failing run is
    Initializing CPU#0
    PID hash table entries: 512 (order: 9, 16384 bytes)
    kvm-clock: cpu 0, msr 0:3f6041, boot clock
    kvm_get_tsc_khz: cpu 0, msr 0:e001
    time.c: Using tsc for timekeeping HZ 250
    time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
    time.c: Detected 2591.580 MHz processor.
    Console: colour VGA+ 80x25
    Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
    Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
    Checking aperture...
    Nosave address range: 000000000009f000 - 00000000000a0000
    Nosave address range: 00000000000a0000 - 00000000000f0000
    Nosave address range: 00000000000f0000 - 0000000000100000
    Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
    [Infinitely querying kvm clock here ...]

  With '-cpu kvm64,-kvmclock', the next line is
    Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)

  With 'lpj=10399519',
    Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
    [Manages to get stuck later, in default_idle.]