Hi, I'm working on OSv (https://github.com/cloudius-systems/osv), a guest operating system. I've been investigating a phenomena of KVM_SYSTEM_TIME being marked as unstable (PVCLOCK_TSC_STABLE_BIT cleared) by KVM on a modern single-socket CPU since the very beginning of guest's life time. According to the ABI, when pvclock_vcpu_time_info.flags does not have the PVCLOCK_TSC_STABLE_BIT set we must not assume monotonicity and need to compensate for this using atomic cmpxchg. However we would like to avoid that cost unless necessary and since modern single-socket CPUs have a stable TSC this cost seems unnecessary. Setups tested: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz, kernel 3.13.0-29-generic (Ubuntu), Qemu 2.0.0 Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, kernel 3.14.7-200.fc20.x86_64 (Fedora), Qemu 1.6.2 Both CPUs have constant_tsc and nonstop_tsc flags and the host kernel is using TSC as the master clock source. Here's what tracing KVM during QEMU (v2.0.0) startup shows. I used `trace-cmd record -e kvm` and then `trace-cmd report | egrep '(tsc|clock)'`. === 1 vCPU === qemu-system-x86-1353 [003] 75446.815273: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1355 [002] 75446.849021: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446556065789505232 qemu-system-x86-1355 [002] 75446.849024: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc qemu-system-x86-1353 [000] 75446.905313: kvm_write_tsc_offset: vcpu=0 prev=18446556065789505232 next=18446556065789505232 qemu-system-x86-1353 [000] 75446.905315: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 1 hostclock tsc qemu-system-x86-1353 [000] 75446.925064: kvm_write_tsc_offset: vcpu=0 prev=18446556065789505232 next=18446556065789505232 qemu-system-x86-1353 [000] 75446.925065: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 2 nr_online 1 hostclock tsc qemu-system-x86-1353 [000] 75446.925119: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1355 [000] 75446.932931: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 === 2 vCPUs === qemu-system-x86-1431 [003] 75539.014452: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1433 [003] 75539.054169: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446555836061830054 qemu-system-x86-1433 [003] 75539.054171: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc qemu-system-x86-1434 [003] 75539.078266: kvm_write_tsc_offset: vcpu=1 prev=0 next=18446555836061830054 qemu-system-x86-1434 [003] 75539.078269: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.134695: kvm_write_tsc_offset: vcpu=0 prev=18446555836061830054 next=18446555836061830054 qemu-system-x86-1431 [002] 75539.134698: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 2 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.142465: kvm_write_tsc_offset: vcpu=1 prev=18446555836061830054 next=18446555836061830054 qemu-system-x86-1431 [002] 75539.142468: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 3 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.182614: kvm_write_tsc_offset: vcpu=0 prev=18446555836061830054 next=18446555836061830054 qemu-system-x86-1431 [002] 75539.182617: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 4 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.182758: kvm_write_tsc_offset: vcpu=1 prev=18446555836061830054 next=18446555836061830054 qemu-system-x86-1431 [002] 75539.182758: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 5 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.182840: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1433 [003] 75539.194415: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1434 [001] 75539.218634: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 When "masterclock" is 0 it means the PVCLOCK_TSC_STABLE_BIT will not be set. It is not set because offsets are supposedly not matched. However according to the kvm_track_tsc tracepoints each TSC write is matched. But there are 3 writes per-CPU which makes ka->nr_vcpus_matched_tsc to become larger than kvm->online_vcpus which makes the following check in pvclock_update_vm_gtod_copy() to fail: vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 == atomic_read(&kvm->online_vcpus)); I turns out that kvm_write_tsc() is called with the target TSC value of 0 from these 3 places per each vCPU: The first one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm] 0xffffffffa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm] 0xffffffff811fd0f0 0xffffffff811fd331 (inexact) 0xffffffff81121eb6 (inexact) 0xffffffff81700869 (inexact) The second one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel] 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm] 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm] 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm] 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm] 0xffffffff811fd0f0 0xffffffff811fd331 (inexact) 0xffffffff81121eb6 (inexact) 0xffffffff81700869 (inexact) #0 kvm_vcpu_ioctl (cpu=0x55555637ac40, type=type@entry=0x4008ae89) at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780 #1 0x000055555585ed82 in kvm_put_msrs (cpu=cpu@entry=0x55555637ac40, level=level@entry=0x3) at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270 #2 0x0000555555861edd in kvm_arch_put_registers (cpu=<optimized out>, level=0x3) at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909 #3 0x000055555582b1a2 in kvm_cpu_synchronize_post_init (cpu=cpu@entry=0x55555637ac40) at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641 #4 0x00005555557cef5a in cpu_synchronize_post_init (cpu=0x55555637ac40) at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330 #5 cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521 #6 0x00005555555f0b1d in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390 The third one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel] 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm] 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm] 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm] 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm] 0xffffffff811fd0f0 0xffffffff811fd331 (inexact) 0xffffffff81121eb6 (inexact) 0xffffffff81700869 (inexact) #0 kvm_vcpu_ioctl (cpu=0x55555637ac40, type=type@entry=0x4008ae89) at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780 #1 0x000055555585ed82 in kvm_put_msrs (cpu=cpu@entry=0x55555637ac40, level=level@entry=0x2) at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270 #2 0x0000555555861edd in kvm_arch_put_registers (cpu=<optimized out>, level=0x2) at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909 #3 0x000055555582b152 in kvm_cpu_synchronize_post_reset (cpu=cpu@entry=0x55555637ac40) at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635 #4 0x00005555557ceeda in cpu_synchronize_post_reset (cpu=0x55555637ac40) at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323 #5 cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512 #6 0x00005555555f0c5d in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482 On Qemu 1.6.2 the extra writes are happening only when 1 vCPU is used because of an explicit conditional check in kvm_put_msrs(). With larger number of vCPUs the extra writes do not occur and then TSC is marked as stable. Since Qemu commit f86746c263753cf7a7e4bdb8829c70272dfcf36c the TSC is written unconditionally and hence with Qemu 2.0.0 TSC is marked as unstable for any number of vCPUs. It looks like these extra writes are not anticipated by the TSC synchronization logic and cause off-by-n errors on the nr_vcpus_matched_tsc counter. If there was no overflow of that variable the TSC would be marked as stable for kvmclock. It seems that current logic works in a way that TSC writes which occur on all vCPUs approximately at the same time (within threshold) on a system with a stable TSC will cause a common TSC offset to be used and PVCLOCK_TSC_STABLE_BIT to be set. Perhaps this logic could be adjusted to tolerate these extra TSC writes so that if they match, TSC will be marked as stable for kvmclock. Before going forward with any fixes I would like to know what are your thoughts about this. -- Tomek -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html