This is a summary of a few things that I think are relevant.

Best regards,
Maxim Levitsky

# Random unsynchronized ramblings about the TSC in KVM/Linux

## The KVM's master clock

Under the assumption that:

a. The host TSC is synchronized and stable (it wasn't marked as unstable).

b. The guest TSC is synchronized:

  - When the guest starts running, all its vCPUs start from TSC = 0.
  - If we hotplug a vCPU, its TSC is set to 0, but 'kvm_synchronize_tsc' (formerly called kvm_write_tsc), practically speaking, just sets the TSC to the same value that the other vCPUs have right now. Later, Linux will try to sync it again, but since it is already synchronized it won't do anything. Otherwise it uses the IA32_TSC_ADJUST MSR to adjust it.
  - If the guest writes the TSC, we try to detect whether the TSC is still synced. (We don't handle TSC adjustments done via the IA32_TSC_ADJUST MSR.)

the kvmclock is driven by a single pair of (nsecs, tsc), and that pair is used to update the kvmclock on all vCPUs.

The advantage of this is that no random error can be introduced by calculating this pair on each CPU. Another advantage is that, since the pair is vCPU invariant, the guest's kvmclock implementation can read it in userspace. This is signaled by setting the KVM_CLOCK_TSC_STABLE flag via the kvmclock interface.

## KVM behavior when the host TSC is detected as unstable

* On each 'userspace' VM entry (aka 'vcpu_load'), we set the guest TSC to the last value captured on the last userspace VM exit, and we schedule a KVM clock update (KVM_REQ_GLOBAL_CLOCK_UPDATE).
* On each KVM clock update, we 'catch up' the TSC to the kernel clock.
* We don't use the master clock.

## The TSC 'features' in the Linux kernel

The Linux kernel has, roughly speaking, 4 CPU 'features' that define its treatment of the TSC. Some of these features are set from CPUID, some are set when certain CPU models are detected, and some are set when a specific hypervisor is detected.

* X86_FEATURE_TSC_KNOWN_FREQ

  This is the most harmless feature.
  It is set by various hypervisors (including kvmclock), and for some Intel models, when the TSC frequency can be obtained via an interface (PV, CPUID, MSR, etc.) rather than by measuring it.

* X86_FEATURE_NONSTOP_TSC

  On real hardware, this feature is set when CPUID has the 'invtsc' bit set. It tells the kernel that the TSC doesn't stop in low power idle states. When it is absent, an attempt to enter a low power idle state (e.g. C2) will mark the TSC as unstable. This feature also has a friend called X86_FEATURE_NONSTOP_TSC_S3, which doesn't do anything that is relevant to KVM.

  In a VM, the vCPU is interrupted often, but as long as the host TSC is stable, the guest TSC should remain stable as well. (The guest TSC 'keeps on running' when the guest CPU is not running, which is the same situation as a real CPU being in a low power/waiting state.)

  On the other hand, not exposing this bit to the guest doesn't cause much harm, since the guest usually doesn't use idle states and thus never marks the TSC as unstable due to the lack of this bit. (The exception is the cpu-pm=on case, which is TBD.)

* X86_FEATURE_CONSTANT_TSC

  On real hardware, this bit informs the kernel that the TSC frequency doesn't change with CPU frequency changes. If it is not set, the TSC is marked as unstable on the first cpufreq update. On real hardware, this bit is also set by the 'invtsc' CPUID bit, and it is additionally set on a few older Intel models which lack 'invtsc' but still have a constant TSC.

  In a VM, once again, there is no cpufreq driver, so the lack of this bit doesn't cause much harm. However, if a VM is migrated to a host with a different CPU frequency, and TSC scaling is not supported, then the guest will see a jump in the frequency.
  This is why QEMU doesn't enable 'invtsc' by default and blocks migration when it is enabled, unless you also explicitly specify the frequency at which the guest's TSC should run. In that case QEMU will attempt to scale the host TSC to this value using hardware means, and I think QEMU will fail to start the VM if that fails.

* X86_FEATURE_TSC_RELIABLE

  This feature is only for virtualization (although I have seen some Atom-specific code piggyback on it), and it makes the guest trust the TSC more (kind of 'trust us, you can count on the TSC').

  It makes the guest do two things:

  1. Disable the clocksource watchdog (a periodic mechanism by which the kernel compares the TSC with other clocksources and, if it sees something fishy, marks the TSC as unstable).
  2. Disable TSC sync on vCPU hotplug, relying on the hypervisor to sync it instead.

  (IMHO we should set this flag for KVM as well.)
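
To see which of the four features above a given kernel (host or guest) ended up with, one option is to look at the `flags` line in `/proc/cpuinfo`. The following is a minimal sketch, not a definitive tool. It assumes that the flags appear there under the lowercase names `tsc_known_freq`, `nonstop_tsc`, `constant_tsc` and `tsc_reliable`; the function names are made up for illustration.

```python
# Assumption: these are the /proc/cpuinfo spellings of the X86_FEATURE_*
# constants discussed above (lowercase, without the X86_FEATURE_ prefix).
TSC_FLAGS = ("tsc_known_freq", "nonstop_tsc", "constant_tsc", "tsc_reliable")


def tsc_flags_from_cpuinfo(cpuinfo_text):
    """Return {flag: bool} based on the first 'flags' line in cpuinfo_text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the key exactly, so that lines such as 'vmx flags' are skipped.
        if sep and key.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in TSC_FLAGS}
    # No 'flags' line at all (e.g. not an x86 system): report all as absent.
    return {flag: False for flag in TSC_FLAGS}


def read_tsc_flags(path="/proc/cpuinfo"):
    """Convenience wrapper that reads the real file on a Linux system."""
    with open(path) as f:
        return tsc_flags_from_cpuinfo(f.read())
```

Calling `read_tsc_flags()` inside a guest and on its host gives a quick side-by-side view of how the two kernels treat the TSC. Only the first CPU's flags are inspected, on the assumption that all CPUs report the same set.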