Re: [PATCH 0/2] RFC: Precise TSC migration (summary)

Maxim Levitsky <mlevitsk@xxxxxxxxxx> · Mon, 30 Nov 2020 15:38:18 +0200

This is a summary of a few things that I think are relevant.

Best regards,
	Maxim Levitsky

# Random unsynchronized ramblings about the TSC in KVM/Linux

## KVM's master clock

Under the assumption that

a. Host TSC is synchronized and stable (wasn't marked as unstable).

b. Guest TSC is synchronized:

   - When guest starts running, all its vCPUs start from TSC = 0.

   - If we hotplug a vCPU, its TSC is set to 0, but 'kvm_synchronize_tsc'
     (which used to be called kvm_write_tsc), practically speaking,
     just sets the TSC to the same value that the other vCPUs have right now.

     Later Linux will try to sync it again, but since it is already synchronized
     it won't do anything. Otherwise it uses the IA32_TSC_ADJUST MSR to adjust it.

   - If the guest writes the TSC, we try to detect whether the TSC is still synced.
     (We don't handle TSC adjustments done via the IA32_TSC_ADJUST MSR).

Then kvmclock is driven by a single (nsecs, tsc) pair, which is used
to update the kvmclock on all vCPUs.

The advantage of this is that no random error can be introduced by calculating
this pair on each CPU.

Another advantage is that, because the pair is vCPU invariant, the guest's
kvmclock implementation can read it in userspace.
This is signaled by setting the KVM_CLOCK_TSC_STABLE flag via the kvmclock interface.
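
As a rough illustration of why a single shared pair helps, here is a minimal sketch. All names (`ClockPair`, `kvmclock_ns`) and all numbers are hypothetical, and the plain integer arithmetic stands in for the real mul/shift scaling; it only shows that readings derived from one pair agree exactly, while pairs sampled separately per vCPU carry their own (here made-up) sampling error.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000


@dataclass(frozen=True)
class ClockPair:
    """A (nsecs, tsc) snapshot: host time in ns and the TSC read at that moment."""
    nsecs: int
    tsc: int


def kvmclock_ns(pair, tsc_now, tsc_hz):
    """Extrapolate the clock from a snapshot using the cycles elapsed since it."""
    return pair.nsecs + (tsc_now - pair.tsc) * NSEC_PER_SEC // tsc_hz


TSC_HZ = 2_000_000_000          # hypothetical 2 GHz TSC
master = ClockPair(nsecs=5_000_000_000, tsc=10_000_000_000)
tsc_now = 10_000_002_000        # all vCPUs read the same (synchronized) TSC

# Master clock: every vCPU derives its time from the same pair.
shared = [kvmclock_ns(master, tsc_now, TSC_HZ) for _vcpu in range(4)]

# No master clock: each vCPU samples its own pair, with a made-up error in ns.
sampling_error_ns = [0, 3, -2, 7]
per_vcpu = [
    kvmclock_ns(ClockPair(master.nsecs + err, master.tsc), tsc_now, TSC_HZ)
    for err in sampling_error_ns
]

print(shared)    # identical values
print(per_vcpu)  # values differ by the per-vCPU sampling error
```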

## KVM behavior when the host TSC is detected as unstable

* On each 'userspace' VM entry (aka 'vcpu_load'), we set the guest TSC to the
  value captured on the last userspace VM exit,
  and we schedule a KVM clock update (KVM_REQ_GLOBAL_CLOCK_UPDATE).

* On each KVM clock update, we 'catch up' the TSC to the kernel clock.

* We don't use the masterclock.
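
The steps above can be modeled with a small toy, shown here only as a sketch. The class `VcpuTsc` and its methods are hypothetical and do not mirror KVM's data structures; the assumption that catch-up only ever moves the guest TSC forward is mine and is marked in the code.

```python
NSEC_PER_SEC = 1_000_000_000


class VcpuTsc:
    """Toy model of one vCPU's guest TSC when the host TSC is unstable."""

    def __init__(self, tsc_hz):
        self.tsc_hz = tsc_hz
        self.guest_tsc = 0
        self.last_exit_tsc = 0
        self.clock_update_pending = False

    def vm_exit_to_userspace(self):
        # Capture the guest TSC value seen at the last userspace VM exit.
        self.last_exit_tsc = self.guest_tsc

    def vcpu_load(self):
        # Restore the captured value and schedule a clock update.
        self.guest_tsc = self.last_exit_tsc
        self.clock_update_pending = True

    def clock_update(self, kernel_ns):
        # 'Catch up' the TSC to the kernel clock.
        # Assumption: the guest TSC is only ever moved forward.
        expected = kernel_ns * self.tsc_hz // NSEC_PER_SEC
        if expected > self.guest_tsc:
            self.guest_tsc = expected
        self.clock_update_pending = False


vcpu = VcpuTsc(tsc_hz=1_000_000_000)
vcpu.guest_tsc = 500
vcpu.vm_exit_to_userspace()
vcpu.guest_tsc = 123_456            # whatever an unstable host TSC would yield
vcpu.vcpu_load()                    # back to 500, update scheduled
vcpu.clock_update(kernel_ns=2_000)  # kernel clock says 2000 ns have elapsed
print(vcpu.guest_tsc)
```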

## The TSC 'features' in the Linux kernel

The Linux kernel has, roughly speaking, four CPU 'features' that define its
treatment of the TSC. Some of these features are set from CPUID, some are set
when certain CPU models are detected, and some are set when a specific
hypervisor is detected.
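
The features discussed in this section show up as lowercase flags in /proc/cpuinfo, so a quick way to see what a given host or guest kernel decided is to look there. The helper below is a minimal sketch (the function name `tsc_features` and the sample text are made up for illustration); it parses the text of /proc/cpuinfo rather than reading the file itself, so it can be tried on any machine.

```python
# Flag names as they appear in the "flags" line of /proc/cpuinfo.
TSC_FLAGS = ("tsc_known_freq", "nonstop_tsc", "constant_tsc", "tsc_reliable")


def tsc_features(cpuinfo_text):
    """Report which TSC-related flags the first CPU in cpuinfo_text has."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in TSC_FLAGS}
    # No "flags" line found at all.
    return {flag: False for flag in TSC_FLAGS}


# Made-up sample resembling a bare-metal host without tsc_reliable.
sample = (
    "processor\t: 0\n"
    "flags\t\t: fpu tsc constant_tsc nonstop_tsc tsc_known_freq\n"
)
print(tsc_features(sample))
```

On a real system the same function can be fed the contents of the file, e.g. `tsc_features(open("/proc/cpuinfo").read())`.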

* X86_FEATURE_TSC_KNOWN_FREQ

  This is the most harmless feature. It is set by various hypervisors
  (including kvmclock), and for some Intel models, when the TSC frequency can
  be obtained via an interface (PV, CPUID, MSR, etc.) rather than by measuring it.

* X86_FEATURE_NONSTOP_TSC

  On real hardware, this feature is set when CPUID has the 'invtsc' bit set.
  It tells the kernel that the TSC doesn't stop in low power idle states.
  When it is absent, an attempt to enter a low power idle state (e.g. C2) will
  mark the TSC as unstable.

  This feature also has a sibling called X86_FEATURE_NONSTOP_TSC_S3,
  which doesn't do anything that is relevant to KVM.

  In a VM, the vCPU is interrupted often, but as long as the host TSC
  is stable, the guest TSC should remain stable as well.

  (The guest TSC 'keeps on running' when the guest CPU is not 
  running, which is the same thing as the situation in which
  a real CPU is in low power/waiting state)

  However, not exposing this bit to the guest doesn't cause much harm, since the
  guest usually doesn't use idle states, and thus never marks the TSC
  as unstable due to the lack of this bit.
  (The exception is the cpu-pm=on case, which is TBD)

* X86_FEATURE_CONSTANT_TSC

  On real hardware this bit informs the kernel that the TSC frequency doesn't
  change with CPU frequency changes. If it is not set, the TSC is marked as
  unstable on the first cpufreq update.
  On real hardware this bit is also set by the 'invtsc' CPUID bit, and is set
  on a few older Intel models which lack that bit but still have a constant TSC.

  In a VM, once again, there is no cpufreq driver, so the lack of this bit
  doesn't cause much harm.

  However, if a VM is migrated to a host with a different TSC frequency
  and TSC scaling is not supported, the guest will see a jump in the
  frequency.

  This is why qemu doesn't enable 'invtsc' by default, and blocks migration
  when it is enabled, unless you also explicitly specify the frequency the
  guest’s TSC should run at. In that case qemu will attempt to scale the host
  TSC to this value using hardware means, and I think qemu will fail VM start
  if it can't.
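
  To make the frequency jump and its fix concrete, here is a minimal sketch of fixed-point TSC scaling. The function names are hypothetical, and the choice of 48 fractional bits is an assumption for illustration only (the real width depends on the hardware); the point is that a fixed ratio lets the guest keep seeing its configured frequency on a faster host.

```python
FRAC_BITS = 48  # assumption for illustration; the real width is hardware specific


def scaling_ratio(guest_hz, host_hz, frac_bits=FRAC_BITS):
    """Fixed-point ratio guest_hz / host_hz with frac_bits fractional bits."""
    return (guest_hz << frac_bits) // host_hz


def scale_tsc(host_cycles, ratio, frac_bits=FRAC_BITS):
    """Convert elapsed host TSC cycles into guest TSC cycles."""
    return (host_cycles * ratio) >> frac_bits


GUEST_HZ = 2_000_000_000   # frequency the guest was started with
NEW_HOST_HZ = 4_000_000_000  # destination host after migration

one_second_of_host_cycles = NEW_HOST_HZ

# Without scaling the guest would count 4e9 cycles per second: a 2x jump.
unscaled = one_second_of_host_cycles

# With scaling the guest still counts 2e9 cycles per second.
ratio = scaling_ratio(GUEST_HZ, NEW_HOST_HZ)
scaled = scale_tsc(one_second_of_host_cycles, ratio)

print(unscaled, scaled)
```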

* X86_FEATURE_TSC_RELIABLE

  This is a feature only for virtualization (although I have seen some Atom
  specific code piggyback on it), and it makes the guest trust the TSC more
  (kind of 'trust us, you can count on the TSC').

  It makes the guest do two things:

  1. Disable the clocksource watchdog (a periodic mechanism by which the
     kernel compares the TSC with other clocksources and, if it sees
     something fishy, marks the TSC as unstable).

  2. Disable TSC sync on new vCPU hotplug, relying instead on the hypervisor
     to sync it.
     (IMHO we should set this flag for KVM as well)
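
  For readers unfamiliar with the watchdog mentioned in item 1, the idea can be sketched as follows. This is not the kernel's implementation: the function names, the sample intervals and the threshold value are all hypothetical. It only shows the comparison of elapsed time as measured by the TSC against a reference clocksource over the same interval.

```python
def tsc_is_suspicious(tsc_delta_ns, ref_delta_ns, threshold_ns):
    """True if the two clocksources disagree by more than threshold_ns."""
    return abs(tsc_delta_ns - ref_delta_ns) > threshold_ns


def run_watchdog(samples, threshold_ns):
    """Return the index of the first interval that would mark the TSC
    unstable, or None if every interval looks fine."""
    for index, (tsc_delta_ns, ref_delta_ns) in enumerate(samples):
        if tsc_is_suspicious(tsc_delta_ns, ref_delta_ns, threshold_ns):
            return index
    return None


THRESHOLD_NS = 62_500_000  # hypothetical threshold, for illustration only

# (elapsed ns according to TSC, elapsed ns according to reference clocksource)
samples = [
    (500_000_000, 500_000_100),  # tiny skew, fine
    (500_000_000, 499_999_900),  # tiny skew, fine
    (700_000_000, 500_000_000),  # TSC ran 200 ms ahead: something fishy
]

print(run_watchdog(samples, THRESHOLD_NS))
```

  With X86_FEATURE_TSC_RELIABLE set, the guest skips this kind of check entirely and simply trusts the TSC.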