Re: [PATCH 0/2] RFC: Precise TSC migration

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Tue, 01 Dec 2020 20:35:48 +0100

On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> The idea of the masterclock is that when the host TSC is synchronized
> (or, as the kernel calls it, stable), and the guest TSC is synchronized as
> well, then we can base the kvmclock on the same pair of
> (host time in nsec, host tsc value) for all vCPUs.

That's correct. All guest CPUs should see exactly the same TSC value,
i.e.

        hostTSC + vcpu_offset
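
To spell that out, here is a minimal sketch in C. The helper names
guest_tsc() and guest_tsc_synchronized() are made up for illustration and
are not kernel or KVM functions; the only thing taken from the discussion
is the relation guest TSC = host TSC + per-vCPU offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: the TSC value a vCPU observes is the host TSC
 * plus that vCPU's offset. Unsigned arithmetic wraps modulo 2^64, so a
 * "negative" offset is simply stored as its two's complement.
 */
static uint64_t guest_tsc(uint64_t host_tsc, uint64_t vcpu_offset)
{
	return host_tsc + vcpu_offset;
}

/*
 * Hypothetical check: with a synchronized host TSC, all vCPUs observe
 * the same guest TSC exactly when all of their offsets are equal.
 */
static bool guest_tsc_synchronized(const uint64_t *offsets, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (offsets[i] != offsets[0])
			return false;
	}
	return true;
}
```

The point of the second helper is that, under the assumption of a
synchronized host TSC, the question "do all vCPUs see the same TSC?"
reduces to comparing offsets.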

> This makes the random error in calculation of this value invariant
> across vCPUs, and allows the guest to do kvmclock calculation in userspace
> (vDSO) since kvmclock parameters are vCPU invariant.

That's not the case today? OMG!
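
For reference, a rough sketch of the kind of calculation being discussed,
modelled on the pvclock scheme (a TSC delta scaled by a shift and a 32-bit
fixed-point multiplier). The struct and function names are invented for
this example, and unsigned __int128 is a GCC/Clang extension assumed here
for the wide multiply:

```c
#include <stdint.h>

/* Hypothetical parameter block, loosely modelled on pvclock. */
struct clock_params {
	uint64_t tsc_timestamp;   /* TSC value when the pair was sampled */
	uint64_t system_time_ns;  /* time in ns at that same moment */
	uint32_t tsc_to_ns_mul;   /* fraction with 32 fractional bits */
	int8_t   tsc_shift;       /* pre-scaling shift, may be negative */
};

/* Convert a TSC reading to nanoseconds using one parameter block. */
static uint64_t clock_read_ns(const struct clock_params *p, uint64_t tsc)
{
	uint64_t delta = tsc - p->tsc_timestamp;

	if (p->tsc_shift < 0)
		delta >>= -p->tsc_shift;
	else
		delta <<= p->tsc_shift;

	/* 64x32 multiply in 128 bits, then drop the 32 fractional bits. */
	return p->system_time_ns +
	       (uint64_t)(((unsigned __int128)delta * p->tsc_to_ns_mul) >> 32);
}
```

If every vCPU is handed the same parameter block and sees the same TSC,
the result is the same no matter which vCPU the reading thread runs on,
which is what makes a vDSO implementation workable.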

> To ensure that the guest tsc is synchronized we currently track host/guest tsc
> writes, and enable the master clock only when roughly the same guest TSC value
> was written across all vCPUs.

The Linux kernel never writes the TSC. We tried that ~15 years ago
and it was a total disaster.

> Recently this was disabled by Paolo and I agree with this, because I think
> that we indeed should only make the guest TSC synchronized by default
> (including newly hotplugged vCPUs) and not do any tsc synchronization beyond
> that. (Trying to guess when the guest syncs the TSC can cause more harm than
> good).
>
> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> but rather use IA32_TSC_ADJUST which currently doesn't participate
> in the tsc sync heuristics.

The kernel only writes TSC_ADJUST when it is advertised in CPUID and:

    1) when the boot CPU detects a non-zero TSC_ADJUST value it writes
       it to 0, except when running on SGI-UV

    2) when a starting CPU has a different TSC_ADJUST value than the
       first CPU which came up on the same socket.

    3) when the first CPU of a different socket is starting and the TSC
       synchronization check fails against a CPU on an already checked
       socket then the kernel tries to adjust TSC_ADJUST to the point
       that the synchronization check does not fail anymore.
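
As a compact restatement of the three cases, here is a sketch of the
decision "would the kernel write TSC_ADJUST for this CPU?". The struct,
its fields and the function name are hypothetical and only mirror the
list above; this is not the kernel's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what is known when a CPU comes up. */
struct cpu_bringup {
	bool    has_tsc_adjust;     /* TSC_ADJUST advertised in CPUID */
	bool    is_boot_cpu;
	bool    is_sgi_uv;          /* platform exception for case 1 */
	bool    first_on_socket;    /* first CPU to start on its socket */
	int64_t tsc_adjust;         /* value found in TSC_ADJUST */
	int64_t socket_ref_adjust;  /* value of first CPU on same socket */
	bool    sync_check_failed;  /* check against an already checked socket */
};

static bool would_write_tsc_adjust(const struct cpu_bringup *c)
{
	if (!c->has_tsc_adjust)
		return false;

	/* Case 1: boot CPU finds a non-zero value (except on SGI-UV). */
	if (c->is_boot_cpu)
		return c->tsc_adjust != 0 && !c->is_sgi_uv;

	/* Case 2: value differs from the first CPU on the same socket. */
	if (!c->first_on_socket)
		return c->tsc_adjust != c->socket_ref_adjust;

	/* Case 3: first CPU of another socket fails the sync check. */
	return c->sync_check_failed;
}
```

In a guest with sane, identical TSC_ADJUST values and a passing
synchronization check, every branch returns false and nothing is written.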

> And as far as I know, the Linux guest is the primary (only?) user of kvmclock.
>
> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> in the documentation to state that it only guarantees invariance if the guest
> doesn't mess with its own TSC.
>
> Also I think we should consider enabling X86_FEATURE_TSC_RELIABLE
> in the guest kernel when KVM is detected, to keep the guest from even trying
> to sync the TSC on newly hotplugged vCPUs.
>
> (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> in some cases due to scheduling of guest vCPUs)

The only cases it would try to write are #3 above or because the
hypervisor or BIOS messed it up (#1, #2).

> (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> and the TSC clocksource watchdog, and the latter we might want to keep).

Depends. If the host TSC is stable and synchronized, then you don't need
the TSC watchdog. We are slowly starting to trust the TSC to some extent
and phase out the watchdog for newer parts (hopefully).

If the host TSC really falls apart then it still can invalidate
KVM_CLOCK and force the guest to reevaluate the situation.

> A few more random notes:
>
> I have a weird feeling about using 'nsec since 1 January 1970'.
> Common sense is telling me that a 64 bit value can hold about 580
> years,

which is plenty.
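
The arithmetic is easy to check. A small sketch, assuming a year of
365.25 days (the constant and function names are chosen for this example
only):

```c
#include <stdint.h>

#define NSEC_PER_SEC_EXAMPLE   1000000000ULL
/* Assumption: 365.25 days per year, i.e. 31557600 seconds. */
#define SEC_PER_YEAR_EXAMPLE   31557600ULL

/* Whole years representable by an unsigned 64-bit nanosecond count. */
static uint64_t years_in_u64_ns(void)
{
	return UINT64_MAX / (NSEC_PER_SEC_EXAMPLE * SEC_PER_YEAR_EXAMPLE);
}

/* Whole years representable by a signed 64-bit nanosecond count. */
static uint64_t years_in_s64_ns(void)
{
	return (uint64_t)INT64_MAX /
	       (NSEC_PER_SEC_EXAMPLE * SEC_PER_YEAR_EXAMPLE);
}
```

That gives 584 whole years unsigned, which matches the "about 580"
estimate, and 292 years if the value is treated as signed.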

> but still I see that it is more common to use timespec which is a
> (sec,nsec) pair.

timespecs are horrible.

Thanks,

        tglx