Hi Sean,

A quick question ...

On 10/2/23 17:53, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote:
>>> On Mon, Oct 02, 2023, David Woodhouse wrote:
>>>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>>>
>>>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>>>
>>>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
>>>>>
>>>>
>>>> That just seems wrong. I don't mean that you're incorrect; it seems
>>>> *morally* wrong.
>>>>
>>>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>>>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>>>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
>>>>
>>>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>>>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>>>
>>>> Fix that, and the whole problem goes away, doesn't it?
>>>>
>>>> What am I missing here, that means we can't do that?
>>>
>>> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
>>> ABI between KVM and KVM guests.
>>>
>>> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
>>> of making things kinda sorta work with old hardware, i.e. was probably the least
>>> awful solution in the days before constant TSCs, but is completely nonsensical on
>>> modern hardware.
>>
>> I still don't understand. The ABI and its math are fine. The ABI is just
>> "at time X the TSC was Y, and the TSC frequency is Z"
>>
>> I understand why on older hardware, those values needed to *change*
>> occasionally when TSC stupidity happened.
>>
>> But on newer hardware, surely we can set them precisely *once* when the
>> VM starts, and never ever have to change them again? Theoretically not
>> even when we pause the VM, kexec into a new kernel, and resume the VM!
>>
>> But we *are* having to change it, because apparently
>> CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at
>> precisely the frequency of the known and constant TSC.
>>
>> But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole
>> point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by
>> NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock,
>> with no skew at all?
>
> IIUC, the issue is that the paravirt clock ends up mixing time domains, i.e. is
> a weird bastardization of the host's monotonic raw clock and the paravirt clock.
>
> Despite a name that suggests otherwise (to me at least), __pvclock_read_cycles()
> counts "cycles" in nanoseconds, not TSC ticks.
>
>   u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
>   {
>           u64 delta = tsc - src->tsc_timestamp;
>           u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
>                                            src->tsc_shift);
>           return src->system_time + offset;
>   }
>
> In the above, "offset" is computed in the kvmclock domain, whereas system_time
> comes from the host's CLOCK_MONOTONIC_RAW domain by way of master_kernel_ns.
> The goofy math is much more obvious in __get_kvmclock(), i.e. KVM's host-side
> retrieval of the guest's view of kvmclock:
>
>         hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>
> The two domains use the same "clock" (constant TSC), but different math to compute
> nanoseconds from a given TSC value.
> For decently large TSC values, this results
> in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
>
> When KVM decides to refresh the masterclock, e.g. vCPU hotplug in Dongli's case,
> it resets the base time, a.k.a. system_time. I.e. KVM takes all of the time that
> was accumulated in the kvmclock domain and recomputes it in the CLOCK_MONOTONIC_RAW
> domain. The more time that passes between refreshes, the bigger the time jump
> from the guest's perspective.
>
> E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by
> undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add"
> and the "subtract", but the underlying train wreck of mixing time domains is
> still there.
>
> Without a constant TSC, deferring the reference time to the host's computation
> makes sense (or at least, is less silly), because the effective TSC would be per
> physical CPU, whereas the reference time is per VM.
>
> [*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org
>
>> And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really
>> have to keep resetting the kvmclock to it at all? On modern hardware
>> can't the kvmclock be defined by the TSC alone?
>
> I think there is still use for synchronizing with the host's view of time, e.g.
> to deal with lost time across host suspend+resume.
>
> So I don't think we can completely sever KVM's paravirt clocks from host time,
> at least not without harming use cases that rely on the host's view to keep
> accurate time. And honestly at that point, the right answer would be to stop
> advertising paravirt clocks entirely.
>
> But I do think we can address the issues that Dongli and David are observing
> where guest time drifts even though the host kernel's base time hasn't changed.
> If I've pieced everything together correctly, the drift can be eliminated simply
> by using the paravirt clock algorithm when converting the delta from the raw TSC
> to nanoseconds.
>
> This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> about all I've tested.
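
To convince myself of how large the mixed-domain error can get, I put together
the toy userspace program below. To be clear about the assumptions: this is not
kernel code, pvclock_ns() only mimics pvclock_scale_delta(), clocksource_ns()
only mimics the timekeeper's mult/shift conversion, and the shift values (-1
and 24) plus the 2999976000 Hz TSC frequency are illustrative guesses at the
kind of constants kvm_get_time_scale() and clocks_calc_mult_shift() produce,
not values dumped from a real host.

/*
 * drift_demo.c -- toy userspace program, NOT kernel code.
 *
 * Converts the same TSC delta to nanoseconds twice: once with pvclock-style
 * math (shift the delta, then a 32.32 fixed-point multiply) and once with
 * clocksource-style math (delta * mult >> shift), using independently rounded
 * constants, then prints how far apart the two results end up.
 */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Mimics pvclock_scale_delta(): shift the delta, then multiply by mul/2^32. */
static uint64_t pvclock_ns(uint64_t delta, uint32_t mul, int shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Mimics the timekeeper's conversion: ns = (delta * mult) >> shift. */
static uint64_t clocksource_ns(uint64_t delta, uint32_t mult, unsigned int shift)
{
        return (uint64_t)(((unsigned __int128)delta * mult) >> shift);
}

int main(void)
{
        /* Assumed constant TSC frequency; deliberately not a round number. */
        const uint64_t tsc_hz = 2999976000ULL;

        /*
         * kvmclock-style constants: for a 2-4 GHz TSC, kvm_get_time_scale()
         * ends up with shift = -1 and a truncated
         * mul = 2^32 * NSEC_PER_SEC / (tsc_hz / 2).
         * (My reading of the function; treat it as an assumption.)
         */
        const int      kvm_shift = -1;
        const uint32_t kvm_mul   =
                (uint32_t)(((unsigned __int128)NSEC_PER_SEC << 32) / (tsc_hz >> 1));

        /*
         * Clocksource-style constants: shift = 24 is only illustrative of what
         * clocks_calc_mult_shift() might pick; mult is rounded to nearest.
         */
        const unsigned int cs_shift = 24;
        const uint32_t     cs_mult  =
                (uint32_t)((((unsigned __int128)NSEC_PER_SEC << cs_shift) +
                            tsc_hz / 2) / tsc_hz);

        /* One day's worth of TSC ticks. */
        const uint64_t delta = tsc_hz * 86400ULL;

        uint64_t kvm_time = pvclock_ns(delta, kvm_mul, kvm_shift);
        uint64_t raw_time = clocksource_ns(delta, cs_mult, cs_shift);

        printf("pvclock-style     : %llu ns\n", (unsigned long long)kvm_time);
        printf("clocksource-style : %llu ns\n", (unsigned long long)raw_time);
        printf("divergence        : %lld ns over 24h\n",
               (long long)(kvm_time - raw_time));
        return 0;
}

Since both mult/shift pairs are independently rounded approximations of the
same NSEC_PER_SEC / tsc_hz ratio, whatever disagreement exists between them
grows linearly with the delta, which matches the "bigger jump after a longer
gap between refreshes" behavior described above.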
>
> ---
>  arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++---------------
>  1 file changed, 42 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..3ba7edfca47c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2417,6 +2417,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
>  static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
>  #endif
>  
> +static u32 host_tsc_to_system_mul;
> +static s8 host_tsc_shift;
> +
>  static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>  static unsigned long max_tsc_khz;
>  
> @@ -2812,27 +2815,18 @@ static u64 read_tsc(void)
>  static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
>                            int *mode)
>  {
> -        u64 tsc_pg_val;
> -        long v;
> +        u64 ns, v;
>  
>          switch (clock->vclock_mode) {
>          case VDSO_CLOCKMODE_HVCLOCK:
> -                if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
> -                                         tsc_timestamp, &tsc_pg_val)) {
> -                        /* TSC page valid */
> +                if (hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp, &v))
>                          *mode = VDSO_CLOCKMODE_HVCLOCK;
> -                        v = (tsc_pg_val - clock->cycle_last) &
> -                                clock->mask;
> -                } else {
> -                        /* TSC page invalid */
> +                else
>                          *mode = VDSO_CLOCKMODE_NONE;
> -                }
>                  break;
>          case VDSO_CLOCKMODE_TSC:
>                  *mode = VDSO_CLOCKMODE_TSC;
> -                *tsc_timestamp = read_tsc();
> -                v = (*tsc_timestamp - clock->cycle_last) &
> -                        clock->mask;
> +                v = *tsc_timestamp = read_tsc();
>                  break;
>          default:
>                  *mode = VDSO_CLOCKMODE_NONE;
> @@ -2840,8 +2834,36 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp, int *mode)
>  
>          if (*mode == VDSO_CLOCKMODE_NONE)
>                  *tsc_timestamp = v = 0;
> +        else
> +                v = (v - clock->cycle_last) & clock->mask;
>  
> -        return v * clock->mult;
> +        ns = clock->base_cycles;
> +
> +        /*
> +         * When the clock source is a raw, constant TSC, do the conversion to
> +         * nanoseconds using the paravirt clock math so that the delta in ns is
> +         * consistent regardless of whether the delta is converted in the host
> +         * or the guest.
> +         *
> +         * The base for paravirt clocks is the kernel's base time in ns, plus
> +         * the delta since the last sync.  E.g. if a masterclock update occurs,
> +         * KVM will shift some amount of delta from the guest to the host.
> +         * Conversions from TSC to ns for the host's CLOCK_MONOTONIC_RAW and
> +         * paravirt clocks aren't equivalent, and so shifting the delta can
> +         * cause time to jump from the guest's view of the paravirt clock.
> +         * This only works for a constant TSC, otherwise the calculation would
> +         * only be valid for the current physical CPU, whereas the base of the
> +         * clock must be valid for all vCPUs in the VM.
> +         */
> +        if (static_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +            *mode == VDSO_CLOCKMODE_TSC && clock == &pvclock_gtod_data.raw_clock) {
> +                ns >>= clock->shift;
> +                ns += pvclock_scale_delta(v, host_tsc_to_system_mul, host_tsc_shift);
> +        } else {
> +                ns += v * clock->mult;
> +                ns >>= clock->shift;
> +        }
> +        return ns;
>  }
>  
>  static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> @@ -2853,9 +2875,7 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
>  
>          do {
>                  seq = read_seqcount_begin(&gtod->seq);
> -                ns = gtod->raw_clock.base_cycles;
> -                ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
> -                ns >>= gtod->raw_clock.shift;
> +                ns = vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
>                  ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
>          } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>          *t = ns;
> @@ -2873,9 +2893,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>          do {
>                  seq = read_seqcount_begin(&gtod->seq);
>                  ts->tv_sec = gtod->wall_time_sec;
> -                ns = gtod->clock.base_cycles;
> -                ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> -                ns >>= gtod->clock.shift;
> +                ns = vgettsc(&gtod->clock, tsc_timestamp, &mode);
>          } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  
>          ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
>          if (ret != 0)
>                  return ret;
>  
> +        if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +                kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
> +                                   &host_tsc_shift, &host_tsc_to_system_mul);

I agree with using the kvmclock math to calculate the ns elapsed when updating
the master clock.

Would you take TSC scaling into consideration? While host_tsc_shift and
host_tsc_to_system_mul are pre-computed from the host's TSC frequency, what
about a VM that uses a different TSC frequency?

Thank you very much!

Dongli Zhang

> +
>          local_tsc = rdtsc();
>          stable = !kvm_check_tsc_unstable();
>          list_for_each_entry(kvm, &vm_list, vm_list) {
>
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
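
P.S. To make the scaling question concrete, here is a toy userspace sketch of
the kind of residual mismatch I am worried about. Everything in it is an
assumption for illustration: the host and guest frequencies are made up, the
helpers only mimic pvclock_scale_delta() and kvm_get_time_scale() for the
shift = -1 case, and the hardware scaling of the delta is approximated with a
plain multiply/divide.

/*
 * scaling_demo.c -- toy userspace program, NOT kernel code.
 *
 * Compares the same elapsed time as computed by pvclock-style math driven by
 * the host TSC frequency (what the proposed host_tsc_to_system_mul and
 * host_tsc_shift would encode) versus pvclock-style math driven by a scaled
 * guest TSC frequency (what kvm_guest_time_update() programs into the guest's
 * hv_clock, if I read it right), to show the two roundings do not cancel
 * exactly.
 */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Mimics pvclock_scale_delta() for the shift = -1 case (2-4 GHz clocks). */
static uint64_t pvclock_ns(uint64_t delta, uint32_t mul)
{
        return (uint64_t)(((unsigned __int128)(delta >> 1) * mul) >> 32);
}

/* Mimics the mul that kvm_get_time_scale() derives for a 2-4 GHz clock. */
static uint32_t time_scale_mul(uint64_t hz)
{
        return (uint32_t)(((unsigned __int128)NSEC_PER_SEC << 32) / (hz >> 1));
}

int main(void)
{
        const uint64_t host_hz  = 2999976000ULL;    /* assumed host TSC */
        const uint64_t guest_hz = 2500000000ULL;    /* assumed scaled guest TSC */

        /* One hour of host TSC ticks, and the guest's hardware-scaled view of it. */
        const uint64_t host_delta  = host_hz * 3600ULL;
        const uint64_t guest_delta =
                (uint64_t)(((unsigned __int128)host_delta * guest_hz) / host_hz);

        uint64_t host_ns  = pvclock_ns(host_delta,  time_scale_mul(host_hz));
        uint64_t guest_ns = pvclock_ns(guest_delta, time_scale_mul(guest_hz));

        printf("host-frequency math : %llu ns\n", (unsigned long long)host_ns);
        printf("guest-frequency math: %llu ns\n", (unsigned long long)guest_ns);
        printf("residual mismatch   : %lld ns over 1h\n",
               (long long)(host_ns - guest_ns));
        return 0;
}

With these toy numbers the residual is much smaller than the mixed-domain drift
above, and it may well be negligible in practice; I only want to confirm
whether the pre-computed host constants are intended to cover guests with
scaled TSC frequencies as well.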