Hi David,

On 10/1/23 10:54, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> When populating the guest's PV wall clock information, KVM currently does
> a simple 'kvm_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern
> which should be avoided; when working with the relationship between two
> clocks, it's never correct to obtain one of them "now" and then the other
> at a slightly different "now" after an unspecified period of preemption
> (which might not even be under the control of the kernel, if this is an
> L1 hosting an L2 guest under nested virtualization).
>
> Add a kvm_get_wall_clock_epoch() function to return the guest wall clock
> epoch in nanoseconds using the same method as __get_kvmclock() — by using
> kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM
> clock time from a *single* TSC reading.
>
> The condition using get_cpu_tsc_khz() is equivalent to the version in
> __get_kvmclock() which separately checks for the CONSTANT_TSC feature or
> the per-CPU cpu_tsc_khz. Which is what get_cpu_tsc_khz() does anyway.
>
> Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> ---
> Tested by sticking a printk into it and comparing the values calculated
> by the old and new methods, while running the xen_shinfo_test which
> keeps relocating the shared info and thus rewriting the PV wall clock
> information to it.
>
> They look sane enough but there's still skew over time (in both of
> them) as the KVM values get adjusted in presumably similarly sloppy
> ways. But we'll get to this. This patch is just the first low hanging

About the "skew over time": do you mean that the results of
kvm_get_wall_clock_epoch() keep changing over time? Although I have not
tested it, I suspect there are two reasons:

1. Would you mind explaining what "as the KVM values get adjusted"
means? The kvm_get_walltime_and_clockread() call is based on the host
monotonic clock, which may be adjusted (unlike the raw monotonic clock).

2. The host monotonic clock and the kvmclock may use different
mult/shift values.

The equation is A-B. A is the current host wall clock, while B is how
long the VM has been running. A-B is the wall clock at the time the VM
booted.

A: ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec        --> monotonic clock
B: __pvclock_read_cycles(&hv_clock, host_tsc);  --> raw monotonic and kvmclock

A comes from kvm_get_walltime_and_clockread(), which returns a pair of
ns and tsc. It is based on the monotonic clock, i.e., gtod->clock.shift
and gtod->clock.mult. BTW, the master clock is derived from the raw
monotonic clock, which uses gtod->raw_clock.shift and
gtod->raw_clock.mult. However, the increment between host_tsc and the
master clock is scaled with the mult/shift from the kvmclock (that is,
from kvm_get_time_scale()).

Ideally, we would expect A and B to advance at the same rate. Because
they may use different mult/shift values and conversion equations, A
and B may advance at different rates.
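To illustrate the 2nd reason, here is a minimal user-space sketch (my
own illustration, not code from the patch; the two mult/shift pairs are
made-up values approximating a 3GHz TSC) of how the same TSC delta
scaled by slightly different mult/shift pairs drifts apart:

#include <stdint.h>
#include <stdio.h>

/* Same scaling scheme as pvclock/timekeeping: ns = (delta * mult) >> shift */
static uint64_t scale_delta(uint64_t delta, uint32_t mult, int shift)
{
        return (uint64_t)(((__uint128_t)delta * mult) >> shift);
}

int main(void)
{
        uint64_t tsc_delta = 3000000000ULL;  /* ~1 second at 3GHz */

        /* made-up mult/shift standing in for gtod->clock (host monotonic) */
        uint64_t a = scale_delta(tsc_delta, 5592405, 24);
        /* made-up mult/shift standing in for kvm_get_time_scale() (kvmclock) */
        uint64_t b = scale_delta(tsc_delta, 2796203, 23);

        printf("A: %llu ns, B: %llu ns, skew: %lld ns\n",
               (unsigned long long)a, (unsigned long long)b,
               (long long)(a - b));
        return 0;
}

Both pairs approximate 1/3 ns per TSC cycle, but they round differently,
so the two "clocks" already disagree by roughly 179 ns for every second
of TSC delta. I would expect that kind of residual difference to show up
as the slow skew you observed.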
About the 2nd reason, I have a patch in progress to refresh the master
clock periodically, to address the clock drift during CPU hotplug:

https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@xxxxxxxxxx/

Please correct me if the above understanding is wrong.

Thank you very much!

Dongli Zhang

> fruit in a quest to eliminate such sloppiness and get to the point
> where we can do live update (pause guests, kexec and resume them again)
> with a *single* cycle of skew. After all, the host TSC is still just as
> trustworthy so there's no excuse for *anything* changing over the
> kexec. Every other clock in the guest should have precisely the *same*
> relationship to the host TSC as it did before the kexec.
>
>  arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++++++++++++++------
>  arch/x86/kvm/x86.h |  2 ++
>  arch/x86/kvm/xen.c |  4 ++--
>  3 files changed, 57 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 04b57a336b34..625ec4d9281b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2317,14 +2317,9 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_o
>          if (kvm_write_guest(kvm, wall_clock, &version, sizeof(version)))
>                  return;
>
> -        /*
> -         * The guest calculates current wall clock time by adding
> -         * system time (updated by kvm_guest_time_update below) to the
> -         * wall clock specified here. We do the reverse here.
> -         */
> -        wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
> +        wall_nsec = kvm_get_wall_clock_epoch(kvm);
>
> -        wc.nsec = do_div(wall_nsec, 1000000000);
> +        wc.nsec = do_div(wall_nsec, NSEC_PER_SEC);
>          wc.sec = (u32)wall_nsec; /* overflow in 2106 guest time */
>          wc.version = version;
>
> @@ -3229,6 +3224,57 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>          return 0;
>  }
>
> +uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
> +{
> +        /*
> +         * The guest calculates current wall clock time by adding
> +         * system time (updated by kvm_guest_time_update below) to the
> +         * wall clock specified here. We do the reverse here.
> +         */
> +#ifdef CONFIG_X86_64
> +        struct pvclock_vcpu_time_info hv_clock;
> +        struct kvm_arch *ka = &kvm->arch;
> +        unsigned long seq, local_tsc_khz = 0;
> +        struct timespec64 ts;
> +        uint64_t host_tsc;
> +
> +        do {
> +                seq = read_seqcount_begin(&ka->pvclock_sc);
> +
> +                if (!ka->use_master_clock)
> +                        break;
> +
> +                /* It all has to happen on the same CPU */
> +                get_cpu();
> +
> +                local_tsc_khz = get_cpu_tsc_khz();
> +
> +                if (local_tsc_khz &&
> +                    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
> +                        local_tsc_khz = 0; /* Fall back to old method */
> +
> +                hv_clock.tsc_timestamp = ka->master_cycle_now;
> +                hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> +
> +                put_cpu();
> +        } while (read_seqcount_retry(&ka->pvclock_sc, seq));
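One side note while reading the locking: the loop above is the standard
seqcount reader pattern. Here is a stripped-down skeleton (my sketch,
with a hypothetical snapshot structure, not code from this patch) of
what it guarantees:

#include <linux/seqlock.h>
#include <linux/types.h>

struct clock_snapshot {
        u64 cycle_now;
        u64 kernel_ns;
};

/*
 * Copy a consistent snapshot: if a writer updates *src while we are
 * inside the read section, read_seqcount_retry() sees the sequence
 * count change and we redo the whole copy.
 */
static void read_clock_snapshot(seqcount_t *sc,
                                const struct clock_snapshot *src,
                                struct clock_snapshot *out)
{
        unsigned int seq;

        do {
                seq = read_seqcount_begin(sc);
                *out = *src;
        } while (read_seqcount_retry(sc, seq));
}

In the patch itself, a failed kvm_get_walltime_and_clockread() just
clears local_tsc_khz, so the code below falls back to the old two-read
method after the loop.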
> +
> +        /*
> +         * If the conditions were right, and obtaining the wallclock+TSC was
> +         * successful, calculate the KVM clock at the corresponding time and
> +         * subtract one from the other to get the epoch in nanoseconds.
> +         */
> +        if (local_tsc_khz) {
> +                kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * 1000LL,
> +                                   &hv_clock.tsc_shift,
> +                                   &hv_clock.tsc_to_system_mul);
> +                return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
> +                        __pvclock_read_cycles(&hv_clock, host_tsc);
> +        }
> +#endif
> +        return ktime_get_real_ns() - get_kvmclock_ns(kvm);
> +}
> +
>  /*
>   * kvmclock updates which are isolated to a given vcpu, such as
>   * vcpu->cpu migration, should not allow system_timestamp from
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c544602d07a3..b21743526011 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -290,6 +290,8 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
>          return !(kvm->arch.disabled_quirks & quirk);
>  }
>
> +uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
> +
>  void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
>
>  u64 get_kvmclock_ns(struct kvm *kvm);
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 75586da134b3..6bab715be428 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -59,7 +59,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>           * This code mirrors kvm_write_wall_clock() except that it writes
>           * directly through the pfn cache and doesn't mark the page dirty.
>           */
> -        wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
> +        wall_nsec = kvm_get_wall_clock_epoch(kvm);
>
>          /* It could be invalid again already, so we need to check */
>          read_lock_irq(&gpc->lock);
> @@ -98,7 +98,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>          wc_version = wc->version = (wc->version + 1) | 1;
>          smp_wmb();
>
> -        wc->nsec = do_div(wall_nsec, 1000000000);
> +        wc->nsec = do_div(wall_nsec, NSEC_PER_SEC);
>          wc->sec = (u32)wall_nsec;
>          *wc_sec_hi = wall_nsec >> 32;
>          smp_wmb();
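One last nit-level note, mostly to confirm my own understanding: the
sec/sec_hi/nsec split above is easy to sanity-check in user space. A
quick sketch (my own, with an arbitrary epoch value; the modulo and
division model what do_div() returns and leaves behind):

#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

int main(void)
{
        uint64_t wall_nsec = 1696154040123456789ULL; /* arbitrary epoch in ns */

        /* do_div(wall_nsec, NSEC_PER_SEC) returns the remainder and
         * leaves the quotient in wall_nsec */
        uint32_t nsec = wall_nsec % NSEC_PER_SEC;
        wall_nsec /= NSEC_PER_SEC;

        uint32_t sec = (uint32_t)wall_nsec;  /* low 32 bits, wraps in 2106 */
        uint32_t sec_hi = wall_nsec >> 32;   /* high bits, via *wc_sec_hi */

        printf("sec=%u sec_hi=%u nsec=%u\n", sec, sec_hi, nsec);
        return 0;
}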