Re: [PATCH v5] KVM: x86/tsc: Don't sync TSC on the first write in state restoration

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Wed, 13 Sep 2023 11:51:46 +0200

On 13 September 2023 11:43:56 CEST, Like Xu <like.xu.linux@xxxxxxxxx> wrote:

>> Why? Can't we treat an explicit zero write just the same as when the kernel does it?
>
>Not sure if it meets your simplified expectations:

Think that looks good, thanks. One minor nit...

>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 6c9c81e82e65..0f05cf90d636 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -2735,20 +2735,35 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> 			 * kvm_clock stable after CPU hotplug
> 			 */
> 			synchronizing = true;
>-		} else {
>+		} else if (!data || kvm->arch.user_set_tsc) {

If data is zero here, won't the first if() case have been taken, and set synchronizing=true?

So this is equivalent to "else if (kvm->arch.user_set_tsc)". (Which is fine and what I intended).
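
To spell the nit out, here is a minimal userspace sketch of the branch selection, not the kernel code itself. It assumes the first test in the chain is "data == 0" (the hunk above does not show that line) and it leaves out the surrounding virtual_tsc_khz check. The enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for which arm of the if/else chain is taken. */
enum tsc_branch { BRANCH_ZERO_WRITE, BRANCH_SLOP_CHECK, BRANCH_NONE };

/* Condition as posted in v5: "else if (!data || user_set_tsc)". */
static enum tsc_branch branch_as_posted(uint64_t data, bool user_set_tsc)
{
	if (data == 0)			/* assumed first case: synchronizing = true */
		return BRANCH_ZERO_WRITE;
	else if (!data || user_set_tsc)	/* !data is always false here */
		return BRANCH_SLOP_CHECK;
	return BRANCH_NONE;
}

/* Simplified condition: "else if (user_set_tsc)". */
static enum tsc_branch branch_simplified(uint64_t data, bool user_set_tsc)
{
	if (data == 0)
		return BRANCH_ZERO_WRITE;
	else if (user_set_tsc)
		return BRANCH_SLOP_CHECK;
	return BRANCH_NONE;
}

/* Compare both forms over a few sample values and both flag states. */
static void check_equivalence(void)
{
	const uint64_t samples[] = { 0, 1, 1000, UINT64_MAX };

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		assert(branch_as_posted(samples[i], false) ==
		       branch_simplified(samples[i], false));
		assert(branch_as_posted(samples[i], true) ==
		       branch_simplified(samples[i], true));
	}
}
```

Calling check_equivalence() from a test harness passes for every sample, since a zero write never reaches the second condition.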

> 			u64 tsc_exp = kvm->arch.last_tsc_write +
> 						nsec_to_cycles(vcpu, elapsed);
> 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
> 			/*
>-			 * Special case: TSC write with a small delta (1 second)
>-			 * of virtual cycle time against real time is
>-			 * interpreted as an attempt to synchronize the CPU.
>+			 * Here lies UAPI baggage: when a user-initiated TSC write has
>+			 * a small delta (1 second) of virtual cycle time against the
>+			 * previously set vCPU, we assume that they were intended to be
>+			 * in sync and the delta was only due to the racy nature of the
>+			 * legacy API.
>+			 *
>+			 * This trick falls down when restoring a guest which genuinely
>+			 * has been running for less time than the 1 second of imprecision
>+			 * which we allow for in the legacy API. In this case, the first
>+			 * value written by userspace (on any vCPU) should not be subject
>+			 * to this 'correction' to make it sync up with values that only
>+			 * come from the kernel's default vCPU creation. Make the 1-second
>+			 * slop hack only trigger if the user_set_tsc flag is already set.
>+			 *
>+			 * The correct answer is for the VMM not to use the legacy API.
> 			 */
> 			synchronizing = data < tsc_exp + tsc_hz &&
> 					data + tsc_hz > tsc_exp;
> 		}
> 	}
>
>+	if (data)
>+		kvm->arch.user_set_tsc = true;
>+
> 	/*
> 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
> 	 * TSC, we add elapsed time in this computation.  We could let the
>@@ -5536,6 +5551,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
> 		tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
> 		ns = get_kvmclock_base_ns();
>
>+		kvm->arch.user_set_tsc = true;
> 		__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
> 		raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>
>
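
For anyone following along, the intended behaviour can be modelled in userspace roughly as below. This is a sketch under stated assumptions, not the kernel implementation: struct tsc_model and the function names are hypothetical, elapsed time is passed in already converted to cycles, the zero-write case is assumed to be the first branch, and the model records every write as last_tsc_write. Only the window comparison and the "if (data) user_set_tsc = true" step are taken directly from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few kvm->arch fields involved. */
struct tsc_model {
	bool user_set_tsc;	/* set once userspace writes a non-zero TSC */
	uint64_t last_tsc_write;
	uint64_t tsc_hz;	/* cycles per second of the virtual TSC */
};

/* Same comparison as in the patch: is data within 1 second of tsc_exp? */
static bool within_one_second(uint64_t data, uint64_t tsc_exp, uint64_t tsc_hz)
{
	return data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
}

/* Returns the value 'synchronizing' would take for this write. */
static bool model_tsc_write(struct tsc_model *m, uint64_t data,
			    uint64_t elapsed_cycles)
{
	bool synchronizing = false;

	if (data == 0) {
		/* Kernel default at vCPU creation (assumed first case). */
		synchronizing = true;
	} else if (m->user_set_tsc) {
		uint64_t tsc_exp = m->last_tsc_write + elapsed_cycles;

		synchronizing = within_one_second(data, tsc_exp, m->tsc_hz);
	}

	if (data)
		m->user_set_tsc = true;

	m->last_tsc_write = data;	/* modelling assumption */
	return synchronizing;
}

/* Restore of a guest that has run for only half a second. */
static void run_restore_scenario(void)
{
	struct tsc_model m = { false, 0, UINT64_C(1000000000) };

	/* vCPU creation: the kernel writes zero. */
	assert(model_tsc_write(&m, 0, 0));
	/* First userspace write: inside the 1s window, but not 'corrected'. */
	assert(!model_tsc_write(&m, UINT64_C(500000000), 0));
	/* Second vCPU, slightly racy write: now treated as a sync attempt. */
	assert(model_tsc_write(&m, UINT64_C(500000100), 1000));
	/* A write far outside the window is never a sync attempt. */
	assert(!model_tsc_write(&m, UINT64_C(5000000000), 0));
}
```

The second assertion is the case the patch is about: without the user_set_tsc gate, that first restored value would fall inside the window and be pulled into sync with the kernel's creation-time zero.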