On 13 September 2023 11:43:56 CEST, Like Xu <like.xu.linux@xxxxxxxxx> wrote: >> Why? Can't we treat an explicit zero write just the same as when the kernel does it? > >Not sure if it meets your simplified expectations: Think that looks good, thanks. One minor nit... >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >index 6c9c81e82e65..0f05cf90d636 100644 >--- a/arch/x86/kvm/x86.c >+++ b/arch/x86/kvm/x86.c >@@ -2735,20 +2735,35 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) > * kvm_clock stable after CPU hotplug > */ > synchronizing = true; >- } else { >+ } else if (!data || kvm->arch.user_set_tsc) { If data is zero here, won't the first if() case have been taken, and set synchronizing=true? So this is equivalent to "else if (kvm->arch.user_set_tsc)". (Which is fine and what what I intended). > u64 tsc_exp = kvm->arch.last_tsc_write + > nsec_to_cycles(vcpu, elapsed); > u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL; > /* >- * Special case: TSC write with a small delta (1 second) >- * of virtual cycle time against real time is >- * interpreted as an attempt to synchronize the CPU. >+ * Here lies UAPI baggage: when a user-initiated TSC write has >+ * a small delta (1 second) of virtual cycle time against the >+ * previously set vCPU, we assume that they were intended to be >+ * in sync and the delta was only due to the racy nature of the >+ * legacy API. >+ * >+ * This trick falls down when restoring a guest which genuinely >+ * has been running for less time than the 1 second of imprecision >+ * which we allow for in the legacy API. In this case, the first >+ * value written by userspace (on any vCPU) should not be subject >+ * to this 'correction' to make it sync up with values that only >+ * from from the kernel's default vCPU creation. Make the 1-second >+ * slop hack only trigger if flag is already set. >+ * >+ * The correct answer is for the VMM not to use the legacy API. > */ > synchronizing = data < tsc_exp + tsc_hz && > data + tsc_hz > tsc_exp; > } > } > >+ if (data) >+ kvm->arch.user_set_tsc = true; >+ > /* > * For a reliable TSC, we can match TSC offsets, and for an unstable > * TSC, we add elapsed time in this computation. We could let the >@@ -5536,6 +5551,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu, > tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset; > ns = get_kvmclock_base_ns(); > >+ kvm->arch.user_set_tsc = true; > __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched); > raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags); > >