On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
>
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
>
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
>
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
>
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
>
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
>
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
>
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.

This should be a separate patch. Actually, patches, plural. More below.

> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
>
> Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> ---

...

> @@ -3100,36 +3131,49 @@ static unsigned long get_cpu_tsc_khz(void)
>  static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  {
>  	struct kvm_arch *ka = &kvm->arch;
> -	struct pvclock_vcpu_time_info hv_clock;
> +
> +#ifdef CONFIG_X86_64
> +	uint64_t cur_tsc_khz = 0;
> +	struct timespec64 ts;
>
>  	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
>  	get_cpu();
>
> -	data->flags = 0;
>  	if (ka->use_master_clock &&
> -	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> -#ifdef CONFIG_X86_64
> -		struct timespec64 ts;
> +	    (cur_tsc_khz = get_cpu_tsc_khz()) &&

That is mean. And if you push it inside the if-statement, the
{get,put}_cpu() can be avoided when the master clock isn't being used,
e.g.:

	if (ka->use_master_clock) {
		/*
		 * The RDTSC needs to happen on the same CPU whose frequency is
		 * used to compute kvmclock's time.
		 */
		get_cpu();

		cur_tsc_khz = get_cpu_tsc_khz();
		if (cur_tsc_khz &&
		    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
			cur_tsc_khz = 0;

		put_cpu();
	}
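As an aside, for anyone following along without the SDM handy: the
scaling the changelog is talking about is just a 64x64 fixed-point
multiply with a 128-bit intermediate. A userspace sketch (my names, not
KVM's; the kernel does this via kvm_scale_tsc() / mul_u64_u64_shr(),
with, if I remember right, 48 fractional bits on Intel and 32 on AMD):

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative stand-in for kvm_scale_tsc(): multiply a cycle count by
 * a fixed-point ratio, keeping the high bits of the 128-bit product,
 * the way mul_u64_u64_shr() does in the kernel.
 */
static uint64_t scale_tsc(uint64_t cycles, uint64_t ratio,
			  unsigned int frac_bits)
{
	return (uint64_t)(((unsigned __int128)cycles * ratio) >> frac_bits);
}

int main(void)
{
	/* A guest TSC running at 1.5x the host's, in 16.48 fixed point. */
	uint64_t ratio = 3ull << 47;

	/* Prints 1500000: one million host cycles becomes 1.5M guest cycles. */
	printf("%llu\n", (unsigned long long)scale_tsc(1000000, ratio, 48));
	return 0;
}

Computing kvmclock from unscaled host cycles skips that multiply, which
is exactly the systemic drift the changelog describes.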
However, the changelog essentially claims that
kvm_get_walltime_and_clockread() should never fail when use_master_clock
is enabled, which suggests a WARN is warranted:

  There was previously a code path in __get_kvmclock() which looked like
  it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
  even on 32-bit hosts. In practice that could never happen as the
  ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
  hosts it would never be set when the system clock isn't TSC-based. So
  that code path is now removed.

But, I think kvm_get_walltime_and_clockread() can fail when
use_master_clock is true, i.e. I don't think a WARN is viable as it
could get false positives.

Ah, this is protected by pvclock_sc, so a stale use_master_clock should
result in a retry. What if we WARN on that? Hrm, that requires plumbing
in the original sequence count.

Ah, but looking at the patch as a whole, if we keep
kvm_get_wall_clock_epoch()'s style, then it's much easier. And FWIW, I
like the existing kvm_get_wall_clock_epoch() style a lot more than the
get_kvmclock() => __get_kvmclock() approach.

So, can we do this as prep patch #1?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9c14d0f5a684..98806a59e110 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3360,9 +3360,16 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 
 		local_tsc_khz = get_cpu_tsc_khz();
 
+		/*
+		 * The master clock depends on the pvclock being based on TSC,
+		 * so the only way kvm_get_walltime_and_clockread() can fail is
+		 * if the clocksource changed and use_master_clock is stale, in
+		 * which case a seqcount retry should be pending.
+		 */
 		if (local_tsc_khz &&
-		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-			local_tsc_khz = 0; /* Fall back to old method */
+		    !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
+		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
+			local_tsc_khz = 0; /* Fall back to old method */
 
 		put_cpu();

And then as patch(es) 2..7 (give or take):

 (2) fold __get_kvmclock() into get_kvmclock()
 (3) add the same WARN on the seqcount in get_kvmclock() (but skimp on
     the comments)
 (4) use get_kvmclock_base_ns() as the fallback in get_kvmclock(), i.e.
     delete the raw rdtsc() and setting of KVM_CLOCK_TSC_STABLE w/o
     KVM_CLOCK_REALTIME
 (5) use get_cpu_tsc_khz() instead of open coding something similar
 (6) scale TSC when computing kvmclock (the core of this patch)
 (7) use get_kvmclock() in kvm_get_wall_clock_epoch(), as they will be
     100% equivalent at this point
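To make the "stale use_master_clock results in a retry" argument
concrete, here's the reader-side pattern pvclock_sc relies on, as a
self-contained userspace toy (hand-rolled C11 atomics, not the kernel's
seqcount_t, and the variable names are mine):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Toy seqcount mimicking read_seqcount_begin()/read_seqcount_retry(). */
static atomic_uint sc;
static int use_master_clock;
static uint64_t master_kernel_ns;

static unsigned int read_begin(void)
{
	unsigned int seq;

	/* An odd count means a writer is mid-update; wait it out. */
	while ((seq = atomic_load_explicit(&sc, memory_order_acquire)) & 1)
		;
	return seq;
}

static int read_retry(unsigned int seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&sc, memory_order_relaxed) != seq;
}

int main(void)
{
	unsigned int seq;
	uint64_t ns;
	int stable;

	/*
	 * If the clocksource changes, the writer bumps the count, so a
	 * snapshot taken with a stale use_master_clock is discarded and
	 * retaken here rather than returned, which is what makes the
	 * WARN in the diff above (hopefully) free of false positives.
	 */
	do {
		seq = read_begin();
		stable = use_master_clock;
		ns = master_kernel_ns;
	} while (read_retry(seq));

	printf("stable=%d ns=%llu\n", stable, (unsigned long long)ns);
	return 0;
}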
> +	    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
> +		cur_tsc_khz = 0;
>
> -		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> -			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> -			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> -		} else
> -#endif
> -			data->host_tsc = rdtsc();
> -
> -		data->flags |= KVM_CLOCK_TSC_STABLE;
> -		hv_clock.tsc_timestamp = ka->master_cycle_now;
> -		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> -		kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
> -				   &hv_clock.tsc_shift,
> -				   &hv_clock.tsc_to_system_mul);
> -		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> -	} else {
> -		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	put_cpu();
> +
> +	if (cur_tsc_khz) {
> +		uint64_t tsc_cycles;
> +		uint32_t mul;
> +		int8_t shift;
> +
> +		tsc_cycles = data->host_tsc - ka->master_cycle_now;
> +
> +		if (kvm_caps.has_tsc_control)
> +			tsc_cycles = kvm_scale_tsc(tsc_cycles,
> +						   ka->master_tsc_scaling_ratio);
> +
> +		if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +			mul = ka->master_tsc_mul;
> +			shift = ka->master_tsc_shift;
> +		} else {
> +			kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
> +					   &shift, &mul);
> +		}
> +		data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
> +			      pvclock_scale_delta(tsc_cycles, mul, shift);
> +		data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> +		data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
> +		return;
> +	}
> +#endif
>
> -	put_cpu();
> +	data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	data->flags = 0;
>  }
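And for completeness, the cycles-to-nanoseconds step at the bottom of
the hunk is the standard pvclock conversion: pre-shift the delta, then
take the top 64 bits of the product with a 32-bit fixed-point fraction.
A standalone sketch of pvclock_scale_delta()'s math (the mul/shift
values below are hand-picked to approximate a 3 GHz TSC, not what
kvm_get_time_scale() would actually emit):

#include <stdint.h>
#include <stdio.h>

/*
 * Userspace rendition of pvclock's cycles->ns conversion: shift the
 * cycle delta by 'shift' (right if negative), then multiply by a 0.32
 * fixed-point fraction and keep the upper 64 bits.
 */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

int main(void)
{
	/* Roughly 1/3 in 0.32 fixed point, i.e. ns per cycle at ~3 GHz. */
	uint32_t mul = 0x55555555;
	int shift = 0;

	/* 3e9 cycles at 3 GHz should print ~1000000000 ns (one second). */
	printf("%llu ns\n",
	       (unsigned long long)scale_delta(3000000000ull, mul, shift));
	return 0;
}

The point of caching ka->master_tsc_mul/shift in the CONSTANT_TSC case
is simply that this pair only changes when the master clock is updated,
so there's no need to rederive it on every read.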