On Wed, May 11, 2022, Anton Romanov wrote: > Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is > constant, in which case the actual TSC frequency will never change and thus > capturing TSC during initialization is unnecessary, KVM can simply use > tsc_khz. This value is snapshotted from > kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL) > > On CPUs with constant TSC, but not a hardware-specified TSC frequency, > snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency > can lead to VM to think its TSC frequency is not what it actually is if > refining the TSC completes after KVM snapshots tsc_khz. The actual > frequency never changes, only the kernel's calculation of what that > frequency is changes. > > Ideally, KVM would not be able to race with TSC refinement, or would have > a hook into tsc_refine_calibration_work() to get an alert when refinement > is complete. Avoiding the race altogether isn't practical as refinement > takes a relative eternity; it's deliberately put on a work queue outside of > the normal boot sequence to avoid unnecessarily delaying boot. > > Adding a hook is doable, but somewhat gross due to KVM's ability to be > built as a module. And if the TSC is constant, which is likely the case > for every VMX/SVM-capable CPU produced in the last decade, the race can be > hit if and only if userspace is able to create a VM before TSC refinement > completes; refinement is slow, but not that slow. > > For now, punt on a proper fix, as not taking a snapshot can help some uses > cases and not taking a snapshot is arguably correct irrespective of the > race with refinement. > > Signed-off-by: Anton Romanov <romanton@xxxxxxxxxx> > --- Reviewed-by: Sean Christopherson <seanjc@xxxxxxxxxx> > @@ -8807,10 +8828,10 @@ static void kvm_timer_init(void) > #endif > cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block, > CPUFREQ_TRANSITION_NOTIFIER); > - } > > - cpuhp_setup_state(CPUHP_AP_X86_KVM_CLK_ONLINE, "x86/kvm/clk:online", > - kvmclock_cpu_online, kvmclock_cpu_down_prep); > + cpuhp_setup_state(CPUHP_AP_X86_KVM_CLK_ONLINE, "x86/kvm/clk:online", > + kvmclock_cpu_online, kvmclock_cpu_down_prep); > + } > } One final thought, it might be easier for readers if kvm_timer_init() became: static void kvm_timer_init(void) { if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) return; max_tsc_khz = tsc_khz; if (IS_ENABLED(CONFIG_CPU_FREQ)) { struct cpufreq_policy *policy; int cpu; cpu = get_cpu(); policy = cpufreq_cpu_get(cpu); if (policy) { if (policy->cpuinfo.max_freq) max_tsc_khz = policy->cpuinfo.max_freq; cpufreq_cpu_put(policy); } put_cpu(); } cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block, CPUFREQ_TRANSITION_NOTIFIER); cpuhp_setup_state(CPUHP_AP_X86_KVM_CLK_ONLINE, "x86/kvm/clk:online", kvmclock_cpu_online, kvmclock_cpu_down_prep); } I think I have a slight preference for the above? Either way is totally fine. Maybe wait for Paolo to weigh in, or even let Paolo do it as fixup?