On 2/25/22 02:39, Sean Christopherson wrote:
Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the host TSC is constant, in which case the actual TSC frequency will never change and thus capturing the "max" TSC during initialization is unnecessary, KVM can simply use tsc_khz during VM creation. On CPUs with constant TSC, but not a hardware-specified TSC frequency, snapshotting max_tsc_khz and using that to set a VM's default TSC frequency can lead to KVM thinking it needs to manually scale the guest's TSC if refining the TSC completes after KVM snapshots tsc_khz. The actual frequency never changes, only the kernel's calculation of what that frequency is changes. On systems without hardware TSC scaling, this either puts KVM into "always catchup" mode (extremely inefficient), or prevents creating VMs altogether. Ideally, KVM would not be able to race with TSC refinement, or would have a hook into tsc_refine_calibration_work() to get an alert when refinement is complete. Avoiding the race altogether isn't practical as refinement takes a relative eternity; it's deliberately put on a work queue outside of the normal boot sequence to avoid unnecessarily delaying boot. Adding a hook is doable, but somewhat gross due to KVM's ability to be built as a module. And if the TSC is constant, which is likely the case for every VMX/SVM-capable CPU produced in the last decade, the race can be hit if and only if userspace is able to create a VM before TSC refinement completes; refinement is slow, but not that slow. For now, punt on a proper fix, as not taking a snapshot can help some uses cases and not taking a snapshot is arguably correct irrespective of the race with refinement. Cc: Suleiman Souhlal <suleiman@xxxxxxxxxx> Cc: Anton Romanov <romanton@xxxxxxxxxx> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
Queued, but I'd rather have a subject that calls out that max_tsc_khz needs a replacement at vCPU creation time. In fact, the real change (and bug, and fix) is in kvm_arch_vcpu_create(), while the subject mentions only the change in kvm_timer_init().
What do you think of "KVM: x86: Use current rather than max TSC frequency if it is constant"?
Pao