Hi Sean,

On 10/16/23 11:49, Sean Christopherson wrote:
> On Mon, Oct 16, 2023, David Woodhouse wrote:
>> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
>>> Suppose we are discussing a non-permanent solution, I would suggest:
>>>
>>> 1. Document something to accept that kvm-clock (or pvclock on KVM, including
>>> Xen on KVM) is not good enough in some cases, e.g., vCPU hotplug.
>>
>> I still don't understand the vCPU hotplug case.
>>
>> In the case where the TSC is actually sane, why would we need to reset
>> the masterclock on vCPU hotplug?
>>
>> The new vCPU gets its TSC synchronised to the others, and its kvmclock
>> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
>> the same as the other vCPUs too, can't they? Why reset anything?
>
> Aha! I think I finally figured out why KVM behaves the way it does.
>
> The unnecessary masterclock updates effectively came from:
>
>   commit 7f187922ddf6b67f2999a76dcb71663097b75497
>   Author: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
>   Date:   Tue Nov 4 21:30:44 2014 -0200
>
>       KVM: x86: update masterclock values on TSC writes
>
>       When the guest writes to the TSC, the masterclock TSC copy must be
>       updated as well along with the TSC_OFFSET update, otherwise a negative
>       tsc_timestamp is calculated at kvm_guest_time_update.
>
>       Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
>       "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
>       becomes redundant, so remove the do_request boolean and collapse
>       everything into a single condition.
>
> Before that, KVM only re-synced the masterclock when the masterclock was being
> enabled or disabled, i.e. KVM behaved as we want it to behave. Note, at the
> time of the above commit, VMX synchronized the TSC on *guest* writes to
> MSR_IA32_TSC:
>
> 	case MSR_IA32_TSC:
> 		kvm_write_tsc(vcpu, msr_info);
> 		break;
>
> That got changed by commit 0c899c25d754 ("KVM: x86: do not attempt TSC
> synchronization on guest writes"), but I don't think the guest angle is
> actually relevant to the fix. AFAICT, a write from the host would suffer the
> same problem. But knowing that KVM synced on guest writes is crucial to
> understanding the changelog.
>
> In kvm_write_tsc(), except for KVM's wonderful "less than 1 second" hack, KVM
> snapshotted the vCPU's current TSC *and* the current time in nanoseconds, where
> kvm->arch.cur_tsc_nsec is the current host kernel time in nanoseconds:
>
> 	ns = get_kernel_ns();
>
> 	...
>
> 	if (usdiff < USEC_PER_SEC &&
> 	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
> 		...
> 	} else {
> 		/*
> 		 * We split periods of matched TSC writes into generations.
> 		 * For each generation, we track the original measured
> 		 * nanosecond time, offset, and write, so if TSCs are in
> 		 * sync, we can match exact offset, and if not, we can match
> 		 * exact software computation in compute_guest_tsc()
> 		 *
> 		 * These values are tracked in kvm->arch.cur_xxx variables.
> 		 */
> 		kvm->arch.cur_tsc_generation++;
> 		kvm->arch.cur_tsc_nsec = ns;
> 		kvm->arch.cur_tsc_write = data;
> 		kvm->arch.cur_tsc_offset = offset;
> 		matched = false;
> 		pr_debug("kvm: new tsc generation %llu, clock %llu\n",
> 			 kvm->arch.cur_tsc_generation, data);
> 	}
>
> 	...
>
> 	/* Keep track of which generation this VCPU has synchronized to */
> 	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
> 	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
> 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
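Thanks for digging this out. Just to double-check that I follow the generation
bookkeeping, here is a stand-alone toy model I put together (my own
simplification, not KVM code; the field names only mirror the kvm->arch.cur_*
variables above):

/*
 * Toy model of the generation bookkeeping in kvm_write_tsc(), just to
 * confirm my reading.  Not KVM code.  Build with: gcc -o toy toy.c
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_arch {
	uint64_t cur_tsc_generation;
	uint64_t cur_tsc_nsec;	/* kernel time snapshotted at the write */
	uint64_t cur_tsc_write;	/* guest TSC value that was written */
};

/*
 * A TSC write that misses the "less than 1 second" hack starts a new
 * generation and snapshots the *current* kernel time, so this_tsc_nsec
 * always postdates any masterclock snapshot taken earlier.
 */
static bool toy_write_tsc(struct toy_arch *ka, uint64_t kernel_ns,
			  uint64_t data, bool within_one_second)
{
	if (within_one_second)
		return true;	/* treated as matched, no new generation */

	ka->cur_tsc_generation++;
	ka->cur_tsc_nsec = kernel_ns;
	ka->cur_tsc_write = data;
	return false;		/* the first write never matches itself */
}

int main(void)
{
	struct toy_arch ka = { 0 };
	bool matched = toy_write_tsc(&ka, 1000, 0, false);

	printf("generation=%llu nsec=%llu matched=%d\n",
	       (unsigned long long)ka.cur_tsc_generation,
	       (unsigned long long)ka.cur_tsc_nsec, matched);
	return 0;
}

i.e., every unmatched write records the kernel time *at write time*, which is
exactly the value compute_guest_tsc() later subtracts.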
> Note that the above sets matched to false! But because kvm_track_tsc_matching()
> looks for matched+1, i.e. doesn't require the first vCPU to match itself, KVM
> would immediately compute vcpus_matched as true for VMs with a single vCPU. As
> a result, KVM would skip the masterclock update, even though a new TSC
> generation was created.
>
> 	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> 			 atomic_read(&vcpu->kvm->online_vcpus));
>
> 	if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
> 		if (!ka->use_master_clock)
> 			do_request = 1;
>
> 	if (!vcpus_matched && ka->use_master_clock)
> 		do_request = 1;
>
> 	if (do_request)
> 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>
> On hardware without TSC scaling support, vcpu->tsc_catchup is set to true if
> the guest TSC frequency is faster than the host TSC frequency, even if the TSC
> is otherwise stable. And for that mode, kvm_guest_time_update(), by way of
> compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the kernel time at
> the last TSC write, to compute the guest TSC relative to kernel time:
>
> 	static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
> 	{
> 		u64 tsc = pvclock_scale_delta(kernel_ns - vcpu->arch.this_tsc_nsec,
> 					      vcpu->arch.virtual_tsc_mult,
> 					      vcpu->arch.virtual_tsc_shift);
> 		tsc += vcpu->arch.this_tsc_write;
> 		return tsc;
> 	}
>
> Except the @kernel_ns passed to compute_guest_tsc() isn't the current kernel
> time, it's the masterclock snapshot!
>
> 	spin_lock(&ka->pvclock_gtod_sync_lock);
> 	use_master_clock = ka->use_master_clock;
> 	if (use_master_clock) {
> 		host_tsc = ka->master_cycle_now;
> 		kernel_ns = ka->master_kernel_ns;
> 	}
> 	spin_unlock(&ka->pvclock_gtod_sync_lock);
>
> 	if (vcpu->tsc_catchup) {
> 		u64 tsc = compute_guest_tsc(v, kernel_ns);
>
> 		if (tsc > tsc_timestamp) {
> 			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
> 			tsc_timestamp = tsc;
> 		}
> 	}
>
> And so "kernel_ns - vcpu->arch.this_tsc_nsec" is *guaranteed* to generate a
> negative value, because this_tsc_nsec was captured after ka->master_kernel_ns.
>
> Forcing a masterclock update essentially fudged around that problem, but in a
> heavy-handed way that introduced undesirable side effects, i.e. it
> unnecessarily forces a masterclock update when a new vCPU joins the party via
> hotplug.
>
> Compile tested only, but the below should fix the vCPU hotplug case. Then
> someone (not me) just needs to figure out why kvm_xen_shared_info_init()
> forces a masterclock update.
>
> I still think we should clean up the periodic sync code, but I don't think we
> need to periodically sync the masterclock.

This looks good to me. The core idea is to not update the masterclock for the
already-synchronized cases.

How about the negative value case? I see it is still there in the Linux code.
(It is out of scope for my use case, since I do not need to run vCPUs at a TSC
frequency different from the host's.)
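To make the wrap-around concrete for myself, I wrote a minimal stand-alone
reproduction (my own simplification, not KVM code; it assumes guest and host
TSC frequencies are equal, so the pvclock_scale_delta() mul/shift step
degenerates to a plain subtraction):

/* Toy reproduction of the stale-snapshot wrap-around, not KVM code. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t toy_compute_guest_tsc(uint64_t kernel_ns,
				      uint64_t this_tsc_nsec,
				      uint64_t this_tsc_write)
{
	/* u64 arithmetic wraps: 1000 - 2000 == 2^64 - 1000. */
	return (kernel_ns - this_tsc_nsec) + this_tsc_write;
}

int main(void)
{
	uint64_t master_kernel_ns = 1000;	/* masterclock snapshot ... */
	uint64_t this_tsc_nsec = 2000;		/* ... taken before the TSC write */

	printf("guest tsc = %" PRIu64 "\n",
	       toy_compute_guest_tsc(master_kernel_ns, this_tsc_nsec, 0));
	return 0;
}

With master_kernel_ns older than this_tsc_nsec, the delta wraps to nearly
2^64, i.e. the "negative tsc_timestamp" described in the 2014 changelog.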
Thank you very much!

Dongli Zhang

> ---
>  arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
>  1 file changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c54e1133e0d3..f0a607b6fc31 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
>  }
>  #endif
>
> -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
> +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
>  {
>  #ifdef CONFIG_X86_64
> -	bool vcpus_matched;
>  	struct kvm_arch *ka = &vcpu->kvm->arch;
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>
> -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> -			 atomic_read(&vcpu->kvm->online_vcpus));
> +	/*
> +	 * To use the masterclock, the host clocksource must be based on TSC
> +	 * and all vCPUs must have matching TSCs.  Note, the count for matching
> +	 * vCPUs doesn't include the reference vCPU, hence "+1".
> +	 */
> +	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
> +				 atomic_read(&vcpu->kvm->online_vcpus)) &&
> +				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
>
>  	/*
> -	 * Once the masterclock is enabled, always perform request in
> -	 * order to update it.
> -	 *
> -	 * In order to enable masterclock, the host clocksource must be TSC
> -	 * and the vcpus need to have matched TSCs.  When that happens,
> -	 * perform request to enable masterclock.
> +	 * Request a masterclock update if the masterclock needs to be toggled
> +	 * on/off, or when starting a new generation and the masterclock is
> +	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
> +	 * taken _after_ the new generation is created).
>  	 */
> -	if (ka->use_master_clock ||
> -	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
> +	if ((ka->use_master_clock && new_generation) ||
> +	    (ka->use_master_clock != use_master_clock))
>  		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>
>  	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
> @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
>  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>
> -	kvm_track_tsc_matching(vcpu);
> +	kvm_track_tsc_matching(vcpu, !matched);
>  }
>
>  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
>
> base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12
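By the way, to double-check when the new logic requests
KVM_REQ_MASTERCLOCK_UPDATE, I enumerated the condition in a stand-alone sketch
(my own illustration; it only replicates the boolean expression from the
patch):

/* Not part of the patch: just replicates the new request condition. */
#include <stdbool.h>
#include <stdio.h>

static bool update_requested(bool cur_on, bool next_on, bool new_gen)
{
	/*
	 * Mirrors: (ka->use_master_clock && new_generation) ||
	 *          (ka->use_master_clock != use_master_clock)
	 */
	return (cur_on && new_gen) || (cur_on != next_on);
}

int main(void)
{
	printf("cur next new_gen -> update\n");
	for (int cur = 0; cur <= 1; cur++)
		for (int next = 0; next <= 1; next++)
			for (int gen = 0; gen <= 1; gen++)
				printf("  %d    %d      %d    ->   %d\n",
				       cur, next, gen,
				       update_requested(cur, next, gen));
	return 0;
}

This confirms the hotplug case: with the masterclock already enabled, the new
vCPU's TSC matched (so no new generation), and the clocksource unchanged, no
update is requested anymore.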