On Wed, 2023-10-18 at 12:56 -0700, Sean Christopherson wrote:
> Don't force a masterclock update when a vCPU synchronizes to the current
> TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the
> VM.  Unnecessarily updating the masterclock is undesirable as it can cause
> kvmclock's time to jump, which is particularly painful on systems with a
> stable TSC as kvmclock _should_ be fully reliable on such systems.
>
> The unexpected time jumps are due to differences in the TSC=>nanoseconds
> conversion algorithms between kvmclock and the host's CLOCK_MONOTONIC_RAW
> (the pvclock algorithm is inherently lossy).  When updating the
> masterclock, KVM refreshes the "base", i.e. moves the elapsed time since
> the last update from the kvmclock/pvclock algorithm to the
> CLOCK_MONOTONIC_RAW algorithm.  Synchronizing kvmclock with
> CLOCK_MONOTONIC_RAW is the lesser of evils when the TSC is unstable, but
> adds no real value when the TSC is stable.
>
> Prior to commit 7f187922ddf6 ("KVM: x86: update masterclock values on TSC
> writes"), KVM did NOT force an update when synchronizing a vCPU to the
> current generation.
>
>   commit 7f187922ddf6b67f2999a76dcb71663097b75497
>   Author: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
>   Date:   Tue Nov 4 21:30:44 2014 -0200
>
>     KVM: x86: update masterclock values on TSC writes
>
>     When the guest writes to the TSC, the masterclock TSC copy must be
>     updated as well along with the TSC_OFFSET update, otherwise a negative
>     tsc_timestamp is calculated at kvm_guest_time_update.
>
>     Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
>     "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
>     becomes redundant, so remove the do_request boolean and collapse
>     everything into a single condition.
>
> Before that, KVM only re-synced the masterclock if the masterclock was
> enabled or disabled.  Note, at the time of the above commit, VMX
> synchronized TSC on *guest* writes to MSR_IA32_TSC:
>
>         case MSR_IA32_TSC:
>                 kvm_write_tsc(vcpu, msr_info);
>                 break;
>
> which is why the changelog specifically says "guest writes", but the bug
> that was being fixed wasn't unique to guest writes, i.e. a TSC write from
> the host would suffer the same problem.
>
> So even though KVM stopped synchronizing on guest writes as of commit
> 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization on guest
> writes"), simply reverting commit 7f187922ddf6 is not an option.  Figuring
> out how a negative tsc_timestamp could be computed requires a bit more
> sleuthing.
>
> In kvm_write_tsc() (at the time), except for KVM's "less than 1 second"
> hack, KVM snapshotted the vCPU's current TSC *and* the current time in
> nanoseconds, where kvm->arch.cur_tsc_nsec is the current host kernel time
> in nanoseconds:
>
>         ns = get_kernel_ns();
>
>         ...
>
>         if (usdiff < USEC_PER_SEC &&
>             vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
>                 ...
>         } else {
>                 /*
>                  * We split periods of matched TSC writes into generations.
>                  * For each generation, we track the original measured
>                  * nanosecond time, offset, and write, so if TSCs are in
>                  * sync, we can match exact offset, and if not, we can match
>                  * exact software computation in compute_guest_tsc()
>                  *
>                  * These values are tracked in kvm->arch.cur_xxx variables.
>                  */
>                 kvm->arch.cur_tsc_generation++;
>                 kvm->arch.cur_tsc_nsec = ns;
>                 kvm->arch.cur_tsc_write = data;
>                 kvm->arch.cur_tsc_offset = offset;
>                 matched = false;
>                 pr_debug("kvm: new tsc generation %llu, clock %llu\n",
>                          kvm->arch.cur_tsc_generation, data);
>         }
>
>         ...
>
>         /* Keep track of which generation this VCPU has synchronized to */
>         vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
>         vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
>         vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>
> Note that the above creates a new generation and sets "matched" to false!
> But because kvm_track_tsc_matching() looks for matched+1, i.e. doesn't
> require the vCPU that creates the new generation to match itself, KVM
> would immediately compute vcpus_matched as true for VMs with a single
> vCPU.  As a result, KVM would skip the masterclock update, even though a
> new TSC generation was created:
>
>         vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
>                          atomic_read(&vcpu->kvm->online_vcpus));
>
>         if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
>                 if (!ka->use_master_clock)
>                         do_request = 1;
>
>         if (!vcpus_matched && ka->use_master_clock)
>                 do_request = 1;
>
>         if (do_request)
>                 kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>
> On hardware without TSC scaling support, vcpu->tsc_catchup is set to true
> if the guest TSC frequency is faster than the host TSC frequency, even if
> the TSC is otherwise stable.  And for that mode, kvm_guest_time_update(),
> by way of compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the
> kernel time at the last TSC write, to compute the guest TSC relative to
> kernel time:
>
>         static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
>         {
>                 u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
>                                               vcpu->arch.virtual_tsc_mult,
>                                               vcpu->arch.virtual_tsc_shift);
>                 tsc += vcpu->arch.this_tsc_write;
>                 return tsc;
>         }
>
> Except the "kernel_ns" passed to compute_guest_tsc() isn't the current
> kernel time, it's the masterclock snapshot!
>
>         spin_lock(&ka->pvclock_gtod_sync_lock);
>         use_master_clock = ka->use_master_clock;
>         if (use_master_clock) {
>                 host_tsc = ka->master_cycle_now;
>                 kernel_ns = ka->master_kernel_ns;
>         }
>         spin_unlock(&ka->pvclock_gtod_sync_lock);
>
>         if (vcpu->tsc_catchup) {
>                 u64 tsc = compute_guest_tsc(v, kernel_ns);
>                 if (tsc > tsc_timestamp) {
>                         adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
>                         tsc_timestamp = tsc;
>                 }
>         }
>
> And so when KVM skips the masterclock update after a TSC write, i.e. after
> a new TSC generation is started, the "kernel_ns-vcpu->arch.this_tsc_nsec"
> is *guaranteed* to generate a negative value, because this_tsc_nsec was
> captured after ka->master_kernel_ns.

So what? It *should* be negative, shouldn't it? I think the problem is
how we're using that value, and what we're conflating it with.

Let us consider the case where ka->use_master_clock is true, but we're
manually upscaling the TSC in software so vcpu->tsc_catchup is also
true. Let us postpone, for the moment, the question of whether we
should even *let* use_master_clock become true in that case.

There are a number of points in time which need to be considered:

 • vcpu->arch.this_tsc_nsec
 • kvm->arch.master_kernel_ns
 • The point in time "now" at which kvm_guest_time_update() is called.

For any given point in time, compute_guest_tsc() should calculate the
guest TSC at that moment, by scaling the elapsed nanoseconds since
vcpu->arch.this_tsc_nsec to the guest TSC frequency and adding that to
vcpu->arch.this_tsc_write.
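That scaling is just a 32.32 fixed-point multiply. For reference, here
is a rough userspace model of pvclock_scale_delta() (a sketch, not the
kernel code; the __int128 stands in for the kernel's 64x32 multiply
helpers):

static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
        /*
         * Scale a nanosecond delta to guest TSC cycles via a 32.32
         * fixed-point multiplier.  Note the *unsigned* delta: feed it
         * a negative s64 and it will happily scale a huge bogus value.
         */
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;

        /* High 64 bits of (delta * mul_frac), i.e. delta * mul_frac / 2^32 */
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

i.e. guest_tsc = this_tsc_write + scale_delta(kernel_ns - this_tsc_nsec,
mul, shift), where the subtraction is only meaningful as long as it
doesn't go negative.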
I say "should", because compute_guest_tsc() is currently buggy when
asked to scale a *negative* number. Trivially fixable though.

Now, let's look at what kvm_guest_time_update() is doing. It attempts
to do two things.

First it calculates the guest TSC at the reference point that it's
putting into the pvclock structure. That's what needs to go into the
'tsc_timestamp' field of the pvclock structure alongside the
corresponding KVM clock 'system_time' at 'kernel_ns'. In master clock
mode, the value it uses for kernel_ns is ka->master_kernel_ns, and
otherwise it is the current time.

It's perfectly reasonable for master_kernel_ns to be earlier in time
than vcpu->this_tsc_nsec. That just means the TSC value we write to
the pvclock ends up being lower than the value in vcpu->this_tsc_write,
by an appropriate number of cycles. So as long as compute_guest_tsc()
isn't buggy with negative numbers, that should all be fine.

But there *is* a bug in kvm_guest_time_update(), I think...

In tsc_catchup mode, simulating a TSC which runs faster than the host,
the delta between host and guest TSCs gets larger and larger over time.
That's why kvm_guest_time_update() is called *every* time the vCPU is
entered, to adjust the TSC further and further every time.

But currently, kvm_guest_time_update() only nudges the guest TSC as far
forward as it should have been at master_kernel_ns. At any time later
than master_kernel_ns, the delta should be even higher.

I think compute_guest_tsc() should look something like this, to cope
with the negativity:

static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
        s64 delta = kernel_ns - vcpu->arch.this_tsc_nsec;
        u64 tsc = vcpu->arch.this_tsc_write;

        /* pvclock_scale_delta cannot cope with negative deltas */
        if (delta >= 0)
                tsc += pvclock_scale_delta(delta,
                                           vcpu->arch.virtual_tsc_mult,
                                           vcpu->arch.virtual_tsc_shift);
        else
                tsc -= pvclock_scale_delta(-delta,
                                           vcpu->arch.virtual_tsc_mult,
                                           vcpu->arch.virtual_tsc_shift);
        return tsc;
}

And the catchup code in kvm_guest_time_update() should correct *both*
the reference time *and* the current TSC by *different* amounts,
something like this:

        if (vcpu->tsc_catchup) {
                uint64_t now_guest_tsc_adjusted;
                uint64_t now_guest_tsc_unadjusted;
                int64_t now_guest_tsc_delta;

                tsc_timestamp = compute_guest_tsc(v, kernel_ns);

                if (use_master_clock) {
                        uint64_t now_host_tsc;
                        int64_t now_kernel_ns;

                        if (!kvm_get_time_and_clockread(&now_kernel_ns, &now_host_tsc)) {
                                now_kernel_ns = get_kvmclock_base_ns();
                                now_host_tsc = rdtsc();
                        }

                        now_guest_tsc_adjusted = compute_guest_tsc(v, now_kernel_ns);
                        now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, now_host_tsc);
                } else {
                        now_guest_tsc_adjusted = tsc_timestamp;
                        /* host_tsc was read via rdtsc() just above */
                        now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, host_tsc);
                }

                now_guest_tsc_delta = now_guest_tsc_adjusted -
                                      now_guest_tsc_unadjusted;

                if (now_guest_tsc_delta > 0)
                        adjust_tsc_offset_guest(v, now_guest_tsc_delta);
        } else {
                tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
        }

Then we can drop that extra masterclock update in
kvm_track_tsc_matching(), along with the comment that
compute_guest_tsc() needs the masterclock snapshot to be newer.
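FWIW, a quick userspace harness (a standalone sketch with made-up
mul/shift values and the same scale_delta() model as above, not the
kernel code) shows the signed variant behaving sanely on both sides of
the TSC write:

#include <stdint.h>
#include <stdio.h>

/* Same 32.32 fixed-point scaling model as the earlier sketch. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* The proposed signed-aware compute_guest_tsc(), modelled in userspace. */
static uint64_t guest_tsc(int64_t kernel_ns, int64_t tsc_nsec,
                          uint64_t tsc_write, uint32_t mul, int shift)
{
        int64_t delta = kernel_ns - tsc_nsec;

        if (delta >= 0)
                return tsc_write + scale_delta(delta, mul, shift);
        return tsc_write - scale_delta(-delta, mul, shift);
}

int main(void)
{
        /* Made-up 3 GHz guest: 0xC0000000/2^32 << 2 == 3 cycles per ns. */
        const uint32_t mul = 0xC0000000;
        const int shift = 2;
        const int64_t tsc_nsec = 5000000000;            /* this_tsc_nsec */
        const uint64_t tsc_write = 10000000000ull;      /* this_tsc_write */

        /* One second after, and one second before, the TSC write:
         * prints 13000000000 and 7000000000, i.e. +/- 3e9 cycles. */
        printf("%llu\n", (unsigned long long)
               guest_tsc(tsc_nsec + 1000000000, tsc_nsec, tsc_write, mul, shift));
        printf("%llu\n", (unsigned long long)
               guest_tsc(tsc_nsec - 1000000000, tsc_nsec, tsc_write, mul, shift));
        return 0;
}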
> Forcing a masterclock update essentially fudged around that problem, but
> in a heavy handed way that introduced undesirable side effects, i.e.
> unnecessarily forces a masterclock update when a new vCPU joins the party
> via hotplug.
>
> Note, KVM forces masterclock updates in other weird ways that are also
> likely unnecessary, e.g. when establishing a new Xen shared info page and
> when userspace creates a brand new vCPU.  But the Xen thing is firmly a
> separate mess, and there are no known userspace VMMs that utilize kvmclock
> *and* create new vCPUs after the VM is up and running.  I.e. the other
> issues are future problems.
>
> Reported-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> Closes: https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@xxxxxxxxxx
> Fixes: 7f187922ddf6 ("KVM: x86: update masterclock values on TSC writes")
> Cc: David Woodhouse <dwmw2@xxxxxxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
>  1 file changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 530d4bc2259b..61bdb6c1d000 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
>  }
>  #endif
>
> -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
> +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
>  {
>  #ifdef CONFIG_X86_64
> -	bool vcpus_matched;
>  	struct kvm_arch *ka = &vcpu->kvm->arch;
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>
> -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> -			 atomic_read(&vcpu->kvm->online_vcpus));
> +	/*
> +	 * To use the masterclock, the host clocksource must be based on TSC
> +	 * and all vCPUs must have matching TSCs.  Note, the count for matching
> +	 * vCPUs doesn't include the reference vCPU, hence "+1".
> +	 */
> +	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
> +				 atomic_read(&vcpu->kvm->online_vcpus)) &&
> +				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
>
>  	/*
> -	 * Once the masterclock is enabled, always perform request in
> -	 * order to update it.
> -	 *
> -	 * In order to enable masterclock, the host clocksource must be TSC
> -	 * and the vcpus need to have matched TSCs.  When that happens,
> -	 * perform request to enable masterclock.
> +	 * Request a masterclock update if the masterclock needs to be toggled
> +	 * on/off, or when starting a new generation and the masterclock is
> +	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
> +	 * taken _after_ the new generation is created).
>  	 */
> -	if (ka->use_master_clock ||
> -	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
> +	if ((ka->use_master_clock && new_generation) ||
> +	    (ka->use_master_clock != use_master_clock))
>  		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>
>  	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
> @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
>  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>
> -	kvm_track_tsc_matching(vcpu);
> +	kvm_track_tsc_matching(vcpu, !matched);
>  }
>
>  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
>
> base-commit: 437bba5ad2bba00c2056c896753a32edf80860cc
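For the record, spelling out the old and new trigger conditions side by
side makes the behavior change for the hotplug case explicit (a
standalone sketch; the names are mine, and "tsc_ok_and_matched" stands
in for gtod_is_based_on_tsc() && vcpus_matched):

#include <stdbool.h>
#include <stdio.h>

/* Old predicate: update whenever the masterclock is already enabled,
 * or whenever it could now be enabled. */
static bool old_request(bool was_on, bool tsc_ok_and_matched)
{
        return was_on || tsc_ok_and_matched;
}

/* New predicate: update only to toggle the masterclock on/off, or when
 * a new TSC generation starts while the masterclock is enabled. */
static bool new_request(bool was_on, bool now_on, bool new_generation)
{
        return (was_on && new_generation) || (was_on != now_on);
}

int main(void)
{
        /* Hotplugged vCPU syncing to the current generation: masterclock
         * already on, TSCs still matched, no new generation. */
        printf("old: %d new: %d\n",
               old_request(true, true),         /* 1: forced update (the bug) */
               new_request(true, true, false)); /* 0: no spurious update */
        return 0;
}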