Hi Sean,

On 10/4/23 11:06 AM, Sean Christopherson wrote:
> On Wed, Oct 04, 2023, David Woodhouse wrote:
>> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
>>>> Can't we ensure that the kvmclock uses the *same* algorithm,
>>>> precisely, as CLOCK_MONOTONIC_RAW?
>>>
>>> Yes?  At least for sane hardware, after much staring, I think it's possible.
>>>
>>> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
>>> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
>>> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
>>>
>>> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
>>>
>>> Compile tested only, but I believe this math is correct.  And I'm guessing we'd
>>> want some safeguards against overflow, e.g. due to a multiplier that is too big.
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 6573c89c35a9..ae9275c3d580 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>>  					   v->arch.l1_tsc_scaling_ratio);
>>>  
>>>  	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>>> -		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> -				   &vcpu->hv_clock.tsc_shift,
>>> -				   &vcpu->hv_clock.tsc_to_system_mul);
>>> +		u32 shift, mult;
>>> +
>>> +		clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
>>> +
>>> +		if (shift <= 32) {
>>> +			vcpu->hv_clock.tsc_shift = 0;
>>> +			vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
>>> +		} else {
>>> +			kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> +					   &vcpu->hv_clock.tsc_shift,
>>> +					   &vcpu->hv_clock.tsc_to_system_mul);
>>> +		}
>>> +
>>>  		vcpu->hw_tsc_khz = tgt_tsc_khz;
>>>  		kvm_xen_update_tsc_info(v);
>>>  	}
>>>
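If it helps other readers, here is my understanding of why the shift <= 32
case can be folded into the pvclock ABI: with tsc_shift == 0, the guest-side
conversion (see pvclock_scale_delta()) degenerates to a multiply and a right
shift by 32. A sketch of the math (untested; the guest really does a
64x32->96-bit multiply, __int128 just keeps the sketch short):

	/* Guest-side pvclock scaling, as in pvclock_scale_delta(). */
	static u64 pvclock_delta_to_ns(u64 delta, s8 tsc_shift, u32 tsc_to_system_mul)
	{
		if (tsc_shift >= 0)
			delta <<= tsc_shift;
		else
			delta >>= -tsc_shift;

		return ((unsigned __int128)delta * tsc_to_system_mul) >> 32;
	}

	/*
	 * With tsc_shift == 0 and tsc_to_system_mul == mult << (32 - shift):
	 *
	 *   (delta * (mult << (32 - shift))) >> 32 == (delta * mult) >> shift
	 *
	 * which is exactly the clocksource-style scaling used on the host
	 * side.  It also shows the overflow hazard mentioned above:
	 * mult << (32 - shift) must still fit in the 32-bit tsc_to_system_mul.
	 */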
>> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
>> it got mult=1655736523, shift=32 and took the 'happy' path instead of
>> falling back.
>>
>> It still drifts about the same though, using the same test as before:
>> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock
>>
>> I was going to facetiously suggest that perhaps the kvmclock should
>> have leap nanoseconds... but then realised that that's basically what
>> Dongli's patch is *doing*. Maybe we just need to *recognise* that,
>
> Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's
> monotonic raw clock is a fool's errand.
>
>> so rather than having a user-configured period for the update, KVM could
>> calculate the frequency for the updates based on the rate at which the clocks
>> would otherwise drift, and a maximum delta? Not my favourite option, but
>> perhaps better than nothing?
>
> Holy moly, the existing code for the periodic syncs/updates is a mess.  If I'm
> reading the code correctly, commits
>
>   0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
>   7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
>   332967a3eac0 ("x86: kvm: introduce periodic global clock updates")
>
> splattered together an impressively inefficient update mechanism.
>
> On the first vCPU creation, KVM schedules kvmclock_sync_fn() with a hardcoded
> period of 300 * HZ, i.e. once every 300 seconds.
>
> 	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
> 		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> 				      KVMCLOCK_SYNC_PERIOD);
>
> That handler does two things: schedule the "delayed" work kvmclock_update_fn()
> to be executed immediately, and re-arm kvmclock_sync_fn() for another 300
> seconds.  kvmclock_update_fn() in turn kicks *every* vCPU in the VM, i.e. KVM
> kicks every vCPU to sync kvmclock once every 300 seconds.
>
> If we're going to kick every vCPU, then we might as well do a masterclock update,
> because the extra cost of synchronizing the masterclock is likely in the noise
> compared to kicking every vCPU.  There's also zero reason to do the work in vCPU
> context.
>
> And because that's not enough, on pCPU migration or if the TSC is unstable,
> kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
> kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with
> unstable TSCs.  But if KVM is periodically doing clock updates on all vCPUs,
> scheduling another update with a *longer* delay is silly.

We may need to add the above explanation at the places where
KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the
patch. That would help readers understand why KVM_REQ_CLOCK_UPDATE alone is
sometimes enough.

> The really, really stupid part of all this is that the periodic syncs happen
> even if kvmclock isn't exposed to the guest.  *sigh*
>
> So rather than add yet another periodic work function, I think we should clean up
> the mess we have, fix the whole "leapseconds" mess with the masterclock, and then
> tune the frequency (if necessary).
>
> Something like the below is what I'm thinking.  Once the dust settles, I'd like
> to dynamically enable/disable kvmclock_sync_work based on whether or not the
> VM actually has vCPUs with a pvclock, but that's definitely an enhancement that
> can go on top.
>
> Does this look sane, or am I missing something?
>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +-
>  arch/x86/kvm/x86.c              | 53 +++++++++++----------------------
>  2 files changed, 19 insertions(+), 37 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 34a64527654c..d108452fc301 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -98,7 +98,7 @@
>  	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_SCAN_IOAPIC \
>  	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> -#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
> +/* AVAILABLE BIT!!!! KVM_ARCH_REQ(16) */
>  #define KVM_REQ_APIC_PAGE_RELOAD \
>  	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
> @@ -1336,7 +1336,6 @@ struct kvm_arch {
>  	bool use_master_clock;
>  	u64 master_kernel_ns;
>  	u64 master_cycle_now;
> -	struct delayed_work kvmclock_update_work;
>  	struct delayed_work kvmclock_sync_work;
>
>  	struct kvm_xen_hvm_config xen_hvm_config;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..5d35724f1963 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
>  	}
>
>  	vcpu->arch.time = system_time;
> -	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> +	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

As mentioned above, we may need a comment here to explain why
KVM_REQ_CLOCK_UPDATE on only the current vCPU is enough.
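For example (untested; the wording is only a strawman):

	vcpu->arch.time = system_time;

	/*
	 * A guest write of the system-time MSR only moves this vCPU's
	 * pvclock area, so updating the local clock is sufficient; any
	 * cross-vCPU drift is handled by the periodic sync work rather
	 * than by kicking every vCPU.
	 */
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);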
>
>  	/* we verify if the enable bit is set... */
>  	if (system_time & 1)
> @@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>
>  #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
>
> -static void kvmclock_update_fn(struct work_struct *work)
> -{
> -	unsigned long i;
> -	struct delayed_work *dwork = to_delayed_work(work);
> -	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> -					   kvmclock_update_work);
> -	struct kvm *kvm = container_of(ka, struct kvm, arch);
> -	struct kvm_vcpu *vcpu;
> -
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> -		kvm_vcpu_kick(vcpu);
> -	}
> -}
> -
> -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> -{
> -	struct kvm *kvm = v->kvm;
> -
> -	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> -	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> -			      KVMCLOCK_UPDATE_DELAY);
> -}
> -
>  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)

Since David mentioned a "maximum delta", how about turning the above into a
module param with the current 300 * HZ as the default? BTW, 300 seconds should
be enough for the vCPU hotplug case, unless people prefer 1 hour or 1 day.
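For example, a minimal sketch mirroring how kvmclock_periodic_sync is already
exposed (untested; the parameter name and permissions are just placeholders):

	/* Period of the kvmclock sync work, in seconds. */
	static unsigned int kvmclock_sync_period_secs __read_mostly = 300;
	module_param(kvmclock_sync_period_secs, uint, 0444);

	#define KVMCLOCK_SYNC_PERIOD (kvmclock_sync_period_secs * HZ)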
>
>  static void kvmclock_sync_fn(struct work_struct *work)
>  {
> @@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  					   kvmclock_sync_work);
>  	struct kvm *kvm = container_of(ka, struct kvm, arch);
>
> -	if (!kvmclock_periodic_sync)
> -		return;
> +	if (ka->use_master_clock)
> +		kvm_update_masterclock(kvm);

Based on the source code, I think it is safe to call kvm_update_masterclock()
here. We want the masterclock to be updated only once; requesting
KVM_REQ_MASTERCLOCK_UPDATE on every vCPU here would be pointless.

I just want to point out that this runs on the shared workqueue, so the
workqueue stall detector may report false positives (e.g., due to
tsc_write_lock contention, although that lock should not be heavily
contended).

Thank you very much!

Dongli Zhang

> +	else
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
>
> -	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> -	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> -			      KVMCLOCK_SYNC_PERIOD);
> +	if (kvmclock_periodic_sync)
> +		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> +				      KVMCLOCK_SYNC_PERIOD);
>  }
>
>  /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
> @@ -4845,7 +4823,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		 * kvmclock on vcpu->cpu migration
>  		 */
>  		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
> -			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  		if (vcpu->cpu != cpu)
>  			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
>  		vcpu->cpu = cpu;
> @@ -10520,12 +10498,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			__kvm_migrate_timers(vcpu);
>  		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
>  			kvm_update_masterclock(vcpu->kvm);
> -		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
> -			kvm_gen_kvmclock_update(vcpu);
>  		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
>  			r = kvm_guest_time_update(vcpu);
>  			if (unlikely(r))
>  				goto out;
> +
> +			/*
> +			 * Ensure all other vCPUs synchronize "soon", e.g. so
> +			 * that all vCPUs recognize NTP corrections and drift
> +			 * corrections (relative to the kernel's raw clock).
> +			 */
> +			if (!kvmclock_periodic_sync)
> +				schedule_delayed_work(&vcpu->kvm->arch.kvmclock_sync_work,
> +						      KVMCLOCK_UPDATE_DELAY);
>  		}
>  		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
>  			kvm_mmu_sync_roots(vcpu);
> @@ -12345,7 +12330,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	kvm->arch.hv_root_tdp = INVALID_PAGE;
>  #endif
>
> -	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
>
>  	kvm_apicv_init(kvm);
> @@ -12387,7 +12371,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
>  void kvm_arch_sync_events(struct kvm *kvm)
>  {
>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
> -	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
>  	kvm_free_pit(kvm);
>  }
>
>
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
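P.S. On the shared-workqueue reminder above: if the stall-detector false
positives ever show up in practice, one option might be to give the sync work
its own workqueue instead of using the system one. A rough sketch (untested;
the name and flags are only an example):

	static struct workqueue_struct *kvmclock_sync_wq;

	/* at module init */
	kvmclock_sync_wq = alloc_workqueue("kvmclock-sync", WQ_UNBOUND, 1);

	/* queue the sync work on it instead of the system workqueue */
	queue_delayed_work(kvmclock_sync_wq, &kvm->arch.kvmclock_sync_work,
			   KVMCLOCK_SYNC_PERIOD);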