Hi Sean and David, On 10/2/23 09:37, Sean Christopherson wrote: > On Mon, Oct 02, 2023, David Woodhouse wrote: >> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote: >>> >>> >>> We want more frequent KVM_REQ_MASTERCLOCK_UPDATE. >>> >>> This is because: >>> >>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation. >>> >>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation. >>> >>> 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return >>> different results (this is expected because they have different >>> mult/shift/equation). >>> >>> 4. However, the base in kvmclock calculation (tsc_timestamp and system_time) >>> are derived from raw monotonic clock (master clock) >> >> That just seems wrong. I don't mean that you're incorrect; it seems >> *morally* wrong. >> >> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use >> a *different* mult/shift/equation (your #1) to convert TSC ticks to >> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2). >> >> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's >> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent. >> >> Fix that, and the whole problem goes away, doesn't it? >> >> What am I missing here, that means we can't do that? > > I believe the answer is that "struct pvclock_vcpu_time_info" and its math are > ABI between KVM and KVM guests. > > Like many of the older bits of KVM, my guess is that KVM's behavior is the product > of making things kinda sorta work with old hardware, i.e. was probably the least > awful solution in the days before constant TSCs, but is completely nonsensical on > modern hardware. > >> Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all? >> If KVM wants to decide that the TSC runs at a different frequency to >> the frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM >> just *stick* to that? > > Yeah, bouncing around guest time when the TSC is constant seems counterproductive. > > However, why does any of this matter if the host has a constant TSC? If that's > the case, a sane setup will expose a constant TSC to the guest and the guest will > use the TSC instead of kvmclock for the guest clocksource. > > Dongli, is this for long-lived "legacy" guests that were created on hosts without > a constant TSC? If not, then why is kvmclock being used? Or heaven forbid, are > you running on hardware without a constant TSC? :-) This is for test guests, and the host has all of below: tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust A clocksource is used for two things. 1. current_clocksource. Yes, I agree we should always use tsc on modern hardware. Do we need to update the documentation to always suggest TSC when it is constant, as I believe many users still prefer pv clock than tsc? Thanks to tsc ratio scaling, the live migration will not impact tsc. >From the source code, the rating of kvm-clock is still higher than tsc. BTW., how about to decrease the rating if guest detects constant tsc? 166 struct clocksource kvm_clock = { 167 .name = "kvm-clock", 168 .read = kvm_clock_get_cycles, 169 .rating = 400, 170 .mask = CLOCKSOURCE_MASK(64), 171 .flags = CLOCK_SOURCE_IS_CONTINUOUS, 172 .enable = kvm_cs_enable, 173 }; 1196 static struct clocksource clocksource_tsc = { 1197 .name = "tsc", 1198 .rating = 300, 1199 .read = read_tsc, 2. The sched_clock. The scheduling is impacted if there is big drift. Fortunately, according to my general test (without RT requirement), the impact is trivial unless the two masterclock *updates* are between a super long period. In the past, the scheduling was impacted a lot when the masterclock was still based on monotonic (not raw). https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53fafdbb8b21fa99dfd8376ca056bffde8cafc11 Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock operations (not only sched_clock), e.g., after line 300. 296 void __init kvmclock_init(void) 297 { 298 u8 flags; 299 300 if (!kvm_para_available() || !kvmclock) 301 return; 302 303 if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) { 304 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; 305 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; 306 } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) { 307 return; 308 } 309 310 if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu", 311 kvmclock_setup_percpu, NULL) < 0) { 312 return; 313 } 314 315 pr_info("kvm-clock: Using msrs %x and %x", 316 msr_kvm_system_time, msr_kvm_wall_clock); 317 318 this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]); 319 kvm_register_clock("primary cpu clock"); 320 pvclock_set_pvti_cpu0_va(hv_clock_boot); 321 322 if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) 323 pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); 324 325 flags = pvclock_read_flags(&hv_clock_boot[0].pvti); 326 kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); 327 328 x86_platform.calibrate_tsc = kvm_get_tsc_khz; 329 x86_platform.calibrate_cpu = kvm_get_tsc_khz; 330 x86_platform.get_wallclock = kvm_get_wallclock; 331 x86_platform.set_wallclock = kvm_set_wallclock; 332 #ifdef CONFIG_X86_LOCAL_APIC 333 x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock; 334 #endif 335 x86_platform.save_sched_clock_state = kvm_save_sched_clock_state; 336 x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state; 337 kvm_get_preset_lpj(); Should I introduce a new param to disable no-kvm-sched-clock only, or to introduce a new param to allow the selection of sched_clock? Those may not resolve the issue in another thread. (I guess there is a chance that to refresh the masterclock may reduce the drift in that case, although never tried) https://lore.kernel.org/all/00fba193-238e-49dc-fdc4-0b93f20569ec@xxxxxxxxxx/ Thank you very much! Dongli Zhang > > Not saying we shouldn't sanitize the kvmclock behavior, but knowing the exact > problematic configuration(s) will help us make a better decision on how to fix > the mess.