+PeterZ Thomas and Peter, We're trying to address an issue where KVM's paravirt kvmclock drifts from the host's TSC-based monotonic raw clock because of historical reasons (at least, AFAICT), even when the TSC is constant. Due to some dubious KVM behavior, KVM may sometimes re-sync kvmclock against the host's monotonic raw clock, which causes non-trivial jumps in time from the guest's perspective. Linux-as-a-guest demotes all paravirt clock sources when the TSC is constant and nonstop, and so the goofy KVM behavior isn't likely to affect the guest's clocksource, but the guest's sched_clock() implementation keeps using the paravirt clock. Irrespective of if/how we fix the KVM host-side mess, using a paravirt clock for the scheduler when using a constant, nonstop TSC for the clocksource seems at best inefficient, and at worst unnecessarily complex and risky. Is there any reason not to prefer native_sched_clock() over whatever paravirt clock is present when the TSC is the preferred clocksource? Assuming the desirable thing to do is to use native_sched_clock() in this scenario, do we need a separate rating system, or can we simply tie the sched clock selection to the clocksource selection, e.g. override the paravirt stuff if the TSC clock has higher priority and is chosen? Some more details below (and in the rest of the thread). Thanks! On Mon, Oct 02, 2023, Dongli Zhang wrote: > Hi Sean and David, > > On 10/2/23 09:37, Sean Christopherson wrote: > > However, why does any of this matter if the host has a constant TSC? If that's > > the case, a sane setup will expose a constant TSC to the guest and the guest will > > use the TSC instead of kvmclock for the guest clocksource. > > > > Dongli, is this for long-lived "legacy" guests that were created on hosts without > > a constant TSC? If not, then why is kvmclock being used? Or heaven forbid, are > > you running on hardware without a constant TSC? :-) > > This is for test guests, and the host has all of below: > > tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust > > A clocksource is used for two things. > > > 1. current_clocksource. Yes, I agree we should always use tsc on modern hardware. > > Do we need to update the documentation to always suggest TSC when it is > constant, as I believe many users still prefer pv clock than tsc? > > Thanks to tsc ratio scaling, the live migration will not impact tsc. > > >From the source code, the rating of kvm-clock is still higher than tsc. > > BTW., how about to decrease the rating if guest detects constant tsc? > > 166 struct clocksource kvm_clock = { > 167 .name = "kvm-clock", > 168 .read = kvm_clock_get_cycles, > 169 .rating = 400, > 170 .mask = CLOCKSOURCE_MASK(64), > 171 .flags = CLOCK_SOURCE_IS_CONTINUOUS, > 172 .enable = kvm_cs_enable, > 173 }; > > 1196 static struct clocksource clocksource_tsc = { > 1197 .name = "tsc", > 1198 .rating = 300, > 1199 .read = read_tsc, That's already done in kvmclock_init(). if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && !check_tsc_unstable()) kvm_clock.rating = 299; See also: https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@xxxxxxxxxx > 2. The sched_clock. > > The scheduling is impacted if there is big drift. ... > Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock > operations (not only sched_clock), e.g., after line 300. ... > Should I introduce a new param to disable no-kvm-sched-clock only, or to > introduce a new param to allow the selection of sched_clock? I don't think we want a KVM-specific knob, because every flavor of paravirt guest would need to do the same thing. And unless there's a good reason to use a paravirt clock, this really shouldn't be something the guest admin needs to opt into using.