Dropping a few people/lists whose emails are bouncing. On Fri, Jan 31, 2025, Sean Christopherson wrote: > @@ -369,6 +369,11 @@ void __init kvmclock_init(void) > #ifdef CONFIG_X86_LOCAL_APIC > x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock; > #endif > + /* > + * Save/restore "sched" clock state even if kvmclock isn't being used > + * for sched_clock, as kvmclock is still used for wallclock and relies > + * on these hooks to re-enable kvmclock after suspend+resume. This is wrong, wallclock is a different MSR entirely. > + */ > x86_platform.save_sched_clock_state = kvm_save_sched_clock_state; > x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state; And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being used as sched_clock, because when TSC is reset on suspend/hiberation, not doing tsc_{save,restore}_sched_clock_state() results in time going haywire. Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC" because pulling the rug out from under kvmclock leads to the same problem. The whole PV sched_clock scheme is a disaster. Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks, because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for sched_clock. And the code is all kinds of funky, because it tries to keep the x86 code isolated from the generic HV clock code, but (a) there's already x86 PV specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code means that Hyper-V overides the sched_clock save/restore hooks even when PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock. VMware appears to be buggy and doesn't do have offset adjustments, and also lets the TSC callbacks run. I can't tell if Xen is broken, or if it's the sanest of the bunch. Xen does save/restore things a la kvmclock, but only in the Xen PV suspend path. So if the "normal" suspend/hibernate paths are unreachable, Xen is sane. If not, Xen is quite broken. To make matters worse, kvmclock is a mess and has existing bugs. The BSP's clock is disabled during syscore_suspend() (via kvm_suspend()), but only re-enabled in the sched_clock callback. So if suspend is aborted due to a late wakeup, the BSP will run without its clock enabled, which "works" only because KVM-the-hypervisor is kind enough to not clobber the shared memory when the clock is disabled. But over time, I would expect time on the BSP to drift from APs. And then there's this crud: #ifdef CONFIG_X86_LOCAL_APIC x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock; #endif which (a) should be guarded by CONFIG_SMP, not X86_LOCAL_APIC, and (b) is only actually needed when kvmclock is sched_clock, because timekeeping doesn't actually need to start that early. But of course kvmclock craptastic handling of suspend and resume makes untangling that more difficult than it needs to be. The icing on the cake is that after cleaning up all the hacks, and having kvmclock hook clocksource.suspend/resume like it should, suspend/resume under kvmclock corrupts wall clock time because timekeeping_resume() reads the persistent clock before resuming clocksource clocks, and the stupid kvmclock wall clock subtly consumes the clocksource/system clock. *sigh* I have yet more patches to clean all of this up. The series is rather unwieldly, as it's now sitting at 38 patches (ugh), but I don't see a way to chunk it up in a meaningful way, because everything is so intertwined. :-/