From: Sean Christopherson <seanjc@xxxxxxxxxx> Sent: Friday, February 7, 2025 9:23 AM > > Dropping a few people/lists whose emails are bouncing. > > On Fri, Jan 31, 2025, Sean Christopherson wrote: > > @@ -369,6 +369,11 @@ void __init kvmclock_init(void) > > #ifdef CONFIG_X86_LOCAL_APIC > > x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock; > > #endif > > + /* > > + * Save/restore "sched" clock state even if kvmclock isn't being used > > + * for sched_clock, as kvmclock is still used for wallclock and relies > > + * on these hooks to re-enable kvmclock after suspend+resume. > > This is wrong, wallclock is a different MSR entirely. > > > + */ > > x86_platform.save_sched_clock_state = kvm_save_sched_clock_state; > > x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state; > > And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being > used as sched_clock, because when TSC is reset on suspend/hiberation, not doing > tsc_{save,restore}_sched_clock_state() results in time going haywire. > > Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV > sched_clock in CoCo guests with trusted TSC" because pulling the rug out from > under kvmclock leads to the same problem. > > The whole PV sched_clock scheme is a disaster. > > Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks, > because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for > sched_clock. And the code is all kinds of funky, because it tries to keep the > x86 code isolated from the generic HV clock code, but (a) there's already x86 PV > specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code > means that Hyper-V overides the sched_clock save/restore hooks even when > PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock. Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that it's not built if targeting some other architecture. Or do you see something else that is x86-specific? Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT. So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if PARAVIRT is forced to "n". So I don't think there's a current problem with the sched_clock save/restore hooks. But I would be good with some restructuring so that setting the sched clock save/restore hooks is more closely tied to the sched clock choice, as long as the architecture independence of hyperv_timer.c is preserved. And maybe there's a better way to handle hv_setup_sched_clock() that is less messy with the #ifdef's. I'll think about that too. Michael > > VMware appears to be buggy and doesn't do have offset adjustments, and also lets > the TSC callbacks run. > > I can't tell if Xen is broken, or if it's the sanest of the bunch. Xen does > save/restore things a la kvmclock, but only in the Xen PV suspend path. So if > the "normal" suspend/hibernate paths are unreachable, Xen is sane. If not, Xen > is quite broken. > > To make matters worse, kvmclock is a mess and has existing bugs. The BSP's clock > is disabled during syscore_suspend() (via kvm_suspend()), but only re-enabled in > the sched_clock callback. So if suspend is aborted due to a late wakeup, the BSP > will run without its clock enabled, which "works" only because KVM-the-hypervisor > is kind enough to not clobber the shared memory when the clock is disabled. But > over time, I would expect time on the BSP to drift from APs. > > And then there's this crud: > > #ifdef CONFIG_X86_LOCAL_APIC > x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock; > #endif > > which (a) should be guarded by CONFIG_SMP, not X86_LOCAL_APIC, and (b) is only > actually needed when kvmclock is sched_clock, because timekeeping doesn't actually > need to start that early. But of course kvmclock craptastic handling of suspend > and resume makes untangling that more difficult than it needs to be. > > The icing on the cake is that after cleaning up all the hacks, and having > kvmclock hook clocksource.suspend/resume like it should, suspend/resume under > kvmclock corrupts wall clock time because timekeeping_resume() reads the persistent > clock before resuming clocksource clocks, and the stupid kvmclock wall clock subtly > consumes the clocksource/system clock. *sigh* > > I have yet more patches to clean all of this up. The series is rather unwieldly, > as it's now sitting at 38 patches (ugh), but I don't see a way to chunk it up in > a meaningful way, because everything is so intertwined. :-/