On 10/9/2024 9:28 PM, Sean Christopherson wrote:
> On Wed, Oct 09, 2024, Nikunj A Dadhania wrote:
>> Although the kernel switches over to stable TSC clocksource instead of
>> kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
>> This is due to kvm_sched_clock_init() updating the pv_sched_clock()
>> unconditionally.
>
> All PV clocks are affected by this, no?

There are two things that we are trying to associate here: a registered PV
clocksource and a PV sched_clock override provided by that PV. Looking at
the code of the various x86 PV implementations:

a) Hyper-V: Does not override the sched clock when the TSC_INVARIANT
   feature is set. It implements something similar to
   kvm_sched_clock_init(), called only when the TSC is not stable [1].

b) VMware: Exports a reliable TSC to the guest. Does not register a
   clocksource. Overrides pv_sched_clock with its own version that uses
   rdtsc().

c) Xen: Overrides pv_sched_clock and registers its own clocksource. It has
   the same problem as KVM: pv_sched_clock is not switched back to
   native_sched_clock().

Effectively, KVM, Xen and Hyper-V (when TSC_INVARIANT is not available) can
be handled in a manner similar to this patch, by registering a callback to
override/restore pv_sched_clock when the corresponding clocksource is
chosen as the default clocksource. However, since VMware only wants to
override pv_sched_clock without registering a PV clocksource, I will need
to give this some more thought, as there is no callback available in that
case.

> This seems like something that should be handled in common code, which is
> the point I was trying to make in v11.

Let me think about whether this can be handled in common clocksource code.
We will also need to look at how other architectures are using this.

>
>> Use the clock source enable/disable callbacks to initialize
>> kvm_sched_clock_init() and update the pv_sched_clock().
>>
>> As the clock selection happens in the stop machine context, schedule
>> delayed work to update the static_call()
>>
>> Signed-off-by: Nikunj A Dadhania <nikunj@xxxxxxx>
>> ---
>>  arch/x86/kernel/kvmclock.c | 34 +++++++++++++++++++++++++++++-----
>>  1 file changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>> index 5b2c15214a6b..5cd3717e103b 100644
>> --- a/arch/x86/kernel/kvmclock.c
>> +++ b/arch/x86/kernel/kvmclock.c
>> @@ -21,6 +21,7 @@
>>  #include <asm/hypervisor.h>
>>  #include <asm/x86_init.h>
>>  #include <asm/kvmclock.h>
>> +#include <asm/timer.h>
>>
>>  static int kvmclock __initdata = 1;
>>  static int kvmclock_vsyscall __initdata = 1;
>> @@ -148,12 +149,39 @@ bool kvm_check_and_clear_guest_paused(void)
>>  	return ret;
>>  }
>>
>> +static u64 (*old_pv_sched_clock)(void);
>> +
>> +static void enable_kvm_sc_work(struct work_struct *work)
>> +{
>> +	u8 flags;
>> +
>> +	old_pv_sched_clock = static_call_query(pv_sched_clock);
>> +	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
>> +	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
>> +}
>> +
>> +static DECLARE_DELAYED_WORK(enable_kvm_sc, enable_kvm_sc_work);
>> +
>> +static void disable_kvm_sc_work(struct work_struct *work)
>> +{
>> +	if (old_pv_sched_clock)
>
> This feels like it should be a WARN condition, as IIUC, pv_sched_clock()
> should never be null. And it _looks_ wrong too, as it means kvm_clock will
> remain the sched clock if there was no old clock, which should be
> impossible.

Makes sense, I will add a WARN_ON to catch this condition.
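Something along these lines (untested sketch on top of this patch; whether
to bail out or still restore after warning is open for discussion):

static void disable_kvm_sc_work(struct work_struct *work)
{
	/*
	 * enable_kvm_sc_work() must have saved the previous sched clock
	 * before kvm_sched_clock_init() overrode it; a NULL pointer here
	 * means the save/restore pairing is broken.
	 */
	if (WARN_ON(!old_pv_sched_clock))
		return;

	paravirt_set_sched_clock(old_pv_sched_clock);
}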
>
>> +		paravirt_set_sched_clock(old_pv_sched_clock);

Regards
Nikunj

1. https://lore.kernel.org/lkml/ef194c25-22d8-204e-ffb6-8f9f0a0621fb@xxxxxxx/