On 9/25/2024 6:25 PM, Sean Christopherson wrote: > On Wed, Sep 25, 2024, Nikunj A. Dadhania wrote: >>>>>>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock >>>>>>> should be disabled assuming that timesource is stable and always running? >>>>>> >>>>>> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC >>>>>> is stable, irrespective of SNP or TDX. This is effectively already done for the >>>>>> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if >>>>>> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the >>>>>> kvm_sched_clock_init() code. >>>>> >>>>> The kvm-clock and tsc-early both are having the rating of 299. As they are of >>>>> same rating, kvm-clock is being picked up first. >>>>> >>>>> Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will >>>>> be picked up instead. >>>> >>>> IMO, it's ugly, but that's a problem with the rating system inasmuch as anything. >>>> >>>> But the kernel will still be using kvmclock for the scheduler clock, which is >>>> undesirable. >>> >>> Agree, kvm_sched_clock_init() is still being called. The above hunk was to use >>> tsc-early/tsc as the clocksource and not kvm-clock. >> >> How about the below patch: >> >> From: Nikunj A Dadhania <nikunj@xxxxxxx> >> Date: Tue, 28 Nov 2023 18:29:56 +0530 >> Subject: [RFC PATCH] x86/kvmclock: Prefer invariant TSC as the clocksource and >> scheduler clock >> >> For platforms that support stable and always running TSC, although the >> kvm-clock rating is dropped to 299 to prefer TSC, the guest scheduler clock >> still keeps on using the kvm-clock which is undesirable. Moreover, as the >> kvm-clock and early-tsc clocksource are both registered with 299 rating, >> kvm-clock is being picked up momentarily instead of selecting more stable >> tsc-early clocksource. >> >> kvm-clock: Using msrs 4b564d01 and 4b564d00 >> kvm-clock: using sched offset of 1799357702246960 cycles >> clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns >> tsc: Detected 1996.249 MHz processor >> clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns >> clocksource: Switched to clocksource kvm-clock >> clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns >> clocksource: Switched to clocksource tsc >> >> Drop the kvm-clock rating to 298, so that tsc-early is picked up before >> kvm-clock and use TSC for scheduler clock as well when the TSC is invariant >> and stable. >> >> Signed-off-by: Nikunj A Dadhania <nikunj@xxxxxxx> >> >> --- >> >> The issue we see here is that on bare-metal if the TSC is marked unstable, >> then the sched-clock will fall back to jiffies. In the virtualization case, >> do we want to fall back to kvm-clock when TSC is marked unstable? > > In the general case, yes. Though that might be a WARN-able offense if the TSC > is allegedly constant+nonstop. And for SNP and TDX, it might be a "panic and do > not boot" offense, since using kvmclock undermines the security of the guest. > >> --- >> arch/x86/kernel/kvmclock.c | 11 ++++++----- >> 1 file changed, 6 insertions(+), 5 deletions(-) >> >> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c >> index 5b2c15214a6b..c997b2628c4b 100644 >> --- a/arch/x86/kernel/kvmclock.c >> +++ b/arch/x86/kernel/kvmclock.c >> @@ -317,9 +317,6 @@ void __init kvmclock_init(void) >> if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) >> pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); >> >> - flags = pvclock_read_flags(&hv_clock_boot[0].pvti); >> - kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); >> - >> x86_platform.calibrate_tsc = kvm_get_tsc_khz; >> x86_platform.calibrate_cpu = kvm_get_tsc_khz; >> x86_platform.get_wallclock = kvm_get_wallclock; >> @@ -341,8 +338,12 @@ void __init kvmclock_init(void) >> */ >> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && >> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && >> - !check_tsc_unstable()) >> - kvm_clock.rating = 299; >> + !check_tsc_unstable()) { >> + kvm_clock.rating = 298; >> + } else { >> + flags = pvclock_read_flags(&hv_clock_boot[0].pvti); >> + kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); >> + } > > I would really, really like to fix this in a centralized location, not by having > each PV clocksource muck with their clock's rating. TSC Clock Rating Adjustment: * During TSC initialization, downgrade the TSC clock rating to 200 if TSC is not constant/reliable, placing it below HPET. * Ensure the kvm-clock rating is set to 299 by default in the struct clocksource kvm_clock. * Avoid changing the kvm clock rating based on the availability of reliable clock sources. Let the TSC clock source determine and downgrade itself. The above will make sure that the PV clocksource rating remain unaffected. Clock soure selection order when the ratings match: * Currently, clocks are registered and enqueued based on their rating. * When clock ratings are tied, use the advertised clock frequency(freq_khz) as a secondary key to favor clocks with better frequency. This approach improves the selection process by considering both rating and frequency. Something like below: diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index d0538a75f4c6..591451ccc0fa 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -1098,6 +1098,9 @@ static void clocksource_enqueue(struct clocksource *cs) /* Keep track of the place, where to insert */ if (tmp->rating < cs->rating) break; + if (tmp->rating == cs->rating && tmp->freq_khz < cs->freq_khz) + break; + entry = &tmp->list; } list_add(&cs->list, entry); @@ -1133,6 +1136,9 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq * clocksource with mask >= 40-bit and f >= 4GHz. That maps to * ~ 0.06ppm granularity for NTP. */ + + cs->freq_khz = freq * scale / 1000; + sec = cs->mask; do_div(sec, freq); do_div(sec, scale); > I'm not even sure the existing > code is entirely correct, as kvmclock_init() runs _before_ tsc_early_init(). Which > is desirable in the legacy case, as it allows calibrating the TSC using kvmclock, > > x86_platform.calibrate_tsc = kvm_get_tsc_khz; > > but on modern setups that's definitely undesirable, as it means the kernel won't > use CPUID.0x15, which every explicitly tells software the frequency of the TSC. > > And I don't think we want to simply point at native_calibrate_tsc(), because that > thing is not at all correct for a VM, where checking x86_vendor and x86_vfm is at > best sketchy. > > E.g. I would think it's in AMD's interest for Secure TSC to define > the TSC frequency using CPUID.0x15, even if AMD CPUs don't (yet) natively support > CPUID.0x15. For SecureTSC: GUEST_TSC_FREQ MSR (C001_0134h) provides the TSC frequency. > In other words, I think we need to overhaul the PV clock vs. TSC logic so that it > makes sense for modern CPUs+VMs, not just keep hacking away at kvmclock. I don't > expect the code would be all that complex in the end, the hardest part is likely > just figuring out (and agreeing on) what exactly the kernel should be doing. To summarise this thread with respect to TSC vs KVM clock, there three key questions: 1) When should kvmclock init be done? 2) How should the TSC frequency be discovered? 3) What should be the sched clock source and how should it be selected in a generic way? ○ Legacy CPU/VMs: VMs running on platforms without non-stop/constant TSC + kvm-clock should be registered before tsc-early/tsc + Need to calibrate TSC frequency + Use kvmclock wallclock + Use kvmclock for sched clock selected dynamicaly (using clocksource enable()/disable() callback) ○ Modern CPU/VMs: VMs running on platforms supporting constant, non-stop and reliable TSC + kvm-clock should be registered before tsc-early/tsc + TSC Frequency: Intel: TSC frequency using CPUID 0x15H/ 0x16H ? For SecureTSC: GUEST_TSC_FREQ MSR (C001_0134h) provides the TSC frequency, other AMD guests need to calibrate the TSC frequency. + Use kvmclock wallclock + Use TSC for sched clock After reviewing the code, the current init sequence looks correct for both legacy and modern VMs/CPUs. Let kvmclock go ahead and register itself as a clocksource, although registration of the sched clock can be deferred until the kvm-clock is picked up as the clock source. And restore the old sched clock when kvmclock is disabled. Something like the below patch, lightly tested: diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index 5b2c15214a6b..7167caa3348d 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -21,6 +21,7 @@ #include <asm/hypervisor.h> #include <asm/x86_init.h> #include <asm/kvmclock.h> +#include <asm/timer.h> static int kvmclock __initdata = 1; static int kvmclock_vsyscall __initdata = 1; @@ -148,12 +149,41 @@ bool kvm_check_and_clear_guest_paused(void) return ret; } +static u64 (*old_pv_sched_clock)(void); + +static void enable_kvm_schedclock_work(struct work_struct *work) +{ + u8 flags; + + old_pv_sched_clock = static_call_query(pv_sched_clock); + flags = pvclock_read_flags(&hv_clock_boot[0].pvti); + kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); +} + +static DECLARE_DELAYED_WORK(enable_kvm_sc, enable_kvm_schedclock_work); + +static void disable_kvm_schedclock_work(struct work_struct *work) +{ + if (old_pv_sched_clock) + paravirt_set_sched_clock(old_pv_sched_clock); +} +static DECLARE_DELAYED_WORK(disable_kvm_sc, disable_kvm_schedclock_work); + static int kvm_cs_enable(struct clocksource *cs) { + u8 flags; + vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK); + schedule_delayed_work(&enable_kvm_sc, 0); + return 0; } +static void kvm_cs_disable(struct clocksource *cs) +{ + schedule_delayed_work(&disable_kvm_sc, 0); +} + static struct clocksource kvm_clock = { .name = "kvm-clock", .read = kvm_clock_get_cycles, @@ -162,6 +192,7 @@ static struct clocksource kvm_clock = { .flags = CLOCK_SOURCE_IS_CONTINUOUS, .id = CSID_X86_KVM_CLK, .enable = kvm_cs_enable, + .disable = kvm_cs_disable, }; static void kvm_register_clock(char *txt) @@ -287,8 +318,6 @@ static int kvmclock_setup_percpu(unsigned int cpu) void __init kvmclock_init(void) { - u8 flags; - if (!kvm_para_available() || !kvmclock) return; @@ -317,9 +346,6 @@ void __init kvmclock_init(void) if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); - flags = pvclock_read_flags(&hv_clock_boot[0].pvti); - kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); - x86_platform.calibrate_tsc = kvm_get_tsc_khz; x86_platform.calibrate_cpu = kvm_get_tsc_khz; x86_platform.get_wallclock = kvm_get_wallclock; Regards Nikunj