On Wed, Sep 25, 2024, Nikunj A. Dadhania wrote: > >>>>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock > >>>>> should be disabled assuming that timesource is stable and always running? > >>>> > >>>> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC > >>>> is stable, irrespective of SNP or TDX. This is effectively already done for the > >>>> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if > >>>> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the > >>>> kvm_sched_clock_init() code. > >>> > >>> The kvm-clock and tsc-early both are having the rating of 299. As they are of > >>> same rating, kvm-clock is being picked up first. > >>> > >>> Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will > >>> be picked up instead. > >> > >> IMO, it's ugly, but that's a problem with the rating system inasmuch as anything. > >> > >> But the kernel will still be using kvmclock for the scheduler clock, which is > >> undesirable. > > > > Agree, kvm_sched_clock_init() is still being called. The above hunk was to use > > tsc-early/tsc as the clocksource and not kvm-clock. > > How about the below patch: > > From: Nikunj A Dadhania <nikunj@xxxxxxx> > Date: Tue, 28 Nov 2023 18:29:56 +0530 > Subject: [RFC PATCH] x86/kvmclock: Prefer invariant TSC as the clocksource and > scheduler clock > > For platforms that support stable and always running TSC, although the > kvm-clock rating is dropped to 299 to prefer TSC, the guest scheduler clock > still keeps on using the kvm-clock which is undesirable. Moreover, as the > kvm-clock and early-tsc clocksource are both registered with 299 rating, > kvm-clock is being picked up momentarily instead of selecting more stable > tsc-early clocksource. > > kvm-clock: Using msrs 4b564d01 and 4b564d00 > kvm-clock: using sched offset of 1799357702246960 cycles > clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns > tsc: Detected 1996.249 MHz processor > clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns > clocksource: Switched to clocksource kvm-clock > clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns > clocksource: Switched to clocksource tsc > > Drop the kvm-clock rating to 298, so that tsc-early is picked up before > kvm-clock and use TSC for scheduler clock as well when the TSC is invariant > and stable. > > Signed-off-by: Nikunj A Dadhania <nikunj@xxxxxxx> > > --- > > The issue we see here is that on bare-metal if the TSC is marked unstable, > then the sched-clock will fall back to jiffies. In the virtualization case, > do we want to fall back to kvm-clock when TSC is marked unstable? In the general case, yes. Though that might be a WARN-able offense if the TSC is allegedly constant+nonstop. And for SNP and TDX, it might be a "panic and do not boot" offense, since using kvmclock undermines the security of the guest. > --- > arch/x86/kernel/kvmclock.c | 11 ++++++----- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c > index 5b2c15214a6b..c997b2628c4b 100644 > --- a/arch/x86/kernel/kvmclock.c > +++ b/arch/x86/kernel/kvmclock.c > @@ -317,9 +317,6 @@ void __init kvmclock_init(void) > if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) > pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); > > - flags = pvclock_read_flags(&hv_clock_boot[0].pvti); > - kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); > - > x86_platform.calibrate_tsc = kvm_get_tsc_khz; > x86_platform.calibrate_cpu = kvm_get_tsc_khz; > x86_platform.get_wallclock = kvm_get_wallclock; > @@ -341,8 +338,12 @@ void __init kvmclock_init(void) > */ > if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && > boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && > - !check_tsc_unstable()) > - kvm_clock.rating = 299; > + !check_tsc_unstable()) { > + kvm_clock.rating = 298; > + } else { > + flags = pvclock_read_flags(&hv_clock_boot[0].pvti); > + kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); > + } I would really, really like to fix this in a centralized location, not by having each PV clocksource muck with their clock's rating. I'm not even sure the existing code is entirely correct, as kvmclock_init() runs _before_ tsc_early_init(). Which is desirable in the legacy case, as it allows calibrating the TSC using kvmclock, x86_platform.calibrate_tsc = kvm_get_tsc_khz; but on modern setups that's definitely undesirable, as it means the kernel won't use CPUID.0x15, which every explicitly tells software the frequency of the TSC. And I don't think we want to simply point at native_calibrate_tsc(), because that thing is not at all correct for a VM, where checking x86_vendor and x86_vfm is at best sketchy. E.g. I would think it's in AMD's interest for Secure TSC to define the TSC frequency using CPUID.0x15, even if AMD CPUs don't (yet) natively support CPUID.0x15. In other words, I think we need to overhaul the PV clock vs. TSC logic so that it makes sense for modern CPUs+VMs, not just keep hacking away at kvmclock. I don't expect the code would be all that complex in the end, the hardest part is likely just figuring out (and agreeing on) what exactly the kernel should be doing. > clocksource_register_hz(&kvm_clock, NSEC_PER_SEC); > pv_info.name = "KVM"; > -- > 2.34.1 >