On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote: > My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is constant, > nonstop, and not marked stable via command line. So how do we deal with the case here where a malicious HV could set those bits and still tweak the TSC? IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when in a guest. If anything, trusting the TSC in a normal guest should at least issue a warning saying that we do use the TSC but there's no 100% guarantee that it is trustworthy... > But wait, there's more! Because TDX doesn't override .calibrate_tsc() or > .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the > frequency of the TSC, unless I'm missing something, tsc_early_init() will compute > the TSC frequency using the information provided by KVM, i.e. the untrusted host. Yeah, I guess we don't want that. Or at least we should warn about it. > + /* > + * If the TSC counts at a constant frequency across P/T states, counts > + * in deep C-states, and the TSC hasn't been marked unstable, prefer > + * the TSC over kvmclock for sched_clock and drop kvmclock's rating so > + * that TSC is chosen as the clocksource. Note, the TSC unstable check > + * exists purely to honor the TSC being marked unstable via command > + * line, any runtime detection of an unstable will happen after this. > + */ > + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && > + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && > + !check_tsc_unstable()) { > + kvm_clock.rating = 299; > + pr_warn("kvm-clock: Using native sched_clock\n"); The warn is in the right direction but probably should say TSC still cannot be trusted 100%. > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c > index 0864b314c26a..9baffb425386 100644 > --- a/arch/x86/kernel/tsc.c > +++ b/arch/x86/kernel/tsc.c > @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void) > unsigned int eax_denominator, ebx_numerator, ecx_hz, edx; > unsigned int crystal_khz; > > - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) > + /* > + * Ignore the vendor when running as a VM, if the hypervisor provides > + * garbage CPUID information then the vendor is also suspect. > + */ > + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL && > + !boot_cpu_has(X86_FEATURE_HYPERVISOR)) > return 0; > > if (boot_cpu_data.cpuid_level < 0x15) > @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void) > return 0; > > /* > - * For Atom SoCs TSC is the only reliable clocksource. > - * Mark TSC reliable so no watchdog on it. > + * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a > + * VM, any watchdog is going to be less reliable than the TSC as the > + * watchdog source will be emulated in software. In both cases, mark > + * the TSC reliable so that no watchdog runs on it. > */ > - if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT) > + if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || > + boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT) > setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); > > #ifdef CONFIG_X86_LOCAL_APIC It looks all wrong if a function called native_* sprinkles a bunch of "am I a guest" checks. I guess we should split it into VM and native variants. But yeah, the general direction is ok once we agree on what we do when exactly. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette