On Thu, Jan 16, 2025, Borislav Petkov wrote: > On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote: > > My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is constant, > > nonstop, and not marked stable via command line. > > So how do we deal with the case here where a malicious HV could set those bits > and still tweak the TSC? You don't. If the hypervisor is malicious, it's game over for non-CoCo guests. The clocksource being untrusted is completely unintersting because the host has full access to VM state. It's probably game over for SEV and SEV-ES too, e.g. thanks to remapping attacks, lack of robust attestation, and a variety of other avenues a truly malicious hypervisor can exploit to attack the guest. It's only with SNP and TDX that the clocksource becomes at all interesting. > IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when > in a guest. > > If anything, trusting the TSC in a normal guest should at least issue > a warning saying that we do use the TSC but there's no 100% guarantee that it > is trustworthy... Why? For non-TDX/SecureTSC guests, *all* clocksources are host controlled. > > But wait, there's more! Because TDX doesn't override .calibrate_tsc() or > > .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the > > frequency of the TSC, unless I'm missing something, tsc_early_init() will compute > > the TSC frequency using the information provided by KVM, i.e. the untrusted host. > > Yeah, I guess we don't want that. Or at least we should warn about it. CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC would ideally assert that the kernel doesn't switch to some other calibration method too. Not sure where to hook into that though, without bleeding TDX and SNP details everywhere. > > + /* > > + * If the TSC counts at a constant frequency across P/T states, counts > > + * in deep C-states, and the TSC hasn't been marked unstable, prefer > > + * the TSC over kvmclock for sched_clock and drop kvmclock's rating so > > + * that TSC is chosen as the clocksource. Note, the TSC unstable check > > + * exists purely to honor the TSC being marked unstable via command > > + * line, any runtime detection of an unstable will happen after this. > > + */ > > + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && > > + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && > > + !check_tsc_unstable()) { > > + kvm_clock.rating = 299; > > + pr_warn("kvm-clock: Using native sched_clock\n"); > > The warn is in the right direction but probably should say TSC still cannot be > trusted 100%. Heh, I didn't mean to include the printks, they were purely for my own debugging. > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c > > index 0864b314c26a..9baffb425386 100644 > > --- a/arch/x86/kernel/tsc.c > > +++ b/arch/x86/kernel/tsc.c > > @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void) > > unsigned int eax_denominator, ebx_numerator, ecx_hz, edx; > > unsigned int crystal_khz; > > > > - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) > > + /* > > + * Ignore the vendor when running as a VM, if the hypervisor provides > > + * garbage CPUID information then the vendor is also suspect. > > + */ > > + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL && > > + !boot_cpu_has(X86_FEATURE_HYPERVISOR)) > > return 0; > > > > if (boot_cpu_data.cpuid_level < 0x15) > > @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void) > > return 0; > > > > /* > > - * For Atom SoCs TSC is the only reliable clocksource. > > - * Mark TSC reliable so no watchdog on it. > > + * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a > > + * VM, any watchdog is going to be less reliable than the TSC as the > > + * watchdog source will be emulated in software. In both cases, mark > > + * the TSC reliable so that no watchdog runs on it. > > */ > > - if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT) > > + if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || > > + boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT) > > setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); > > > > #ifdef CONFIG_X86_LOCAL_APIC > > It looks all wrong if a function called native_* sprinkles a bunch of "am > I a guest" checks. I guess we should split it into VM and native variants. I agree the naming is weird, but outside of the vendor checks, the VM code is identical to the "native" code, so I don't know that it's worth splitting into multiple functions. What if we simply rename it to calibrate_tsc_from_cpuid()? > But yeah, the general direction is ok once we agree on what we do when > exactly.