Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock

Borislav Petkov <bp@xxxxxxxxx> · Thu, 16 Jan 2025 17:25:25 +0100

On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote:
> My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked stable via command line.

So how do we deal with the case here where a malicious HV could set those bits
and still tweak the TSC?

IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when
in a guest.

If anything, trusting the TSC in a normal guest should at least issue
a warning saying that we do use the TSC but there's no 100% guarantee that it
is trustworthy...

> But wait, there's more!  Because TDX doesn't override .calibrate_tsc() or
> .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> the TSC frequency using the information provided by KVM, i.e. the untrusted host.

Yeah, I guess we don't want that. Or at least we should warn about it.

> +	/*
> +	 * If the TSC counts at a constant frequency across P/T states, counts
> +	 * in deep C-states, and the TSC hasn't been marked unstable, prefer
> +	 * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
> +	 * that TSC is chosen as the clocksource.  Note, the TSC unstable check
> +	 * exists purely to honor the TSC being marked unstable via command
> +	 * line, any runtime detection of an unstable will happen after this.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> +	    !check_tsc_unstable()) {
> +		kvm_clock.rating = 299;
> +		pr_warn("kvm-clock: Using native sched_clock\n");

The warn is in the right direction but probably should say TSC still cannot be
trusted 100%.

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
>  	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
>  	unsigned int crystal_khz;
>  
> -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +	/*
> +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> +	 * garbage CPUID information then the vendor is also suspect.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return 0;
>  
>  	if (boot_cpu_data.cpuid_level < 0x15)
> @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
>  		return 0;
>  
>  	/*
> -	 * For Atom SoCs TSC is the only reliable clocksource.
> -	 * Mark TSC reliable so no watchdog on it.
> +	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
> +	 * VM, any watchdog is going to be less reliable than the TSC as the
> +	 * watchdog source will be emulated in software.  In both cases, mark
> +	 * the TSC reliable so that no watchdog runs on it.
>  	 */
> -	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> +	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
>  		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
>  
>  #ifdef CONFIG_X86_LOCAL_APIC

It looks all wrong if a function called native_* sprinkles a bunch of "am
I a guest" checks. I guess we should split it into VM and native variants.

But yeah, the general direction is ok once we agree on what we do when
exactly.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette