Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 17 Jan 2025 12:59:37 -0800

On Fri, Jan 17, 2025, Borislav Petkov wrote:
> On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote:
> > It's only with SNP and TDX that the clocksource becomes at all interesting.
> 
> So basically you're saying, let's just go ahead and trust the TSC when the HV
> sets a bunch of CPUID bits.

Sort of.  It's not a trust thing though.  The Xen, KVM, and VMware PV clocks are
all based on TSC, i.e. we already "trust" the hypervisor to not muck with TSC.

The purpose of such PV clocks is to account for things in software, that aren't
handled in hardware.  E.g. to provide a constant counter on hardware without a
constant TSC.

The proposal here, and what kvmclock already does for clocksource, is to use the
raw TSC when the hypervisor sets bits that effectively say that the massaging of
TSC done by the PV clock isn't needed.

> But we really really trust it when the guest type is SNP+STSC or TDX since
> there the HV is out of the picture and the only one who can flub it there is
> the OEM.

Yep.  This one _is_ about trust.  Specifically, we trust the raw TSC more than
any clock that is provided by the HV.

> > CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
> > would ideally assert that the kernel doesn't switch to some other calibration
> > method too.  Not sure where to hook into that though, without bleeding TDX and
> > SNP details everywhere.
> 
> We could use the platform calibrate* function pointers and assign TDX- or
> SNP-specific ones and perhaps even define new such function ptrs. That's what
> the platform stuff is for... needs staring, ofc.
> 
> > I agree the naming is weird, but outside of the vendor checks, the VM code is
> > identical to the "native" code, so I don't know that it's worth splitting into
> > multiple functions.
> > 
> > What if we simply rename it to calibrate_tsc_from_cpuid()?
> 
> This is all wrong layering with all those different guest types having their
> own ->calibrate_tsc:
> 
> arch/x86/kernel/cpu/acrn.c:32:  x86_platform.calibrate_tsc = acrn_get_tsc_khz;
> arch/x86/kernel/cpu/mshyperv.c:424:             x86_platform.calibrate_tsc = hv_get_tsc_khz;
> arch/x86/kernel/cpu/vmware.c:419:               x86_platform.calibrate_tsc = vmware_get_tsc_khz;
> arch/x86/kernel/jailhouse.c:213:        x86_platform.calibrate_tsc              = jailhouse_get_tsc;
> arch/x86/kernel/kvmclock.c:323: x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> arch/x86/kernel/tsc.c:944:      tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/tsc.c:1458:                     tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/x86_init.c:148: .calibrate_tsc                  = native_calibrate_tsc,
> arch/x86/xen/time.c:569:        x86_platform.calibrate_tsc = xen_tsc_khz;
> 
> What you want sounds like a redesign to me considering how you want to keep
> the KVM guest code and baremetal pretty close... Hmmm, needs staring...

It's not KVM guest code though.  The CPUID stuff is Intel's architecturally
defined behavior.  There are oodles and oodles of features that are transparently
emulated by the hypervisor according to hardware specifications.  Generally
speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(),
etc.

What I am proposing is that, for TDX especially, instead of relying on the hypervisor
to use a paravirtual channel for communicating the TSC frequency, we rely on the
hardware-defined way of getting the frequency, because CPUID is emulated by the
trusted entity, i.e. the OEM.

Hmm, though I suppose I'm arguing against myself in that case.  If the hypervisor
provides the frequency and there are no trust issues, why would we care if the
kernel gets the frequency via CPUID or the PV channel.  It's really only TDX that
matters.  And we could handle TDX by overriding .calibrate_tsc() in tsc_init(),
same as Secure TSC.

That said, I do think it makes sense to either override the vendor and F/M/S
checks native_calibrate_tsc().  Or even better drop the initial vendor check
entirely, because both Intel and AMD have a rich history of implementing each
other's CPUID leaves.  I.e. I see no reason to ignore CPUID 0x15 just because
the CPU isn't Intel.

As for the Goldmost F/M/S check, that one is a virtualization specific thing.
The argument is that when running as a guest, any non-TSC clocksource is going
to be emulated by the hypervisor, and therefore is going to be less reliable than
TSC.  I.e. putting a watchdog on TSC does more harm than good, because what ends
up happening is the TSC gets marked unreliable because the *watchdog* is unreliable.