On Fri, Jan 17, 2025, Borislav Petkov wrote: > On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote: > > It's only with SNP and TDX that the clocksource becomes at all interesting. > > So basically you're saying, let's just go ahead and trust the TSC when the HV > sets a bunch of CPUID bits. Sort of. It's not a trust thing though. The Xen, KVM, and VMware PV clocks are all based on TSC, i.e. we already "trust" the hypervisor to not muck with TSC. The purpose of such PV clocks is to account for things in software, that aren't handled in hardware. E.g. to provide a constant counter on hardware without a constant TSC. The proposal here, and what kvmclock already does for clocksource, is to use the raw TSC when the hypervisor sets bits that effectively say that the massaging of TSC done by the PV clock isn't needed. > But we really really trust it when the guest type is SNP+STSC or TDX since > there the HV is out of the picture and the only one who can flub it there is > the OEM. Yep. This one _is_ about trust. Specifically, we trust the raw TSC more than any clock that is provided by the HV. > > CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC > > would ideally assert that the kernel doesn't switch to some other calibration > > method too. Not sure where to hook into that though, without bleeding TDX and > > SNP details everywhere. > > We could use the platform calibrate* function pointers and assign TDX- or > SNP-specific ones and perhaps even define new such function ptrs. That's what > the platform stuff is for... needs staring, ofc. > > > I agree the naming is weird, but outside of the vendor checks, the VM code is > > identical to the "native" code, so I don't know that it's worth splitting into > > multiple functions. > > > > What if we simply rename it to calibrate_tsc_from_cpuid()? > > This is all wrong layering with all those different guest types having their > own ->calibrate_tsc: > > arch/x86/kernel/cpu/acrn.c:32: x86_platform.calibrate_tsc = acrn_get_tsc_khz; > arch/x86/kernel/cpu/mshyperv.c:424: x86_platform.calibrate_tsc = hv_get_tsc_khz; > arch/x86/kernel/cpu/vmware.c:419: x86_platform.calibrate_tsc = vmware_get_tsc_khz; > arch/x86/kernel/jailhouse.c:213: x86_platform.calibrate_tsc = jailhouse_get_tsc; > arch/x86/kernel/kvmclock.c:323: x86_platform.calibrate_tsc = kvm_get_tsc_khz; > arch/x86/kernel/tsc.c:944: tsc_khz = x86_platform.calibrate_tsc(); > arch/x86/kernel/tsc.c:1458: tsc_khz = x86_platform.calibrate_tsc(); > arch/x86/kernel/x86_init.c:148: .calibrate_tsc = native_calibrate_tsc, > arch/x86/xen/time.c:569: x86_platform.calibrate_tsc = xen_tsc_khz; > > What you want sounds like a redesign to me considering how you want to keep > the KVM guest code and baremetal pretty close... Hmmm, needs staring... It's not KVM guest code though. The CPUID stuff is Intel's architecturally defined behavior. There are oodles and oodles of features that are transparently emulated by the hypervisor according to hardware specifications. Generally speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(), etc. What I am proposing is that, for TDX especially, instead of relying on the hypervisor to use a paravirtual channel for communicating the TSC frequency, we rely on the hardware-defined way of getting the frequency, because CPUID is emulated by the trusted entity, i.e. the OEM. Hmm, though I suppose I'm arguing against myself in that case. If the hypervisor provides the frequency and there are no trust issues, why would we care if the kernel gets the frequency via CPUID or the PV channel. It's really only TDX that matters. And we could handle TDX by overriding .calibrate_tsc() in tsc_init(), same as Secure TSC. That said, I do think it makes sense to either override the vendor and F/M/S checks native_calibrate_tsc(). Or even better drop the initial vendor check entirely, because both Intel and AMD have a rich history of implementing each other's CPUID leaves. I.e. I see no reason to ignore CPUID 0x15 just because the CPU isn't Intel. As for the Goldmost F/M/S check, that one is a virtualization specific thing. The argument is that when running as a guest, any non-TSC clocksource is going to be emulated by the hypervisor, and therefore is going to be less reliable than TSC. I.e. putting a watchdog on TSC does more harm than good, because what ends up happening is the TSC gets marked unreliable because the *watchdog* is unreliable.