Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock

Borislav Petkov <bp@xxxxxxxxx> · Tue, 21 Jan 2025 12:32:04 +0100

On Fri, Jan 17, 2025 at 12:59:37PM -0800, Sean Christopherson wrote:
> The proposal here, and what kvmclock already does for clocksource, is to use the
> raw TSC when the hypervisor sets bits that effectively say that the massaging of
> TSC done by the PV clock isn't needed.
>
> > But we really really trust it when the guest type is SNP+STSC or TDX since
> > there the HV is out of the picture and the only one who can flub it there is
> > the OEM.
> 
> Yep.  This one _is_ about trust.  Specifically, we trust the raw TSC more than
> any clock that is provided by the HV.

Ok, makes sense.

> It's not KVM guest code though.  The CPUID stuff is Intel's architecturally
> defined behavior.  There are oodles and oodles of features that are transparently
> emulated by the hypervisor according to hardware specifications.  Generally
> speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(),
> etc.

Uuuh, this is calling for confusion.

native_* has always been stuff the kernel calls when it runs, well, natively,
on baremetal. And not some PV or whatever-enlightened ops.

Now, if you want to emulate those transparently that's fine but this is what
native means in arch/x86/. Unless I've missed some memo but I doubt it. And
I asked around :)

> What I am proposing is that, for TDX especially, instead of relying on the
> hypervisor to use a paravirtual channel for communicating the TSC frequency,
> we rely on the hardware-defined way of getting the frequency, because CPUID
> is emulated by the trusted entity, i.e. the OEM.

I wouldn't trust the OEM with a single bit but that's a different story. :-P

> Hmm, though I suppose I'm arguing against myself in that case.  If the
> hypervisor provides the frequency and there are no trust issues, why would
> we care if the kernel gets the frequency via CPUID or the PV channel.  It's
> really only TDX that matters.  And we could handle TDX by overriding
> .calibrate_tsc() in tsc_init(), same as Secure TSC.

I guess it all boils down to the trust level you want to have with the TSC.
I don't know whether there's even a HV which tries to lie to the guest with
the TSC and I guess we'll find out eventually but for now we could treat the
two things similarly. The two things being:

- TDX and STSC SNP guests - full TSC trust
- HV with the required CPUID bits set - almost full, we assume HV is
  benevolent. Could change if we notice something.

> That said, I do think it makes sense to either override the vendor and F/M/S
> checks native_calibrate_tsc().  Or even better drop the initial vendor check
> entirely,

I have no clue why that thing is even there:

aa297292d708 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")

I guess back then it meant that only Intel sports those new CPUID leafs but
those things do change.

> because both Intel and AMD have a rich history of implementing each other's
> CPUID leaves.  I.e. I see no reason to ignore CPUID 0x15 just because the
> CPU isn't Intel.
> 
> As for the Goldmost F/M/S check, that one is a virtualization specific
> thing.  The argument is that when running as a guest, any non-TSC
> clocksource is going to be emulated by the hypervisor, and therefore is
> going to be less reliable than TSC.  I.e. putting a watchdog on TSC does
> more harm than good, because what ends up happening is the TSC gets marked
> unreliable because the *watchdog* is unreliable.

So I think the best thing to do is to carve out the meat of that function
which is valid for both baremetal and virtualized and then have separate
helpers which call the common thing. So that you don't have to test
on *everything* when having to change it in the future for whatever reason and
so that both camps can be relatively free when implementing their
idiosyncrasies.

And then it should fit in the current scheme with platform function ptrs.

I know you want to use native_calibrate_tsc() but then you have to sprinkle
"am I a guest" checks and this reminds me of the "if XEN" fiasco back then.
Also, a really nasty lying HV won't simply set X86_FEATURE_HYPERVISOR...

So I'd prefer if we fit a guest which runs on a relatively honest HV :) into
the current scheme as the kernel running simply on yet another platform, even
at the price of some small code duplication.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette