On 12/19/2024 10:33 AM, Sean Christopherson wrote:
For all other CPUID bits, what the TDX Module thinks and/or presents to the guest
is completely irrelevant, at least as far as KVM cares, and to some extent as far
as QEMU cares. This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly
is asinine and should be ignored. IMO, the TDX Module spec is entirely off the
mark in its assessment of paravirtualization. Injecting a #VE instead of a #GP
isn't "paravirtualization".
Take TSC_DEADLINE as an example. "Disabling" the feature from the guest's side
simply means that WRMSR #GPs instead of #VEs.*Nothing* has changed from KVM's
perspective. If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has
no idea if the guest has opted in/out of #VE vs #GP. And IMO, a sane guest will
never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead
make a direct TDVMCALL and save itself a pointless exception.
Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly.
Virtualization of IA32_TSC_DEADLINE depends on the virtual value of
CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to
TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1
if the CPU’s native value is 1.
If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION
(bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the
value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE.
• If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized
as non-existent. WRMSR or RDMSR attempts result in a #GP(0).
• If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result
in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler.
Ditto for TME, MKTME.
FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for
KVM or QEMU to care if it's fixed or configurable. There's some crazy logic for
whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so
once again, the TDX Module's view of MCA is irrelevant when it comes to handling
TDVMCALL for the machine check MSRs.
So I think this again purely comes to back to KVM correctness and safety. More
specifically, the TDX Module needs to report features that are unconditionally
enabled or disabled and can't be emulated by KVM. For everything else, I don't
see any reason to care what the TDX module does.
I'm pretty sure that gives us a way forward. If there only a handful of features
that are unconditionally exposed to the guest, then KVM forces those features in
cpu_caps[*].
I see. Hmm. We could use this new interface to double check the fixed bits. It
seems like a relatively cheap sanity check.
There already is an interface to get CPUID bits (fixed and dynamic). But it only
works after a TD is configured. So if we want to do extra verification or
adjustments, we could use it before entering the TD. Basically, if we delay this
logic we don't need to wait for the fixed bit interface.
Oh, yeah, that'd work. Grab the guest CPUID and then verify that bits KVM needs
to be 0 (or 1) are set according, and WARN+kill if there's a mismatch.
Honestly, I'd probably prefer that over using the fixed bit interface, as my gut
says it's less likely for the TDX Module to misreport what CPUID it has created
for the guest, than it is for the TDX module to generate a "fixed bits" list that
doesn't match the code. E.g. KVM has (had?) more than a few CPUID features that
KVM emulates without actually reporting support to userspace.
The original motivation of the proposed fxied0 and fixed1 data is to
complement the CPUID configurability report, which is important for
userspace. E.g., Currently, what QEMU is doing is hardcoding the fixed0
and fixed1 information of a specific TDX release to adjust the
KVM_GET_SUPPORTED_CPUID result and gets a final "supported" CPUID set
for TDX. Hardcoding is not a good idea and it's better that KVM can get
the data from TDX module, then pass to userspace (of course KVM can
tweak the data based on its own requirement). So, do you think it's
useful to have TDX module report the fixed0/fixed1 data for this purpose?