Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits

"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> · Thu, 19 Dec 2024 01:56:23 +0000

On Tue, 2024-12-17 at 16:08 -0800, Sean Christopherson wrote:
> On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> > It seems like an anti-pattern to have KVM maintaining any code to defend against
> > TDX module changes that could instead be handled with a promise. 
> 
> I disagree, sanity checking hardware and firmware is a good thing.  E.g. see KVM's
> VMCS checks, the sanity checks for features SEV depends on, etc.

Ok. More softly, KVM should not do something overly complicated if it could
instead be handled by TDX module adopting reasonable policies. Sanity checks are
supposed to be easy. The reason I'm probing further is it feeds into how we
position things with the TDX module team.

> 
> That said, I'm not terribly concerned about more features that are unconditionally
> exposed to the guest, because that will cause problems for other reasons, i.e.
> Intel should already be heavily incentivized to not do silly things.

Agree. That was sort of the expectation until these two bits came up in the TD
enter/exit discussion.

> 
> > However, KVM having code to defend against userspace prodding the TDX module
> > to do something bad to the host seems valid. So fixed bit issues should be
> > handled with a promise, but issues related to new configurable bits seems
> > open.
> > 
> > Some options discussed on the call:
> > 
> > 1. If we got a promise to require any new CPUID bits that clobber host state to
> > require an opt-in (attributes bit, etc) then we could get by with a promise for
> > that too. The current situation was basically to assume TDX module wouldn't open
> > up the issue with new CPUID bits (only attributes/xfam).
> > 2. If we required any new configurable CPUID bits to save/restore host state
> > automatically then we could also get by, but then KVM's code that does host
> > save/restore would either be redundant or need a TDX branch.
> > 3. If we prevent setting any CPUID bits not supported by KVM, we would need to
> > track these bits in KVM. The data backing GET_SUPPORTED_CPUID is not sufficient
> > for this purpose since it is actually more like "default values" then a mask of
> > supported bits. A patch to try to do this filtering was dropped after upstream
> > discussion.[0]
> 
> The only CPUID bits that truly matter are those that are associated with hardware
> features the TDX module allows the guest to use directly.  And for those, KVM
> *must* know if they are fixed0 (inverted polarity only?), fixed1, or configurable.
> As Adrian asserted, there probably aren't many of them.

I don't follow why the fixed bits are special here. If any configuration can
affect the host, KVM needs to know about it. Whether it is fixed or
configurable.

I wonder if there could be some confusion about how much KVM can trust that its
view of the CPUID bits is the same as the TDX Modules? In the current patches
userspace is responsible for assembling KVM's CPUID data which it sets with
KVM_SET_CPUID2 like normal. It fetches all the set bits from the TDX module,
massages them, and pass them to KVM. So if a host affecting configurable bit is
set in the TDX module, but not in KVM then it could be a problem. I think we
need to reassess which bits could affect host state, and make sure we re-check
them before entering the TD. But we can't simply check that all bits match
because there are some bits that are set in KVM, but not TDX module (real PV
leafs, guestmaxpa, etc).

So that is how I arrived at that we need some list of host affecting bits to
verify match in the TD.

> 
> For all other CPUID bits, what the TDX Module thinks and/or presents to the guest
> is completely irrelevant, at least as far as KVM cares, and to some extent as far
> as QEMU cares.  This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly
> is asinine and should be ignored.  IMO, the TDX Module spec is entirely off the
> mark in its assessment of paravirtualization.  Injecting a #VE instead of a #GP
> isn't "paravirtualization".
>  
> Take TSC_DEADLINE as an example.  "Disabling" the feature from the guest's side
> simply means that WRMSR #GPs instead of #VEs.  *Nothing* has changed from KVM's
> perspective.  If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has
> no idea if the guest has opted in/out of #VE vs #GP.  And IMO, a sane guest will
> never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead
> make a direct TDVMCALL and save itself a pointless exception.
> 
>   Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly.
>   Virtualization of IA32_TSC_DEADLINE depends on the virtual value of
>   CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to
>   TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1
>   if the CPU’s native value is 1.
> 
>   If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION
>   (bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the
>   value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE. 
> 
>   • If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized
>     as non-existent. WRMSR or RDMSR attempts result in a #GP(0).
> 
>   • If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result
>     in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler.
> 
> Ditto for TME, MKTME.
> 
> FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for
> KVM or QEMU to care if it's fixed or configurable.  There's some crazy logic for
> whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so
> once again, the TDX Module's view of MCA is irrelevant when it comes to handling
> TDVMCALL for the machine check MSRs.
> 
> So I think this again purely comes to back to KVM correctness and safety.  More
> specifically, the TDX Module needs to report features that are unconditionally
> enabled or disabled and can't be emulated by KVM.  For everything else, I don't
> see any reason to care what the TDX module does.
> 
> I'm pretty sure that gives us a way forward.  If there only a handful of features
> that are unconditionally exposed to the guest, then KVM forces those features in
> cpu_caps[*].

I see. Hmm. We could use this new interface to double check the fixed bits. It
seems like a relatively cheap sanity check.

There already is an interface to get CPUID bits (fixed and dynamic). But it only
works after a TD is configured. So if we want to do extra verification or
adjustments, we could use it before entering the TD. Basically, if we delay this
logic we don't need to wait for the fixed bit interface.

What is special about the new proposed fixed bit interface is that it can run
before a TD runs (when QEMU wants to do it's checking of the users args).

> 
>   I.e. treat them kinda like XSAVES on AMD, where KVM assumes the
> guest can use XSAVES if it's supported in hardware and XSAVE is exposed to the
> guest, because AMD didn't provide an interception knob for XSAVES.
> 
> If the list is "too" long (sujbective) for KVM to hardcode, then we revisit and
> get the TDX module to provide a list.
> 
> This probably doesn't solve Xiaoyao's UX problem in QEMU, but I think it gives
> us a sane approach for KVM.
> 
> [*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@xxxxxxxxxx