On Tue, Dec 17, 2024, Rick P Edgecombe wrote: > On Mon, 2024-12-16 at 17:53 -0800, Sean Christopherson wrote: > > Every new feature that lands in hardware needs to either be "benign" or have the > > appropriate virtualization controls. KVM already has to deal with cases where > > features can effectively be used without KVM's knowledge. E.g. there are plenty > > of instruction-level virtualization holes, and SEV-ES doubled down by essentially > > forcing KVM to let the guest write XCR0 and XSS directly. > > We discussed this in the PUCK call. Argh, I had a response to Xiaoyao all typed up and didn't hit "send" earlier today. > It turns out there were two different ideas on how this fixed bit API would be > used. One is to help userspace understand which configurations are possible. For > this one, I'm not sure how helpful this proposal will be in the long run. I'll > respond on the other branch of the thread. > > The other usage people were thinking of, which I didn't realize before, was to > prevent the TDX module from setting fixed bits that might require VMMs support > (i.e. save/restoring something that could affect the host). The rest of the mail > is about this issue. > > Due to the steps involved in resolving this confusion, and that we didn't really > reach a conclusion, the discussion is hard to summarize. So instead I'll try to > re-kick it off with an idea which has bits and pieces of what people said... > > I think we can't have the TDX module setting new fixed bits that require any VMM > enabling. When we finally have settled upstream TDX support, the TDX module > needs to understand what things KVM relies on so it doesn't break them with > updates. But new fixed CPUID bits that require VMM enabling to prevent host > issues seems like the kind of thing in general that just shouldn't happen. > > As for new configurable bits that require VMM enabling. Adrian was suggesting > that the TDX module currently only has two guest CPUID bits that are problematic > for KVM today (and the next vcpu enter/exit series has a patch to forbid them). > But a re-check of this assertion is warranted. > > It seems like an anti-pattern to have KVM maintaining any code to defend against > TDX module changes that could instead be handled with a promise. I disagree, sanity checking hardware and firmware is a good thing. E.g. see KVM's VMCS checks, the sanity checks for features SEV depends on, etc. That said, I'm not terribly concerned about more features that are unconditionally exposed to the guest, because that will cause problems for other reasons, i.e. Intel should already be heavily incentivized to not do silly things. > However, KVM having code to defend against userspace prodding the TDX module > to do something bad to the host seems valid. So fixed bit issues should be > handled with a promise, but issues related to new configurable bits seems > open. > > Some options discussed on the call: > > 1. If we got a promise to require any new CPUID bits that clobber host state to > require an opt-in (attributes bit, etc) then we could get by with a promise for > that too. The current situation was basically to assume TDX module wouldn't open > up the issue with new CPUID bits (only attributes/xfam). > 2. If we required any new configurable CPUID bits to save/restore host state > automatically then we could also get by, but then KVM's code that does host > save/restore would either be redundant or need a TDX branch. > 3. If we prevent setting any CPUID bits not supported by KVM, we would need to > track these bits in KVM. The data backing GET_SUPPORTED_CPUID is not sufficient > for this purpose since it is actually more like "default values" then a mask of > supported bits. A patch to try to do this filtering was dropped after upstream > discussion.[0] The only CPUID bits that truly matter are those that are associated with hardware features the TDX module allows the guest to use directly. And for those, KVM *must* know if they are fixed0 (inverted polarity only?), fixed1, or configurable. As Adrian asserted, there probably aren't many of them. For all other CPUID bits, what the TDX Module thinks and/or presents to the guest is completely irrelevant, at least as far as KVM cares, and to some extent as far as QEMU cares. This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly is asinine and should be ignored. IMO, the TDX Module spec is entirely off the mark in its assessment of paravirtualization. Injecting a #VE instead of a #GP isn't "paravirtualization". Take TSC_DEADLINE as an example. "Disabling" the feature from the guest's side simply means that WRMSR #GPs instead of #VEs. *Nothing* has changed from KVM's perspective. If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has no idea if the guest has opted in/out of #VE vs #GP. And IMO, a sane guest will never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead make a direct TDVMCALL and save itself a pointless exception. Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly. Virtualization of IA32_TSC_DEADLINE depends on the virtual value of CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1 if the CPU’s native value is 1. If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION (bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE. • If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized as non-existent. WRMSR or RDMSR attempts result in a #GP(0). • If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler. Ditto for TME, MKTME. FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for KVM or QEMU to care if it's fixed or configurable. There's some crazy logic for whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so once again, the TDX Module's view of MCA is irrelevant when it comes to handling TDVMCALL for the machine check MSRs. So I think this again purely comes to back to KVM correctness and safety. More specifically, the TDX Module needs to report features that are unconditionally enabled or disabled and can't be emulated by KVM. For everything else, I don't see any reason to care what the TDX module does. I'm pretty sure that gives us a way forward. If there only a handful of features that are unconditionally exposed to the guest, then KVM forces those features in cpu_caps[*]. I.e. treat them kinda like XSAVES on AMD, where KVM assumes the guest can use XSAVES if it's supported in hardware and XSAVE is exposed to the guest, because AMD didn't provide an interception knob for XSAVES. If the list is "too" long (sujbective) for KVM to hardcode, then we revisit and get the TDX module to provide a list. This probably doesn't solve Xiaoyao's UX problem in QEMU, but I think it gives us a sane approach for KVM. [*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@xxxxxxxxxx > Other idea > ---------- > Previously we tried to maintain an allow list of KVM supported configurable bits > [0]. It was do-able, but not ideal. It would be smaller for KVM to protect > itself with a deny list of bits, or rather a list of bits that needs to be in > KVM_GET_SUPPORTED_CPUID, or they should not be allowed to be configured. But KVM > can't keep a list of bits that it doesn't know about. > > But the TDX module does know which bits that it supports result in host state > getting clobbered. So we could ask TDX module to expose a list of bits that have > an effect on host state. We could check those against KVM_GET_SUPPORTED_CPUID. > That check could be expected to fit better than when we tried to massage > KVM_GET_SUPPORTED_CPUID to be a mask that includes all possible configurable > bits (multi-bit fields, etc). > > In the meantime we could keep a list of all of today's host affecting bits. TDX > module would need to gate any new bits that effect host state behind a new sys- > wide opt-in that comes with the "clobber bits" metadata. Before entering a TD, > KVM would check the clobber bits in KVM's copy of CPUID against the TD's copy to > make sure everyone knows what they have to do. > > (and also this opt-in stuff would need to be run by the TDX module team of > course) > > It leaves open the possibility that there is some other bits KVM cares about > that don't have to do with clobbering host state. Not sure about it. > > [0] > https://lore.kernel.org/kvm/20240812224820.34826-26-rick.p.edgecombe@xxxxxxxxx/