On Mon, Dec 04, 2023, Dave Hansen wrote: > On 12/4/23 15:24, Huang, Kai wrote: > > On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote: > ... > > In ancient time KVM used to immediately enable VMX when it is loaded, but later > > it was changed to only enable VMX when there's active VM because of the above > > reason. > > > > See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand"). Huh, I always just assumed it was some backwards thinking about enabling VMX/SVM being "dangerous" or something. > Fine. This doesn't need to change ... until you load TDX. Once you > initialize the TDX module, no more out-of-tree VMMs for you. It's not just out-of-tree hypervisors, which IMO should be little more than an afterthought. The other more important issue is that being post-VMXON blocks INIT, i.e. VMX needs to be disabled before reboot, suspend, etc. Forcing kvm_usage_count would work, but it would essentially turn "graceful" reboots, i.e. reboots where the host isn't running VMs and thus VMX is already disabled. Having VMX be enabled so long as KVM is loaded would turn all reboots into the "oh crap, the system is rebooting, quick do VMXOFF!" variety. > That doesn't seem too insane. This is yet *ANOTHER* reason that doing > dynamic TDX module initialization is a good idea. > > >> It's not wrong to say that TDX is a KVM user. If KVm wants > >> 'kvm_usage_count' to go back to 0, it can shut down the TDX module. Then > >> there's no PAMT to worry about. > >> > >> The shutdown would be something like: > >> > >> 1. TDX module shutdown > >> 2. Deallocate/Convert PAMT > >> 3. vmxoff > >> > >> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC > >> to be missed. > > > > The limitation is once the TDX module is shutdown, it cannot be initialized > > again unless it is runtimely updated. > > > > Long-termly, if we go this design then there might be other problems when other > > kernel components are using TDX. For example, the VT-d driver will need to be > > changed to support TDX-IO, and it will need to enable TDX module much earlier > > than KVM to do some initialization. It might need to some TDX work (e.g., > > cleanup) while KVM is unloaded. I am not super familiar with TDX-IO but looks > > we might have some problem here if we go with such design. > > The burden for who does vmxon will simply need to change from KVM itself > to some common code that KVM depends on. Probably not dissimilar to > those nutty (sorry folks, just calling it as I see 'em) multi-KVM module You misspelled "amazing" ;-) > patches that are floating around. Joking aside, why shove TDX module ownership into KVM? It honestly sounds like a terrible fit, even without the whole TDX-IO mess. KVM state is largely ephemeral, in the sense that loading and unloading kvm.ko doesn't allocate/free much memory or do all that much initialization or teardown. TDX on the other hand is quite different. IIRC the PAMT is hundreds of MiB, maybe over a GiB in most expected use cases? And also IIRC, TDH.SYS.INIT is rather long running operation, blocks IRQs, NMIs, (SMIs?), etc. So rather than shove TDX ownership into KVM and force KVM to figure out how to manage the TDX module, why not do what us nutty people are suggesting and move hardware enabling and TDX-module management into a dedicated base module (bonus points if you call it vac.ko ;-) ). Alternatively, we could have a dedicated kernel module for TDX, e.g. tdx.ko, and then have tdx.ko and kvm.ko depend on vac.ko. But I think that ends up being quite gross and unnecessary, e.g. in such a setup, kvm-intel.ko ideally wouldn't take a hard dependency on tdx.ko, as auto-loading tdx.ko would defeat some of the purpose of the split, and KVM shouldn't fail to load just because TDX isn't supported. But that'd mean conditionally doing request_module("tdx") or whatever and would create other conundrums. (Oof, typing that out made me realize that KVM depends on the PSP driver if CONFIG_KVM_AMD_SEV=y, even if if the platform owner has no intention of ever using SEV/SEV-ES. IIUC, it works because sp_mod_init() just registers a driver, i.e. doesn't fail out of there's no PSP. That's kinda gross). Anyways, vac.ko provides an API to grab a reference to the TDX module, e.g. the "create a VM" API gets extended to say "create a VM of the TDX variety", and then vac.ko manages its refcounts to VMX and TDX accordingly. And KVM obviously keeps its existing behavior of getting and putting references for each VM. That way userspace gets to decide when to (un)load tdx.ko without needing to add a KVM module param or whatever to allow forcefully unloading tdx.ko (which would be bizarre and probably quite difficult to implement correctly), and unloading kvm-intel.ko wouldn't require unloading the TDX module. The end behavior might not be all that different in the short term, but it would give us more options, e.g. for this erratum, it would be quite easy for vac.ko to let usersepace choose between keeping VMX "on" (while the TDX module is loaded) and potentially having imperfect #MC messages. And out-of-tree hypervisors could even use vac.ko's exported APIs to manage hardware enabling if they so choose.