On Wed, Dec 06, 2023 at 06:49:11PM +0100, Jeremi Piotrowski wrote:
> On 05/12/2023 11:54, Kirill A. Shutemov wrote:
> > On Mon, Dec 04, 2023 at 08:07:38PM +0100, Jeremi Piotrowski wrote:
> >> On 04/12/2023 10:17, Reshetova, Elena wrote:
> >>>> Check for additional CPUID bits to identify TDX guests running with Trust
> >>>> Domain (TD) partitioning enabled. TD partitioning is like nested virtualization
> >>>> inside the Trust Domain so there is a L1 TD VM(M) and there can be L2 TD VM(s).
> >>>>
> >>>> In this arrangement we are not guaranteed that the TDX_CPUID_LEAF_ID is visible
> >>>> to Linux running as an L2 TD VM. This is because a majority of TDX facilities
> >>>> are controlled by the L1 VMM and the L2 TDX guest needs to use TD partitioning
> >>>> aware mechanisms for what's left. So currently such guests do not have
> >>>> X86_FEATURE_TDX_GUEST set.
> >>>
> >>> Back to this concrete patch. Why cannot L1 VMM emulate the correct value of
> >>> the TDX_CPUID_LEAF_ID to L2 VM? It can do this per TDX partitioning arch.
> >>> How do you handle this and other CPUID calls call currently in L1? Per spec,
> >>> all CPUIDs calls from L2 will cause L2 --> L1 exit, so what do you do in L1?
> >>
> >> The disclaimer here is that I don't have access to the paravisor (L1) code. But
> >> to the best of my knowledge the L1 handles CPUID calls by calling into the TDX
> >> module, or synthesizing a response itself. TDX_CPUID_LEAF_ID is not provided to
> >> the L2 guest in order to discriminate a guest that is solely responsible for every
> >> TDX mechanism (running at L1) from one running at L2 that has to cooperate with L1.
> >> More below.
> >>
> >>>
> >>> Given that you do that simple emulation, you already end up with TDX guest
> >>> code being activated. Next you can check what features you wont be able to
> >>> provide in L1 and create simple emulation calls for the TDG calls that must be
> >>> supported and cannot return error. The biggest TDG call (TDVMCALL) is already
> >>> direct call into L0 VMM, so this part doesn’t require L1 VMM support.
> >>
> >> I don't see anything in the TD-partitioning spec that gives the TDX guest a way
> >> to detect if it's running at L2 or L1, or check whether TDVMCALLs go to L0/L1.
> >> So in any case this requires an extra cpuid call to establish the environment.
> >> Given that, exposing TDX_CPUID_LEAF_ID to the guest doesn't help.
> >>
> >> I'll give some examples of where the idea of emulating a TDX environment
> >> without attempting L1-L2 cooperation breaks down.
> >>
> >> hlt: if the guest issues a hlt TDVMCALL it goes to L0, but if it issues a classic hlt
> >> it traps to L1. The hlt should definitely go to L1 so that L1 has a chance to do
> >> housekeeping.
> >
> > Why would L2 issue HLT TDVMCALL? It only happens in response to #VE, but
> > if partitioning enabled #VEs are routed to L1 anyway.
>
> What about tdx_safe_halt? When X86_FEATURE_TDX_GUEST is defined I see
> "using TDX aware idle routing" in dmesg.

Yeah. I forgot about this one. My bad. :/

I think it makes a case for more fine-grained control over where a TDVMCALL
is routed: to L1 or to L0. I think the TDX module can do that.

BTW, what kind of housekeeping do you do in L1 for the HLT case?
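For context, the guest-side idle path in question looks roughly like the sketch
below. This is simplified and paraphrased from arch/x86/coco/tdx/tdx.c; the
helper and struct names are approximate and vary between kernel versions, so
treat it as an illustration rather than the exact code. The point is that
tdx_safe_halt() emulates HLT via a TDVMCALL, which with partitioning is
serviced by L0, so L1 never sees the vCPU going idle:

/*
 * Rough sketch of the TDX guest idle path (not exact kernel code):
 * HLT is "executed" as a TDVMCALL hypercall instead of the bare
 * instruction, so with TD partitioning it goes to the L0 VMM and
 * bypasses the L1 paravisor.
 */
static int __halt(bool irq_disabled)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = EXIT_REASON_HLT,	/* TDG.VP.VMCALL<Instruction.HLT> */
		.r12 = irq_disabled,
	};

	return __tdx_hypercall(&args);
}

void __cpuidle tdx_safe_halt(void)
{
	/* Installed as the idle routine when X86_FEATURE_TDX_GUEST is set. */
	if (__halt(false))
		WARN_ONCE(1, "HLT instruction emulation failed\n");
}

If the TDX module allowed routing this particular leaf to L1, the paravisor
would get its chance to do housekeeping without the guest having to care.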
> >> map gpa: say the guest uses MAP_GPA TDVMCALL. This goes to L0, not L1 which is the actual
> >> entity that needs to have a say in performing the conversion. L1 can't act on the request
> >> if L0 would forward it because of the CoCo threat model. So L1 and L2 get out of sync.
> >> The only safe approach is for L2 to use a different mechanism to trap to L1 explicitly.
> >
> > Hm? L1 is always in loop on share<->private conversion. I don't know why
> > you need MAP_GPA for that.
> >
> > You can't rely on MAP_GPA anyway. It is optional (unfortunately). Conversion
> > doesn't require MAP_GPA call.
> >
>
> I'm sorry, I don't quite follow. I'm reading tdx_enc_status_changed():
> - TDVMCALL_MAP_GPA is issued for all transitions
> - TDX_ACCEPT_PAGE is issued for shared->private transitions

I am talking about the TDX architecture. It doesn't require a MAP_GPA call.
Just setting the shared bit and touching the page will do the conversion.
MAP_GPA is "being nice" on the guest's behalf.

Linux does MAP_GPA all the time. Or tries to. I had a bug where I converted
a page by mistake this way. It was a pain to debug.

My point is that if you *must* catch all conversions in L1, MAP_GPA is not
a reliable way to do it.

> This doesn't work in partitioning when TDVMCALLs go to L0: TDVMCALL_MAP_GPA bypasses
> L1 and TDX_ACCEPT_PAGE is L1 responsibility.
>
> If you want to see how this is currently supported take a look at arch/x86/hyperv/ivm.c.
> All memory starts as private and there is a hypercall to notify the paravisor for both
> TDX (when partitioning) and SNP (when VMPL). This guarantees that all page conversions
> go through L1.

But L1 is in the loop during page conversion anyway, and it has to manage the
aliases with TDG.MEM.PAGE.ATTR.RD/WR. Why do you need MAP_GPA for that?

> >> Having a paravisor is required to support a TPM and having TDVMCALLs go to L0 is
> >> required to make performance viable for real workloads.
> >>
> >>>
> >>> Until we really see what breaks with this approach, I don’t think it is worth to
> >>> take in the complexity to support different L1 hypervisors view on partitioning.
> >>>
> >>
> >> I'm not asking to support different L1 hypervisors view on partitioning, I want to
> >> clean up the code (by fixing assumptions that no longer hold) for the model that I'm
> >> describing that: the kernel already supports, has an implementation that works and
> >> has actual users. This is also a model that Intel intentionally created the TD-partitioning
> >> spec to support.
> >>
> >> So lets work together to make X86_FEATURE_TDX_GUEST match reality.
> >
> > I think the right direction is to make TDX architecture good enough
> > without that. If we need more hooks in TDX module that give required
> > control to L1, let's do that. (I don't see it so far)
> >
>
> I'm not the right person to propose changes to the TDX module, I barely know anything about
> TDX. The team that develops the paravisor collaborates with Intel on it and was also consulted
> in TD-partitioning design.

One possible change I mentioned above: make TDVMCALL exit to L1 for some
TDVMCALL leafs (or something along those lines).

I would like to keep it transparent for an enlightened TDX Linux guest. It
should not care whether it runs as L1 or as L2 in your environment.

> I'm also not sure what kind of changes you envision. Everything is supported by the
> kernel already and the paravisor ABI is meant to stay vendor independent.
>
> What I'm trying to accomplish is better integration with the non-partitioning side of TDX
> so that users don't see "Memory Encryption Features active: AMD SEV" when running on Intel
> TDX with a paravisor.

This part is cosmetic and doesn't make much difference.

--
 Kiryl Shutsemau / Kirill A. Shutemov