On Tue, 2024-06-18 at 18:10 +0300, Nikolay Borisov wrote: > > On 16.06.24 г. 15:01 ч., Kai Huang wrote: > > A TDX module initialization failure was reported on a Emerald Rapids > > platform: > > > > virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted. > > virt/tdx: module initialization failed (-28) > > > > As a step of initializing the TDX module, the kernel tells the TDX > > module all the "TDX-usable memory regions" via a set of TDX architecture > > defined structure "TD Memory Region" (TDMR). Each TDMR must be in 1GB > > aligned and in 1GB granularity, and all "non-TDX-usable memory holes" in > > a given TDMR must be marked as a "reserved area". Each TDMR only > > supports a maximum number of reserved areas reported by the TDX module. > > > > As shown above, the root cause of this failure is when the kernel tries > > to construct a TDMR to cover address range [0x0, 0x80000000), there > > are too many memory holes within that range and the number of memory > > holes exceeds the maximum number of reserved areas. > > > > The E820 table of that platform (see [1] below) reflects this: the > > number of memory holes among e820 "usable" entries exceeds 16, which is > > the maximum number of reserved areas TDX module supports in practice. > > > > === Fix === > > > > There are two options to fix this: 1) put less memory holes as "reserved > > area" when constructing a TDMR; 2) reduce the TDMR's size to cover less > > memory regions, thus less memory holes. > > > > Option 1) is possible, and in fact is easier and preferable: > > > > TDX actually has a concept of "Convertible Memory Regions" (CMRs). TDX > > reports a list of CMRs that meet TDX's security requirements on memory. > > TDX requires all the "TDX-usable memory regions" that the kernel passes > > to the module via TDMRs, a.k.a, all the "non-reserved regions in TDMRs", > > must be convertible memory. > > > > In other words, if a memory hole is indeed CMR, then it's not mandatory > > So TDX requires all TDMR to be CMR, and CMR regions are reported by the > BIOS, how did you arrive at the conclusion that if a hole is CMR there > is no point in creating a TDMR for it? TDX requires all "non-reserved areas" in one TDMR to be CMR. TDMR is 1GB aligned, and CMR is 4KB aligned. From TDX's perspective, the kernel must at least add those non-CMR holes as "reserved area". However the kernel _can_ adds more holes (in e820) as "reserved area", despite those holes are CMR physically. > > > for the kernel to add it to the reserved areas. The number of consumed > > reserved areas can be reduced if the kernel doesn't add those memory > > holes as reserved area. Note this doesn't have security impact because > > the kernel is out of TDX's TCB anyway. > > > > This is feasible because in practice the CMRs just reflect the nature of > > whether the RAM can indeed be used by TDX, thus each CMR tends to be a > > large range w/o being split into small areas, e.g., in the way the e820 > > table does to contain a lot "ACPI *" entries. [2] below shows the CMRs > > reported on the problematic platform (using the off-tree TDX code). > > > > So for this particular module initialization failure, the memory holes > > that are within [0x0, 0x80000000) are mostly indeed CMR. By not adding > > them to the reserved areas, the number of consumed reserved areas for > > the TDMR [0x0, 0x80000000) can be dramatically reduced. > > > > On the other hand, although option 2) is also theoretically feasible, it > > requires more complicated logic to handle around splitting TDMR into > > smaller ones. E.g., today one memory region must be fully in one TDMR, > > while splitting TDMR will result in each TDMR only covering part of some > > memory region. And this also increases the total number of TDMRs, which > > also cannot exceed a maximum value that TDX module supports. > > > > <snip> > > > > > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx> > > --- > > arch/x86/virt/vmx/tdx/tdx.c | 149 ++++++++++++++++++++++++++++++++---- > > arch/x86/virt/vmx/tdx/tdx.h | 13 ++++ > > 2 files changed, 146 insertions(+), 16 deletions(-) > > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > index ced40e3b516e..88a0c8b788b7 100644 > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > @@ -293,6 +293,10 @@ static int stbuf_read_sysmd_field(u64 field_id, void *stbuf, int offset, > > return 0; > > } > > > > +/* Wrapper to read one metadata field to u8/u16/u32/u64 */ > > +#define stbuf_read_sysmd_single(_field_id, _pdata) \ > > + stbuf_read_sysmd_field(_field_id, _pdata, 0, sizeof(typeof(*(_pdata)))) > > What value does adding yet another level of indirection bring here? We could use the raw version instead: read_sys_metadata_field(). This wrapper additionally checks the 'element size' encoded in the field ID matches the size that passed in, so it can catch potential kernel bug. But I can remove this to simplify the code. > > > + > > struct field_mapping { > > u64 field_id; > > int offset; > > @@ -349,6 +353,76 @@ static int get_tdx_module_version(struct tdx_sysinfo_module_version *modver) > > return stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields), modver); > > } > > > > +/* Update the @cmr_info->num_cmrs to trim tail empty CMRs */ > > +static void trim_empty_tail_cmrs(struct tdx_sysinfo_cmr_info *cmr_info) > > +{ > > + int i; > > + > > + for (i = 0; i < cmr_info->num_cmrs; i++) { > > + u64 cmr_base = cmr_info->cmr_base[i]; > > + u64 cmr_size = cmr_info->cmr_size[i]; > > + > > + if (!cmr_size) { > > + WARN_ON_ONCE(cmr_base); > > + break; > > + } > > + > > + /* TDX architecture: CMR must be 4KB aligned */ > > + WARN_ON_ONCE(!PAGE_ALIGNED(cmr_base) || > > + !PAGE_ALIGNED(cmr_size)); > > + } > > + > > + cmr_info->num_cmrs = i; > > +} > > That function is somewhat weird, on the one hand its name suggests it's > doing some "optimisation" i.e removing empty cmrs, at the same time it > will simply cap the number of CMRs until it meets the first empty CMR, (sorry I found it's hard to read the text above, but I am trying to reply) I wouldn't say it is "optimization". We just want to exclude the empty CMRs so that we don't need to worry about this case when we need to loop over all CMRs. > what aif we have and will also WARN. In fact it could even crash the > machine if panic_on_warn is enabled, > I don't follow why we "will" WARN()? If the BIOS is sane, then this code will never WARN(). In fact, if there are some error in the reported CMRs, the TDX is likely to be broken completely, and we cannot trust the machine to be in working state. So if any WARN() actually happens, to me it's OK to panic the kernel. > furthermore the alignement checks > suggest it's actually some sanity checking function. Furthermore if we > have:" > > ORDINARY_CMR,EMPTY_CMR,ORDINARY_CMR > > (Is such a scenario even possible), in this case we'll ommit also the > last ordinary cmr region? It cannot happen. The fact is: 1) CMR base/size are 4KB aligned. This is architectural behaviour. 2) TDX architecturally supports 32 CMRs maximumly; 3) In practice, TDX can just report the 'NUM_CMRS' metadata field as 32, but there can be empty/null CMRs following valid CMRs. 4) A empty/null CMR between valid CMRs cannot happen. So the kernel at least needs to skip/trim those empty/null tail CMRs. This can be done in a function like above, or we manually check cmr's base/size being 0 when we loop over CMRs. Having a function to trim the tail empty CMRs is way more straightforward. In terms of sanity check, the question is what kind level of sanity check we want the kernel to do. To confess, the WARN()s that were added in this patch is basically because I found them are straightforward to add, because I don't want to add a lot of sanity check just to guard something cannot happen in practice. How about we just remove those WARN()s? > > > + > > +#define TD_SYSINFO_MAP_CMR_INFO(_field_id, _member) \ > > + TD_SYSINFO_MAP(_field_id, struct tdx_sysinfo_cmr_info, _member) > > nit: Again, no real value in introducing yet another level of > indirection in this case. See above. > > > + > > +static int get_tdx_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info) > > +{ > > + int i, ret; > > + > > + ret = stbuf_read_sysmd_single(MD_FIELD_ID_NUM_CMRS, > > + &cmr_info->num_cmrs); > > + if (ret) > > + return ret; > > + > > + for (i = 0; i < cmr_info->num_cmrs; i++) { > > + const struct field_mapping fields[] = { > > + TD_SYSINFO_MAP_CMR_INFO(CMR_BASE0 + i, cmr_base[i]), > > + TD_SYSINFO_MAP_CMR_INFO(CMR_SIZE0 + i, cmr_size[i]), > > + }; > > + > > + ret = stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields), > > + cmr_info); > > + if (ret) > > + return ret; > > + } > > + > > + /* > > + * The TDX module may just report the maximum number of CMRs that > > + * TDX architecturally supports as the actual number of CMRs, > > + * despite the latter is smaller. In this case all the tail > > + * CMRs will be empty. Trim them away. > > + */ > > + trim_empty_tail_cmrs(cmr_info); > > + > > + return 0; > > +} > > + > > +static void print_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info) > > +{ > > + int i; > > + > > + for (i = 0; i < cmr_info->num_cmrs; i++) { > > + u64 cmr_base = cmr_info->cmr_base[i]; > > + u64 cmr_size = cmr_info->cmr_size[i]; > > + > > + pr_info("CMR[%d]: [0x%llx, 0x%llx)\n", i, cmr_base, > > + cmr_base + cmr_size); > > + } > > +} > > Do we really want to always print all CMR regions, won't that become way > too spammy and isn't this really useful in debug scenarios? Perhaps gate > this particular information behind a debug flag? It is helpful when something goes wrong, and that could happen for non- debug kernels. I think we should just print it like e820 table, for which the kernel uses pr_info().