On 16.06.24 г. 15:01 ч., Kai Huang wrote:
A TDX module initialization failure was reported on a Emerald Rapids
platform:
virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
virt/tdx: module initialization failed (-28)
As a step of initializing the TDX module, the kernel tells the TDX
module all the "TDX-usable memory regions" via a set of TDX architecture
defined structure "TD Memory Region" (TDMR). Each TDMR must be in 1GB
aligned and in 1GB granularity, and all "non-TDX-usable memory holes" in
a given TDMR must be marked as a "reserved area". Each TDMR only
supports a maximum number of reserved areas reported by the TDX module.
As shown above, the root cause of this failure is when the kernel tries
to construct a TDMR to cover address range [0x0, 0x80000000), there
are too many memory holes within that range and the number of memory
holes exceeds the maximum number of reserved areas.
The E820 table of that platform (see [1] below) reflects this: the
number of memory holes among e820 "usable" entries exceeds 16, which is
the maximum number of reserved areas TDX module supports in practice.
=== Fix ===
There are two options to fix this: 1) put less memory holes as "reserved
area" when constructing a TDMR; 2) reduce the TDMR's size to cover less
memory regions, thus less memory holes.
Option 1) is possible, and in fact is easier and preferable:
TDX actually has a concept of "Convertible Memory Regions" (CMRs). TDX
reports a list of CMRs that meet TDX's security requirements on memory.
TDX requires all the "TDX-usable memory regions" that the kernel passes
to the module via TDMRs, a.k.a, all the "non-reserved regions in TDMRs",
must be convertible memory.
In other words, if a memory hole is indeed CMR, then it's not mandatory
So TDX requires all TDMR to be CMR, and CMR regions are reported by the
BIOS, how did you arrive at the conclusion that if a hole is CMR there
is no point in creating a TDMR for it?
for the kernel to add it to the reserved areas. The number of consumed
reserved areas can be reduced if the kernel doesn't add those memory
holes as reserved area. Note this doesn't have security impact because
the kernel is out of TDX's TCB anyway.
This is feasible because in practice the CMRs just reflect the nature of
whether the RAM can indeed be used by TDX, thus each CMR tends to be a
large range w/o being split into small areas, e.g., in the way the e820
table does to contain a lot "ACPI *" entries. [2] below shows the CMRs
reported on the problematic platform (using the off-tree TDX code).
So for this particular module initialization failure, the memory holes
that are within [0x0, 0x80000000) are mostly indeed CMR. By not adding
them to the reserved areas, the number of consumed reserved areas for
the TDMR [0x0, 0x80000000) can be dramatically reduced.
On the other hand, although option 2) is also theoretically feasible, it
requires more complicated logic to handle around splitting TDMR into
smaller ones. E.g., today one memory region must be fully in one TDMR,
while splitting TDMR will result in each TDMR only covering part of some
memory region. And this also increases the total number of TDMRs, which
also cannot exceed a maximum value that TDX module supports.
<snip>
Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
---
arch/x86/virt/vmx/tdx/tdx.c | 149 ++++++++++++++++++++++++++++++++----
arch/x86/virt/vmx/tdx/tdx.h | 13 ++++
2 files changed, 146 insertions(+), 16 deletions(-)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ced40e3b516e..88a0c8b788b7 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -293,6 +293,10 @@ static int stbuf_read_sysmd_field(u64 field_id, void *stbuf, int offset,
return 0;
}
+/* Wrapper to read one metadata field to u8/u16/u32/u64 */
+#define stbuf_read_sysmd_single(_field_id, _pdata) \
+ stbuf_read_sysmd_field(_field_id, _pdata, 0, sizeof(typeof(*(_pdata))))
What value does adding yet another level of indirection bring here?
+
struct field_mapping {
u64 field_id;
int offset;
@@ -349,6 +353,76 @@ static int get_tdx_module_version(struct tdx_sysinfo_module_version *modver)
return stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields), modver);
}
+/* Update the @cmr_info->num_cmrs to trim tail empty CMRs */
+static void trim_empty_tail_cmrs(struct tdx_sysinfo_cmr_info *cmr_info)
+{
+ int i;
+
+ for (i = 0; i < cmr_info->num_cmrs; i++) {
+ u64 cmr_base = cmr_info->cmr_base[i];
+ u64 cmr_size = cmr_info->cmr_size[i];
+
+ if (!cmr_size) {
+ WARN_ON_ONCE(cmr_base);
+ break;
+ }
+
+ /* TDX architecture: CMR must be 4KB aligned */
+ WARN_ON_ONCE(!PAGE_ALIGNED(cmr_base) ||
+ !PAGE_ALIGNED(cmr_size));
+ }
+
+ cmr_info->num_cmrs = i;
+}
That function is somewhat weird, on the one hand its name suggests it's
doing some "optimisation" i.e removing empty cmrs, at the same time it
will simply cap the number of CMRs until it meets the first empty CMR,
what aif we have and will also WARN. In fact it could even crash the
machine if panic_on_warn is enabled, furthermore the alignement checks
suggest it's actually some sanity checking function. Furthermore if we
have:"
ORDINARY_CMR,EMPTY_CMR,ORDINARY_CMR
(Is such a scenario even possible), in this case we'll ommit also the
last ordinary cmr region?
+
+#define TD_SYSINFO_MAP_CMR_INFO(_field_id, _member) \
+ TD_SYSINFO_MAP(_field_id, struct tdx_sysinfo_cmr_info, _member)
nit: Again, no real value in introducing yet another level of
indirection in this case.
+
+static int get_tdx_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
+{
+ int i, ret;
+
+ ret = stbuf_read_sysmd_single(MD_FIELD_ID_NUM_CMRS,
+ &cmr_info->num_cmrs);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < cmr_info->num_cmrs; i++) {
+ const struct field_mapping fields[] = {
+ TD_SYSINFO_MAP_CMR_INFO(CMR_BASE0 + i, cmr_base[i]),
+ TD_SYSINFO_MAP_CMR_INFO(CMR_SIZE0 + i, cmr_size[i]),
+ };
+
+ ret = stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields),
+ cmr_info);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * The TDX module may just report the maximum number of CMRs that
+ * TDX architecturally supports as the actual number of CMRs,
+ * despite the latter is smaller. In this case all the tail
+ * CMRs will be empty. Trim them away.
+ */
+ trim_empty_tail_cmrs(cmr_info);
+
+ return 0;
+}
+
+static void print_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
+{
+ int i;
+
+ for (i = 0; i < cmr_info->num_cmrs; i++) {
+ u64 cmr_base = cmr_info->cmr_base[i];
+ u64 cmr_size = cmr_info->cmr_size[i];
+
+ pr_info("CMR[%d]: [0x%llx, 0x%llx)\n", i, cmr_base,
+ cmr_base + cmr_size);
+ }
+}
Do we really want to always print all CMR regions, won't that become way
too spammy and isn't this really useful in debug scenarios? Perhaps gate
this particular information behind a debug flag?
+
static void print_basic_sysinfo(struct tdx_sysinfo *sysinfo)
{
struct tdx_sysinfo_module_version *modver = &sysinfo->module_version;
<snip>