Re: [PATCH v3 7/8] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 27/08/24 10:14, Kai Huang wrote:
> A TDX module initialization failure was reported on a Emerald Rapids
> platform:
> 
>   virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
>   virt/tdx: module initialization failed (-28)
> 
> As part of initializing the TDX module, the kernel informs the TDX
> module of all "TDX-usable memory regions" using an array of TDX defined
> structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB aligned
> and in 1GB granularity, and all "non-TDX-usable memory holes" within a
> given TDMR must be marked as "reserved areas".  The TDX module reports a
> maximum number of reserved areas that can be supported per TDMR.

The statement:

	... all "non-TDX-usable memory holes" within a
	given TDMR must be marked as "reserved areas".

is not exactly true, which is essentially the basis of this fix.

The relevant requirements are (from the spec):

  Any non-reserved 4KB page within a TDMR must be convertible
  i.e., it must be within a CMR

  Reserved areas within a TDMR need not be within a CMR.

  PAMT areas must not overlap with TDMR non-reserved areas;
  however, they may reside within TDMR reserved areas
  (as long as these are convertible).

> 
> Currently, the kernel finds those "non-TDX-usable memory holes" within a
> given TDMR by walking over a list of "TDX-usable memory regions", which
> essentially reflects the "usable" regions in the e820 table (w/o memory
> hotplug operations precisely, but this is not relevant here).

But including e820 table regions that are not "usable" in the TDMR
reserved areas is not necessary - it is not one of the rules.

What confused me initially was that I did not realize the we already
require that the TDX Module does not touch memory in the TDMR
non-reserved areas not specifically allocated to it.  So it makes no
difference to the TDX Module what the pages that have not been allocated
to it, are used for.

> 
> As shown above, the root cause of this failure is when the kernel tries
> to construct a TDMR to cover address range [0x0, 0x80000000), there
> are too many memory holes within that range and the number of memory
> holes exceeds the maximum number of reserved areas.
> 
> The E820 table of that platform (see [1] below) reflects this: the
> number of memory holes among e820 "usable" entries exceeds 16, which is
> the maximum number of reserved areas TDX module supports in practice.
> 
> === Fix ===
> 
> There are two options to fix this: 1) reduce the number of memory holes
> when constructing a TDMR to save "reserved areas"; 2) reduce the TDMR's
> size to cover fewer memory regions, thus fewer memory holes.

Probably better to try and get rid of this "two options" stuff and focus
on how this is a simple and effective fix.

> 
> Option 1) is possible, and in fact is easier and preferable:
> 
> TDX actually has a concept of "Convertible Memory Regions" (CMRs).  TDX
> reports a list of CMRs that meet TDX's security requirements on memory.
> TDX requires all the "TDX-usable memory regions" that the kernel passes
> to the module via TDMRs, a.k.a, all the "non-reserved regions in TDMRs",
> must be convertible memory.
> 
> In other words, if a memory hole is indeed CMR, then it's not mandatory
> for the kernel to add it to the reserved areas.  By doing so, the number
> of consumed reserved areas can be reduced w/o having any functional
> impact.  The kernel still allocates TDX memory from the page allocator.
> There's no harm if the kernel tells the TDX module some memory regions
> are "TDX-usable" but they will never be allocated by the kernel as TDX
> memory.
> 
> Note this doesn't have any security impact either because the kernel is
> out of TDX's TCB anyway.
> 
> This is feasible because in practice the CMRs just reflect the nature of
> whether the RAM can indeed be used by TDX, thus each CMR tends to be a
> large, uninterrupted range of memory, i.e., unlike the e820 table which
> contains numerous "ACPI *" entries in the first 2G range.  Refer to [2]
> for CMRs reported on the problematic platform using off-tree TDX code.
> 
> So for this particular module initialization failure, the memory holes
> that are within [0x0, 0x80000000) are mostly indeed CMR.  By not adding
> them to the reserved areas, the number of consumed reserved areas for
> the TDMR [0x0, 0x80000000) can be dramatically reduced.
> 
> Option 2) is also theoretically feasible, but it is not desired:
> 
> It requires more complicated logic to handle splitting TDMR into smaller
> ones, which isn't trivial.  There are limitations to splitting TDMR too,
> thus it may not always work: 1) The smallest TDMR is 1GB, and it cannot
> be split any further; 2) This also increases the total number of TDMRs,
> which also has a maximum value limited by the TDX module.
> 
> So, fix this issue by using option 1):
> 
> 1) reading out the CMRs from the TDX module global metadata, and
> 2) changing to find memory holes for a given TDMR based on CMRs, but not
>    based on the list of "TDX-usable memory regions".
> 
> Also dump the CMRs in dmesg.  They are helpful when something goes wrong
> around "constructing the TDMRs and configuring the TDX module with
> them".  Note there are no existing userspace tools that the user can get
> CMRs since they can only be read via SEAMCALL (no CPUID, MSR etc).
> 
> [1] BIOS-E820 table of the problematic platform:
> 
>   BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
>   BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
>   BIOS-e820: [mem 0x0000000000100000-0x000000005d168fff] usable
>   BIOS-e820: [mem 0x000000005d169000-0x000000005d22afff] ACPI data
>   BIOS-e820: [mem 0x000000005d22b000-0x000000005d3cefff] usable
>   BIOS-e820: [mem 0x000000005d3cf000-0x000000005d469fff] reserved
>   BIOS-e820: [mem 0x000000005d46a000-0x000000005e5b2fff] usable
>   BIOS-e820: [mem 0x000000005e5b3000-0x000000005e5c2fff] reserved
>   BIOS-e820: [mem 0x000000005e5c3000-0x000000005e5d2fff] usable
>   BIOS-e820: [mem 0x000000005e5d3000-0x000000005e5e4fff] reserved
>   BIOS-e820: [mem 0x000000005e5e5000-0x000000005eb57fff] usable
>   BIOS-e820: [mem 0x000000005eb58000-0x0000000061357fff] ACPI NVS
>   BIOS-e820: [mem 0x0000000061358000-0x000000006172afff] usable
>   BIOS-e820: [mem 0x000000006172b000-0x0000000061794fff] ACPI data
>   BIOS-e820: [mem 0x0000000061795000-0x00000000617fefff] usable
>   BIOS-e820: [mem 0x00000000617ff000-0x0000000061912fff] ACPI data
>   BIOS-e820: [mem 0x0000000061913000-0x0000000061998fff] usable
>   BIOS-e820: [mem 0x0000000061999000-0x00000000619dffff] ACPI data
>   BIOS-e820: [mem 0x00000000619e0000-0x00000000619e1fff] usable
>   BIOS-e820: [mem 0x00000000619e2000-0x00000000619e9fff] reserved
>   BIOS-e820: [mem 0x00000000619ea000-0x0000000061a26fff] usable
>   BIOS-e820: [mem 0x0000000061a27000-0x0000000061baefff] ACPI data
>   BIOS-e820: [mem 0x0000000061baf000-0x00000000623c2fff] usable
>   BIOS-e820: [mem 0x00000000623c3000-0x0000000062471fff] reserved
>   BIOS-e820: [mem 0x0000000062472000-0x0000000062823fff] usable
>   BIOS-e820: [mem 0x0000000062824000-0x0000000063a24fff] reserved
>   BIOS-e820: [mem 0x0000000063a25000-0x0000000063d57fff] usable
>   BIOS-e820: [mem 0x0000000063d58000-0x0000000064157fff] reserved
>   BIOS-e820: [mem 0x0000000064158000-0x0000000064158fff] usable
>   BIOS-e820: [mem 0x0000000064159000-0x0000000064194fff] reserved
>   BIOS-e820: [mem 0x0000000064195000-0x000000006e9cefff] usable
>   BIOS-e820: [mem 0x000000006e9cf000-0x000000006eccefff] reserved
>   BIOS-e820: [mem 0x000000006eccf000-0x000000006f6fefff] ACPI NVS
>   BIOS-e820: [mem 0x000000006f6ff000-0x000000006f7fefff] ACPI data
>   BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
>   BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
>   ......
> 
> [2] Convertible Memory Regions of the problematic platform:
> 
>   virt/tdx: CMR: [0x100000, 0x6f800000)
>   virt/tdx: CMR: [0x100000000, 0x107a000000)
>   virt/tdx: CMR: [0x1080000000, 0x207c000000)
>   virt/tdx: CMR: [0x2080000000, 0x307c000000)
>   virt/tdx: CMR: [0x3080000000, 0x407c000000)
> 
> Fixes: dde3b60d572c9 ("x86/virt/tdx: Designate reserved areas for all TDMRs")
> Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
> ---
> 
> v2 -> v3:
> 
>  - Add the Fixes tag, although this patch depends on previous patches.
>  - CMR_BASE0 -> CMR_BASE(_i), CMR_SIZE0 -> CMR_SIZE(_i) to silence the
>    build-check error.
> 
> v1 -> v2:
>  - Change to walk over CMRs directly to find out memory holes, instead
>    of walking over TDX memory blocks and explicitly check whether a hole
>    is subregion of CMR.  (Chao)
>  - Mention any constant macro definitions in global metadata structures
>    are TDX architectural. (Binbin)
>  - Slightly improve the changelog.
> 
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 105 ++++++++++++++++++++++++++++++------
>  arch/x86/virt/vmx/tdx/tdx.h |  13 +++++
>  2 files changed, 102 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 0fb673dd43ed..fa335ab1ae92 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -331,6 +331,72 @@ static int get_tdx_sys_info_version(struct tdx_sys_info_version *sysinfo_version
>  	return ret;
>  }
>  
> +/* Update the @sysinfo_cmr->num_cmrs to trim tail empty CMRs */
> +static void trim_empty_tail_cmrs(struct tdx_sys_info_cmr *sysinfo_cmr)
> +{
> +	int i;
> +
> +	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
> +		u64 cmr_base = sysinfo_cmr->cmr_base[i];
> +		u64 cmr_size = sysinfo_cmr->cmr_size[i];
> +
> +		if (!cmr_size) {
> +			WARN_ON_ONCE(cmr_base);
> +			break;
> +		}
> +
> +		/* TDX architecture: CMR must be 4KB aligned */
> +		WARN_ON_ONCE(!PAGE_ALIGNED(cmr_base) ||
> +				!PAGE_ALIGNED(cmr_size));
> +	}
> +
> +	sysinfo_cmr->num_cmrs = i;
> +}
> +
> +static int get_tdx_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
> +{
> +	int i, ret = 0;
> +
> +#define TD_SYSINFO_MAP_CMR_INFO(_field_id, _member)	\
> +	TD_SYSINFO_MAP(_field_id, sysinfo_cmr, _member)
> +
> +	TD_SYSINFO_MAP_CMR_INFO(NUM_CMRS, num_cmrs);
> +
> +	if (ret)
> +		return ret;
> +
> +	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
> +		TD_SYSINFO_MAP_CMR_INFO(CMR_BASE(i), cmr_base[i]);
> +		TD_SYSINFO_MAP_CMR_INFO(CMR_SIZE(i), cmr_size[i]);
> +	}
> +
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * The TDX module may just report the maximum number of CMRs that
> +	 * TDX architecturally supports as the actual number of CMRs,
> +	 * despite the latter is smaller.  In this case all the tail
> +	 * CMRs will be empty.  Trim them away.
> +	 */
> +	trim_empty_tail_cmrs(sysinfo_cmr);
> +
> +	return 0;
> +}
> +
> +static void print_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
> +{
> +	int i;
> +
> +	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
> +		u64 cmr_base = sysinfo_cmr->cmr_base[i];
> +		u64 cmr_size = sysinfo_cmr->cmr_size[i];
> +
> +		pr_info("CMR[%d]: [0x%llx, 0x%llx)\n", i, cmr_base,
> +				cmr_base + cmr_size);
> +	}
> +}
> +
>  static void print_basic_sys_info(struct tdx_sys_info *sysinfo)
>  {
>  	struct tdx_sys_info_features *features = &sysinfo->features;
> @@ -349,6 +415,8 @@ static void print_basic_sys_info(struct tdx_sys_info *sysinfo)
>  			version->major, version->minor,	version->update,
>  			version->internal, version->build_num,
>  			version->build_date, features->tdx_features0);
> +
> +	print_sys_info_cmr(&sysinfo->cmr);
>  }
>  
>  static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
> @@ -379,6 +447,10 @@ static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
>  	if (ret)
>  		return ret;
>  
> +	ret = get_tdx_sys_info_cmr(&sysinfo->cmr);
> +	if (ret)
> +		return ret;
> +
>  	return get_tdx_sys_info_tdmr(&sysinfo->tdmr);
>  }
>  
> @@ -803,29 +875,28 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
>  }
>  
>  /*
> - * Go through @tmb_list to find holes between memory areas.  If any of
> + * Go through all CMRs in @sysinfo_cmr to find memory holes.  If any of
>   * those holes fall within @tdmr, set up a TDMR reserved area to cover
>   * the hole.
>   */
> -static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
> +static int tdmr_populate_rsvd_holes(struct tdx_sys_info_cmr *sysinfo_cmr,
>  				    struct tdmr_info *tdmr,
>  				    int *rsvd_idx,
>  				    u16 max_reserved_per_tdmr)
>  {
> -	struct tdx_memblock *tmb;
>  	u64 prev_end;
> -	int ret;
> +	int i, ret;
>  
>  	/*
>  	 * Start looking for reserved blocks at the
>  	 * beginning of the TDMR.
>  	 */
>  	prev_end = tdmr->base;
> -	list_for_each_entry(tmb, tmb_list, list) {
> +	for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
>  		u64 start, end;
>  
> -		start = PFN_PHYS(tmb->start_pfn);
> -		end   = PFN_PHYS(tmb->end_pfn);
> +		start = sysinfo_cmr->cmr_base[i];
> +		end   = start + sysinfo_cmr->cmr_size[i];
>  
>  		/* Break if this region is after the TDMR */
>  		if (start >= tdmr_end(tdmr))
> @@ -926,16 +997,16 @@ static int rsvd_area_cmp_func(const void *a, const void *b)
>  
>  /*
>   * Populate reserved areas for the given @tdmr, including memory holes
> - * (via @tmb_list) and PAMTs (via @tdmr_list).
> + * (via @sysinfo_cmr) and PAMTs (via @tdmr_list).
>   */
>  static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
> -				    struct list_head *tmb_list,
> +				    struct tdx_sys_info_cmr *sysinfo_cmr,
>  				    struct tdmr_info_list *tdmr_list,
>  				    u16 max_reserved_per_tdmr)
>  {
>  	int ret, rsvd_idx = 0;
>  
> -	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
> +	ret = tdmr_populate_rsvd_holes(sysinfo_cmr, tdmr, &rsvd_idx,
>  			max_reserved_per_tdmr);
>  	if (ret)
>  		return ret;
> @@ -954,10 +1025,10 @@ static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
>  
>  /*
>   * Populate reserved areas for all TDMRs in @tdmr_list, including memory
> - * holes (via @tmb_list) and PAMTs.
> + * holes (via @sysinfo_cmr) and PAMTs.
>   */
>  static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
> -					 struct list_head *tmb_list,
> +					 struct tdx_sys_info_cmr *sysinfo_cmr,
>  					 u16 max_reserved_per_tdmr)
>  {
>  	int i;
> @@ -966,7 +1037,7 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
>  		int ret;
>  
>  		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
> -				tmb_list, tdmr_list, max_reserved_per_tdmr);
> +				sysinfo_cmr, tdmr_list, max_reserved_per_tdmr);
>  		if (ret)
>  			return ret;
>  	}
> @@ -981,7 +1052,8 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
>   */
>  static int construct_tdmrs(struct list_head *tmb_list,
>  			   struct tdmr_info_list *tdmr_list,
> -			   struct tdx_sys_info_tdmr *sysinfo_tdmr)
> +			   struct tdx_sys_info_tdmr *sysinfo_tdmr,
> +			   struct tdx_sys_info_cmr *sysinfo_cmr)
>  {
>  	int ret;
>  
> @@ -994,7 +1066,7 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  	if (ret)
>  		return ret;
>  
> -	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
> +	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, sysinfo_cmr,
>  			sysinfo_tdmr->max_reserved_per_tdmr);
>  	if (ret)
>  		tdmrs_free_pamt_all(tdmr_list);
> @@ -1185,7 +1257,8 @@ static int init_tdx_module(void)
>  		goto err_free_tdxmem;
>  
>  	/* Cover all TDX-usable memory regions in TDMRs */
> -	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr);
> +	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr,
> +			&sysinfo.cmr);
>  	if (ret)
>  		goto err_free_tdmrs;
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index b422e8517e01..e7bed9e717c7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -40,6 +40,10 @@
>  #define MD_FIELD_ID_UPDATE_VERSION		0x0800000100000005ULL
>  #define MD_FIELD_ID_INTERNAL_VERSION		0x0800000100000006ULL
>  
> +#define MD_FIELD_ID_NUM_CMRS			0x9000000100000000ULL
> +#define MD_FIELD_ID_CMR_BASE(_i)		(0x9000000300000080ULL + (u16)_i)
> +#define MD_FIELD_ID_CMR_SIZE(_i)		(0x9000000300000100ULL + (u16)_i)
> +
>  #define MD_FIELD_ID_MAX_TDMRS			0x9100000100000008ULL
>  #define MD_FIELD_ID_MAX_RESERVED_PER_TDMR	0x9100000100000009ULL
>  #define MD_FIELD_ID_PAMT_4K_ENTRY_SIZE		0x9100000100000010ULL
> @@ -160,6 +164,14 @@ struct tdx_sys_info_version {
>  	u32 build_date;
>  };
>  
> +/* Class "CMR Info" */
> +#define TDX_MAX_CMRS	32	/* architectural */
> +struct tdx_sys_info_cmr {
> +	u16 num_cmrs;
> +	u64 cmr_base[TDX_MAX_CMRS];
> +	u64 cmr_size[TDX_MAX_CMRS];
> +};
> +
>  /* Class "TDMR info" */
>  struct tdx_sys_info_tdmr {
>  	u16 max_tdmrs;
> @@ -170,6 +182,7 @@ struct tdx_sys_info_tdmr {
>  struct tdx_sys_info {
>  	struct tdx_sys_info_features	features;
>  	struct tdx_sys_info_version	version;
> +	struct tdx_sys_info_cmr		cmr;
>  	struct tdx_sys_info_tdmr	tdmr;
>  };
>  





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux