Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

Kai Huang <kai.huang@xxxxxxxxx> · Fri, 29 Apr 2022 19:46:05 +1200

On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > In order to provide crypto protection to guests, the TDX module uses
> > additional metadata to record things like which guest "owns" a given
> > page of memory.  This metadata, referred as Physical Address Metadata
> > Table (PAMT), essentially serves as the 'struct page' for the TDX
> > module.  PAMTs are not reserved by hardware upfront.  They must be
> > allocated by the kernel and then given to the TDX module.
> > 
> > TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> > (TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.
> 
> s/respectively//
> 

Will remove.

> > Each PAMT must be a physically contiguous area from the Convertible
> 
> 							^ s/the/a/

OK.

> 
> > Memory Regions (CMR).  However, the PAMTs which track pages in one TDMR
> > do not need to reside within that TDMR but can be anywhere in CMRs.
> > If one PAMT overlaps with any TDMR, the overlapping part must be
> > reported as a reserved area in that particular TDMR.
> > 
> > Use alloc_contig_pages() since PAMT must be a physically contiguous area
> > and it may be potentially large (~1/256th of the size of the given TDMR).
> 
> This is also a good place to note the downsides of using
> alloc_contig_pages().

For instance:

	The allocation may fail when memory usage is under pressure.

?

> 
> > The current version of TDX supports at most 16 reserved areas per TDMR
> > to cover both PAMTs and potential memory holes within the TDMR.  If many
> > PAMTs are allocated within a single TDMR, 16 reserved areas may not be
> > sufficient to cover all of them.
> > 
> > Adopt the following policies when allocating PAMTs for a given TDMR:
> > 
> >   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
> >     the total number of reserved areas consumed for PAMTs.
> >   - Try to first allocate PAMT from the local node of the TDMR for better
> >     NUMA locality.
> > 
> > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
> > ---
> >  arch/x86/Kconfig            |   1 +
> >  arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 166 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 7414625b938f..ff68d0829bd7 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
> >  	depends on CPU_SUP_INTEL
> >  	depends on X86_64
> >  	select NUMA_KEEP_MEMINFO if NUMA
> > +	depends on CONTIG_ALLOC
> >  	help
> >  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> >  	  host and certain physical attacks.  This option enables necessary TDX
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 82534e70df96..1b807dcbc101 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -21,6 +21,7 @@
> >  #include <asm/cpufeatures.h>
> >  #include <asm/virtext.h>
> >  #include <asm/e820/api.h>
> > +#include <asm/pgtable.h>
> >  #include <asm/tdx.h>
> >  #include "tdx.h"
> >  
> > @@ -66,6 +67,16 @@
> >  #define TDMR_START(_tdmr)	((_tdmr)->base)
> >  #define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
> >  
> > +/* Page sizes supported by TDX */
> > +enum tdx_page_sz {
> > +	TDX_PG_4K = 0,
> > +	TDX_PG_2M,
> > +	TDX_PG_1G,
> > +	TDX_PG_MAX,
> > +};
> 
> Is that =0 required?  I thought the first enum was defined to be 0.

No it's not required.  Will remove.

> 
> > +#define TDX_HPAGE_SHIFT	9
> > +
> >  /*
> >   * TDX module status during initialization
> >   */
> > @@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> >  	return ret;
> >  }
> >  
> > +/* Calculate PAMT size given a TDMR and a page size */
> > +static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> > +					enum tdx_page_sz pgsz)
> > +{
> > +	unsigned long pamt_sz;
> > +
> > +	pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
> > +		tdx_sysinfo.pamt_entry_size;
> 
> That 'pgsz' thing is just hideous.  I'd *much* rather see something like
> this:
> 
> static int tdx_page_size_shift(enum tdx_page_sz page_sz)
> {
> 	switch (page_sz) {
> 	case TDX_PG_4K:
> 		return PAGE_SIZE;
> 	...
> 	}
> }
> 
> That's easy to figure out what's going on.

OK. Will do.

> 
> > +	/* PAMT size must be 4K aligned */
> > +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> > +
> > +	return pamt_sz;
> > +}
> > +
> > +/* Calculate the size of all PAMTs for a TDMR */
> > +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
> > +{
> > +	enum tdx_page_sz pgsz;
> > +	unsigned long pamt_sz;
> > +
> > +	pamt_sz = 0;
> > +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
> > +		pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
> > +
> > +	return pamt_sz;
> > +}
> 
> But, there are 3 separate pointers pointing to 3 separate PAMTs.  Why do
> they all have to be contiguously allocated?

It is also explained in the changelog (the last two paragraphs).

> 
> > +/*
> > + * Locate the NUMA node containing the start of the given TDMR's first
> > + * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
> > + */
> 
> Please add a sentence or two on the implications here of what this means
> when it happens.  Also, the joining of e820 regions seems like it might
> span NUMA nodes.  What prevents that code from just creating one large
> e820 area that leads to one large TDMR and horrible NUMA affinity for
> these structures?

How about adding:

	When TDMR is created, it stops spanning at NUAM boundary.

> 
> > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > +{
> > +	u64 start, end;
> > +	int i;
> > +
> > +	/* Find the first RAM entry covered by the TDMR */
> > +	e820_for_each_mem(i, start, end)
> > +		if (end > TDMR_START(tdmr))
> > +			break;
> 
> Brackets around the big loop, please.

OK.

> 
> > +	/*
> > +	 * One TDMR must cover at least one (or partial) RAM entry,
> > +	 * otherwise it is kernel bug.  WARN_ON() in this case.
> > +	 */
> > +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > +		return 0;
> > +
> > +	/*
> > +	 * The first RAM entry may be partially covered by the previous
> > +	 * TDMR.  In this case, use TDMR's start to find the NUMA node.
> > +	 */
> > +	if (start < TDMR_START(tdmr))
> > +		start = TDMR_START(tdmr);
> > +
> > +	return phys_to_target_node(start);
> > +}
> > +
> > +static int tdmr_setup_pamt(struct tdmr_info *tdmr)
> > +{
> > +	unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
> > +	unsigned long pamt_sz[TDX_PG_MAX];
> > +	unsigned long pamt_npages;
> > +	struct page *pamt;
> > +	enum tdx_page_sz pgsz;
> > +	int nid;
> 
> Sooooooooooooooooooo close to reverse Christmas tree, but no cigar.
> Please fix it.

Will fix.  Thanks.

> 
> > +	/*
> > +	 * Allocate one chunk of physically contiguous memory for all
> > +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> > +	 * in overlapped TDMRs.
> > +	 */
> 
> Ahh, this explains it.  Considering that tdmr_get_pamt_sz() is really
> just two lines of code, I'd probably just the helper and open-code it
> here.  Then you only have one place to comment on it.

It has a loop and internally calls __tdmr_get_pamt_sz().  It looks doesn't fit
if we open-code it here.

How about move this comment to tdmr_get_pamt_sz()?

> 
> > +	nid = tdmr_get_nid(tdmr);
> > +	pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
> > +	pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
> > +			&node_online_map);
> > +	if (!pamt)
> > +		return -ENOMEM;
> > +
> > +	/* Calculate PAMT base and size for all supported page sizes. */
> > +	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> > +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> > +		unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
> > +
> > +		pamt_base[pgsz] = tdmr_pamt_base;
> > +		pamt_sz[pgsz] = sz;
> > +
> > +		tdmr_pamt_base += sz;
> > +	}
> > +
> > +	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> > +	tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
> > +	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> > +	tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
> > +	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> > +	tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];
> 
> This would all vertically align nicely if you renamed pamt_sz -> pamt_size.

OK.

> 
> > +	return 0;
> > +}
> > +
> > +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> > +{
> > +	unsigned long pamt_pfn, pamt_sz;
> > +
> > +	pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;
> 
> Comment, please:
> 
> 	/*
> 	 * The PAMT was allocated in one contiguous unit.  The 4k PAMT
> 	 * should always point to the beginning of that allocation.
> 	 */

Thanks will add.

> 
> > +	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> > +
> > +	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> > +	if (!pamt_sz)
> > +		return;
> > +
> > +	if (WARN_ON(!pamt_pfn))
> > +		return;
> > +
> > +	free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
> > +}
> > +
> > +static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < tdmr_num; i++)
> > +		tdmr_free_pamt(tdmr_array[i]);
> > +}
> > +
> > +/* Allocate and set up PAMTs for all TDMRs */
> > +static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
> 
> 	"set_up", please, not "setup".

OK.

> 
> > +{
> > +	int i, ret;
> > +
> > +	for (i = 0; i < tdmr_num; i++) {
> > +		ret = tdmr_setup_pamt(tdmr_array[i]);
> > +		if (ret)
> > +			goto err;
> > +	}
> > +
> > +	return 0;
> > +err:
> > +	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> > +	return -ENOMEM;
> > +}
> > +
> >  static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> >  {
> >  	int ret;
> > @@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> >  	if (ret)
> >  		goto err;
> >  
> > +	ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
> > +	if (ret)
> > +		goto err_free_tdmrs;
> > +
> >  	/* Return -EFAULT until constructing TDMRs is done */
> >  	ret = -EFAULT;
> > +	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
> > +err_free_tdmrs:
> >  	free_tdmrs(tdmr_array, *tdmr_num);
> >  err:
> >  	return ret;
> > @@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
> >  	 * initialization are done.
> >  	 */
> >  	ret = -EFAULT;
> > +	/*
> > +	 * Free PAMTs allocated in construct_tdmrs() when TDX module
> > +	 * initialization fails.
> > +	 */
> > +	if (ret)
> > +		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> >  out_free_tdmrs:
> >  	/*
> >  	 * TDMRs are only used during initializing TDX module.  Always
> 
> In a follow-on patch, I'd like this to dump out (in a pr_debug() or
> pr_info()) how much memory is consumed by PAMT allocations.

OK.