On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote: > On 4/5/22 21:49, Kai Huang wrote: > > In order to provide crypto protection to guests, the TDX module uses > > additional metadata to record things like which guest "owns" a given > > page of memory. This metadata, referred as Physical Address Metadata > > Table (PAMT), essentially serves as the 'struct page' for the TDX > > module. PAMTs are not reserved by hardware upfront. They must be > > allocated by the kernel and then given to the TDX module. > > > > TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" > > (TDMR) has 3 PAMTs to track the 3 supported page sizes respectively. > > s/respectively// > Will remove. > > Each PAMT must be a physically contiguous area from the Convertible > > ^ s/the/a/ OK. > > > Memory Regions (CMR). However, the PAMTs which track pages in one TDMR > > do not need to reside within that TDMR but can be anywhere in CMRs. > > If one PAMT overlaps with any TDMR, the overlapping part must be > > reported as a reserved area in that particular TDMR. > > > > Use alloc_contig_pages() since PAMT must be a physically contiguous area > > and it may be potentially large (~1/256th of the size of the given TDMR). > > This is also a good place to note the downsides of using > alloc_contig_pages(). For instance: The allocation may fail when memory usage is under pressure. ? > > > The current version of TDX supports at most 16 reserved areas per TDMR > > to cover both PAMTs and potential memory holes within the TDMR. If many > > PAMTs are allocated within a single TDMR, 16 reserved areas may not be > > sufficient to cover all of them. > > > > Adopt the following policies when allocating PAMTs for a given TDMR: > > > > - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize > > the total number of reserved areas consumed for PAMTs. > > - Try to first allocate PAMT from the local node of the TDMR for better > > NUMA locality. > > > > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx> > > --- > > arch/x86/Kconfig | 1 + > > arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++ > > 2 files changed, 166 insertions(+) > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > > index 7414625b938f..ff68d0829bd7 100644 > > --- a/arch/x86/Kconfig > > +++ b/arch/x86/Kconfig > > @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST > > depends on CPU_SUP_INTEL > > depends on X86_64 > > select NUMA_KEEP_MEMINFO if NUMA > > + depends on CONTIG_ALLOC > > help > > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious > > host and certain physical attacks. This option enables necessary TDX > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > index 82534e70df96..1b807dcbc101 100644 > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > @@ -21,6 +21,7 @@ > > #include <asm/cpufeatures.h> > > #include <asm/virtext.h> > > #include <asm/e820/api.h> > > +#include <asm/pgtable.h> > > #include <asm/tdx.h> > > #include "tdx.h" > > > > @@ -66,6 +67,16 @@ > > #define TDMR_START(_tdmr) ((_tdmr)->base) > > #define TDMR_END(_tdmr) ((_tdmr)->base + (_tdmr)->size) > > > > +/* Page sizes supported by TDX */ > > +enum tdx_page_sz { > > + TDX_PG_4K = 0, > > + TDX_PG_2M, > > + TDX_PG_1G, > > + TDX_PG_MAX, > > +}; > > Is that =0 required? I thought the first enum was defined to be 0. No it's not required. Will remove. > > > +#define TDX_HPAGE_SHIFT 9 > > + > > /* > > * TDX module status during initialization > > */ > > @@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num) > > return ret; > > } > > > > +/* Calculate PAMT size given a TDMR and a page size */ > > +static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr, > > + enum tdx_page_sz pgsz) > > +{ > > + unsigned long pamt_sz; > > + > > + pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) * > > + tdx_sysinfo.pamt_entry_size; > > That 'pgsz' thing is just hideous. I'd *much* rather see something like > this: > > static int tdx_page_size_shift(enum tdx_page_sz page_sz) > { > switch (page_sz) { > case TDX_PG_4K: > return PAGE_SIZE; > ... > } > } > > That's easy to figure out what's going on. OK. Will do. > > > + /* PAMT size must be 4K aligned */ > > + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE); > > + > > + return pamt_sz; > > +} > > + > > +/* Calculate the size of all PAMTs for a TDMR */ > > +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr) > > +{ > > + enum tdx_page_sz pgsz; > > + unsigned long pamt_sz; > > + > > + pamt_sz = 0; > > + for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) > > + pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz); > > + > > + return pamt_sz; > > +} > > But, there are 3 separate pointers pointing to 3 separate PAMTs. Why do > they all have to be contiguously allocated? It is also explained in the changelog (the last two paragraphs). > > > +/* > > + * Locate the NUMA node containing the start of the given TDMR's first > > + * RAM entry. The given TDMR may also cover memory in other NUMA nodes. > > + */ > > Please add a sentence or two on the implications here of what this means > when it happens. Also, the joining of e820 regions seems like it might > span NUMA nodes. What prevents that code from just creating one large > e820 area that leads to one large TDMR and horrible NUMA affinity for > these structures? How about adding: When TDMR is created, it stops spanning at NUAM boundary. > > > +static int tdmr_get_nid(struct tdmr_info *tdmr) > > +{ > > + u64 start, end; > > + int i; > > + > > + /* Find the first RAM entry covered by the TDMR */ > > + e820_for_each_mem(i, start, end) > > + if (end > TDMR_START(tdmr)) > > + break; > > Brackets around the big loop, please. OK. > > > + /* > > + * One TDMR must cover at least one (or partial) RAM entry, > > + * otherwise it is kernel bug. WARN_ON() in this case. > > + */ > > + if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr))) > > + return 0; > > + > > + /* > > + * The first RAM entry may be partially covered by the previous > > + * TDMR. In this case, use TDMR's start to find the NUMA node. > > + */ > > + if (start < TDMR_START(tdmr)) > > + start = TDMR_START(tdmr); > > + > > + return phys_to_target_node(start); > > +} > > + > > +static int tdmr_setup_pamt(struct tdmr_info *tdmr) > > +{ > > + unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX]; > > + unsigned long pamt_sz[TDX_PG_MAX]; > > + unsigned long pamt_npages; > > + struct page *pamt; > > + enum tdx_page_sz pgsz; > > + int nid; > > Sooooooooooooooooooo close to reverse Christmas tree, but no cigar. > Please fix it. Will fix. Thanks. > > > + /* > > + * Allocate one chunk of physically contiguous memory for all > > + * PAMTs. This helps minimize the PAMT's use of reserved areas > > + * in overlapped TDMRs. > > + */ > > Ahh, this explains it. Considering that tdmr_get_pamt_sz() is really > just two lines of code, I'd probably just the helper and open-code it > here. Then you only have one place to comment on it. It has a loop and internally calls __tdmr_get_pamt_sz(). It looks doesn't fit if we open-code it here. How about move this comment to tdmr_get_pamt_sz()? > > > + nid = tdmr_get_nid(tdmr); > > + pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT; > > + pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid, > > + &node_online_map); > > + if (!pamt) > > + return -ENOMEM; > > + > > + /* Calculate PAMT base and size for all supported page sizes. */ > > + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT; > > + for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) { > > + unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz); > > + > > + pamt_base[pgsz] = tdmr_pamt_base; > > + pamt_sz[pgsz] = sz; > > + > > + tdmr_pamt_base += sz; > > + } > > + > > + tdmr->pamt_4k_base = pamt_base[TDX_PG_4K]; > > + tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K]; > > + tdmr->pamt_2m_base = pamt_base[TDX_PG_2M]; > > + tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M]; > > + tdmr->pamt_1g_base = pamt_base[TDX_PG_1G]; > > + tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G]; > > This would all vertically align nicely if you renamed pamt_sz -> pamt_size. OK. > > > + return 0; > > +} > > + > > +static void tdmr_free_pamt(struct tdmr_info *tdmr) > > +{ > > + unsigned long pamt_pfn, pamt_sz; > > + > > + pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT; > > Comment, please: > > /* > * The PAMT was allocated in one contiguous unit. The 4k PAMT > * should always point to the beginning of that allocation. > */ Thanks will add. > > > + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size; > > + > > + /* Do nothing if PAMT hasn't been allocated for this TDMR */ > > + if (!pamt_sz) > > + return; > > + > > + if (WARN_ON(!pamt_pfn)) > > + return; > > + > > + free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT); > > +} > > + > > +static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num) > > +{ > > + int i; > > + > > + for (i = 0; i < tdmr_num; i++) > > + tdmr_free_pamt(tdmr_array[i]); > > +} > > + > > +/* Allocate and set up PAMTs for all TDMRs */ > > +static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num) > > "set_up", please, not "setup". OK. > > > +{ > > + int i, ret; > > + > > + for (i = 0; i < tdmr_num; i++) { > > + ret = tdmr_setup_pamt(tdmr_array[i]); > > + if (ret) > > + goto err; > > + } > > + > > + return 0; > > +err: > > + tdmrs_free_pamt_all(tdmr_array, tdmr_num); > > + return -ENOMEM; > > +} > > + > > static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num) > > { > > int ret; > > @@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num) > > if (ret) > > goto err; > > > > + ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num); > > + if (ret) > > + goto err_free_tdmrs; > > + > > /* Return -EFAULT until constructing TDMRs is done */ > > ret = -EFAULT; > > + tdmrs_free_pamt_all(tdmr_array, *tdmr_num); > > +err_free_tdmrs: > > free_tdmrs(tdmr_array, *tdmr_num); > > err: > > return ret; > > @@ -1022,6 +1181,12 @@ static int init_tdx_module(void) > > * initialization are done. > > */ > > ret = -EFAULT; > > + /* > > + * Free PAMTs allocated in construct_tdmrs() when TDX module > > + * initialization fails. > > + */ > > + if (ret) > > + tdmrs_free_pamt_all(tdmr_array, tdmr_num); > > out_free_tdmrs: > > /* > > * TDMRs are only used during initializing TDX module. Always > > In a follow-on patch, I'd like this to dump out (in a pr_debug() or > pr_info()) how much memory is consumed by PAMT allocations. OK.