Re: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 26.06.23 16:12, Kai Huang wrote:
The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module during module initialization.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TDX guest early during system boot to get
those PAMTs allocated at early time, but the only way to fix is to add a
boot option to allocate or reserve PAMTs during kernel boot.

It is imperfect but will be improved on later.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
     the total number of reserved areas consumed for PAMTs.
   - Try to first allocate PAMT from the local node of the TDMR for better
     NUMA locality.

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.  This helps answer the eternal "where did
all my memory go?" questions.

Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
Reviewed-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
Reviewed-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
---

v11 -> v12:
  - Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill)
  - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill)
  - Changed tdmr_get_pamt() to return base and size instead of base_pfn
    and npages and related code directly (Dave).
  - Simplified PAMT kb counting. (Dave)
  - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave)

v10 -> v11:
  - No update

v9 -> v10:
  - Removed code change in disable_tdx_module() as it doesn't exist
    anymore.

v8 -> v9:
  - Added TDX_PS_NR macro instead of open-coding (Dave).
  - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave).
  - Changed to print out PAMTs in "KBs" instead of "pages" (Dave).
  - Added Dave's Reviewed-by.

v7 -> v8: (Dave)
  - Changelog:
   - Added a sentence to state PAMT allocation will be improved.
   - Others suggested by Dave.
  - Moved 'nid' of 'struct tdx_memblock' to this patch.
  - Improved comments around tdmr_get_nid().
  - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
  - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
  - Changes due to using macros instead of 'enum' for TDX supported page
    sizes.

v5 -> v6:
  - Rebase due to using 'tdx_memblock' instead of memblock.
  - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
  - Improved comment around tdmr_get_nid() (Dave).
  - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
    into PAMTs for 4K/2M/1G (Dave).
  - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).

- v3 -> v5 (no feedback on v4):
  - Used memblock to get the NUMA node for given TDMR.
  - Removed tdmr_get_pamt_sz() helper but use open-code instead.
  - Changed to use 'switch .. case..' for each TDX supported page size in
    tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
  - Added printing out memory used for PAMT allocation when TDX module is
    initialized successfully.
  - Explained downside of alloc_contig_pages() in changelog.
  - Addressed other minor comments.


---
  arch/x86/Kconfig            |   1 +
  arch/x86/include/asm/tdx.h  |   1 +
  arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++-
  arch/x86/virt/vmx/tdx/tdx.h |   1 +
  4 files changed, 213 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2226d8a4c749..ad364f01de33 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
  	depends on KVM_INTEL
  	depends on X86_X2APIC
  	select ARCH_KEEP_MEMBLOCK
+	depends on CONTIG_ALLOC
  	help
  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
  	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d8226a50c58c..91416fd600cd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -24,6 +24,7 @@
  #define TDX_PS_4K	0
  #define TDX_PS_2M	1
  #define TDX_PS_1G	2
+#define TDX_PS_NR	(TDX_PS_1G + 1)
/*
   * Used to gather the output registers values of the TDCALL and SEAMCALL
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2ffc1517a93b..fd5417577f26 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -221,7 +221,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
   * overlap.
   */
  static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
-			    unsigned long end_pfn)
+			    unsigned long end_pfn, int nid)
  {
  	struct tdx_memblock *tmb;
@@ -232,6 +232,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
  	INIT_LIST_HEAD(&tmb->list);
  	tmb->start_pfn = start_pfn;
  	tmb->end_pfn = end_pfn;
+	tmb->nid = nid;
/* @tmb_list is protected by mem_hotplug_lock */
  	list_add_tail(&tmb->list, tmb_list);
@@ -259,9 +260,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
  static int build_tdx_memlist(struct list_head *tmb_list)
  {
  	unsigned long start_pfn, end_pfn;
-	int i, ret;
+	int i, nid, ret;
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
  		/*
  		 * The first 1MB is not reported as TDX convertible memory.
  		 * Although the first 1MB is always reserved and won't end up
@@ -277,7 +278,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
  		 * memblock has already guaranteed they are in address
  		 * ascending order and don't overlap.
  		 */
-		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
  		if (ret)
  			goto err;

Why did you decide to defer remembering the nid as well? I'd just move that part to the patch that adds add_tdx_memblock().

--
Cheers,

David / dhildenb




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux