On Wed, Mar 05, 2025 at 05:20:52PM -0500, Gregory Price wrote:
> ====
> SRAT
> ====
> The System/Static Resource Affinity Table describes resource (CPU,
> Memory) affinity to "Proximity Domains". This table is technically
> optional, but for performance information (see "HMAT") to be
> enumerated by linux it must be present.
>
>
> # Proximity Domain
> A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
> 1-to-1 mapping is not guaranteed. There are scenarios where "Proximity
> Domain 4" may map to "NUMA Node 3", for example. (See "NUMA Node
> Creation")
>
> # Memory Affinity
> Generally speaking, if a host does any amount of CXL fabric (decoder)
> programming in BIOS - an SRAT entry for that memory needs to be
> present.
>
> ```
> Subtable Type                : 01 [Memory Affinity]
> Length                       : 28
> Proximity Domain             : 00000001   <- NUMA Node 1
> Reserved1                    : 0000
> Base Address                 : 000000C050000000  <- Physical Memory Region
> Address Length               : 0000003CA0000000
> Reserved2                    : 00000000
> Flags (decoded below)        : 0000000B
> Enabled                      : 1
> Hot Pluggable                : 1
> Non-Volatile                 : 0
> ```
>
> # Generic Initiator / Port
> In the scenario where CXL devices are not present or configured by
> BIOS, we may still want to generate proximity domain configurations
> for those devices. The Generic Initiator interfaces are intended to
> fill this gap, so that performance information can still be utilized
> when the devices become available at runtime.
>
> I won't cover the details here, for now, but I will link to the
> proposal from Dan Williams and Jonathan Cameron if you would like
> more information.
> https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@xxxxxxxxx/
>
> ====
> HMAT
> ====
> The Heterogeneous Memory Attributes Table contains information such as
> cache attributes and bandwidth and latency details for memory
> proximity domains. For the purpose of this document, we will only
> discuss the SLLBI entry.
>
> # SLLBI
> The System Locality Latency and Bandwidth Information records latency
> and bandwidth information for proximity domains. This table is used by
> Linux to configure interleave weights and memory tiers.
>
> ```
> Heavily truncated for brevity
> Structure Type               : 0001 [SLLBI]
> Data Type                    : 00    <- Latency
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry                        : 0080  <- DRAM LTC
> Entry                        : 0100  <- CXL LTC
>
> Structure Type               : 0001 [SLLBI]
> Data Type                    : 03    <- Bandwidth
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry                        : 1200  <- DRAM BW
> Entry                        : 0200  <- CXL BW
> ```
>
>
> ---------------------------------
> Part 00: Linux Resource Creation.
> ---------------------------------
>
> ==================
> NUMA node creation
> ==================
> NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> identified at `__init` time, more specifically during `mm_init`.
>
> What this means is that the CEDT and SRAT must contain sufficient
> `proximity domain` information for linux to identify how many NUMA
> nodes are required (and what memory regions to associate with them).
>

Hi, Gregory.

Recently, I found a corner case in CXL NUMA node creation.

Condition:
1) A UMA/NUMA system where the SRAT is absent, but the CEDT.CFMWS is
   present
2) CONFIG_ACPI_NUMA is enabled

Results:
1) In acpi_numa_init(), fake_pxm ends up as 0 and is passed to
   acpi_parse_cfmws()
2) If a CXL ram region is then created dynamically, the CXL memory is
   assigned to node0 rather than to a new fake node

Confusions:
1) Does CXL memory usage require a NUMA system with an SRAT? As you
   mentioned in the SRAT section: "This table is technically optional,
   but for performance information to be enumerated by linux it must
   be present." Hence, as I understand it, this seems to be a kernel
   bug.
2) If it is a bug, could we forbid this situation by adding a fake_pxm
   check and returning an error in acpi_numa_init()? (rough sketch
   below)
3) If not, maybe we could add some kernel logic to allow creating
   these fake nodes on a system without an SRAT?
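To make (2) concrete, the check I have in mind is roughly the
following -- an untested sketch against the tail of acpi_numa_init()
in drivers/acpi/numa/srat.c, where the warning text and the -EINVAL
return are just placeholders for discussion:

```
	/* fake_pxm is the next unused PXM value after SRAT parsing */
	for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
		if (node_to_pxm_map[i] > fake_pxm)
			fake_pxm = node_to_pxm_map[i];
	}
	fake_pxm++;

	/*
	 * Proposed check: if SRAT parsing found no proximity domains,
	 * every node_to_pxm_map[] entry is still PXM_INVAL (-1), so
	 * fake_pxm starts at 0 and acpi_parse_cfmws() would silently
	 * alias every CFMWS window onto node 0. Bail out instead.
	 */
	if (!fake_pxm) {
		pr_warn("SRAT absent; not creating fake nodes for CFMWS\n");
		return -EINVAL;
	}

	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
			      &fake_pxm);
```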
Yuquan

> The relevant code exists in: linux/drivers/acpi/numa/srat.c
> ```
> static int __init
> acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> 			   const unsigned long table_end)
> {
> 	... heavily truncated for brevity
> 	pxm = ma->proximity_domain;
> 	node = acpi_map_pxm_to_node(pxm);
> 	if (numa_add_memblk(node, start, end) < 0)
> 		....
> 	node_set(node, numa_nodes_parsed);  <--- mark node N_POSSIBLE
> }
>
> static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> 				   void *arg, const unsigned long table_end)
> {
> 	... heavily truncated for brevity
> 	/*
> 	 * The SRAT may have already described NUMA details for all,
> 	 * or a portion of, this CFMWS HPA range. Extend the memblks
> 	 * found for any portion of the window to cover the entire
> 	 * window.
> 	 */
> 	if (!numa_fill_memblks(start, end))
> 		return 0;
>
> 	/* No SRAT description. Create a new node. */
> 	node = acpi_map_pxm_to_node(*fake_pxm);
> 	if (numa_add_memblk(node, start, end) < 0)
> 		....
> 	node_set(node, numa_nodes_parsed);  <--- mark node N_POSSIBLE
> }
>
> int __init acpi_numa_init(void)
> {
> 	...
> 	if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
> 		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> 					    acpi_parse_memory_affinity, 0);
> 	}
> 	/* fake_pxm is the next unused PXM value after SRAT parsing */
> 	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> 			      &fake_pxm);
> }
> ```
>
> Basically, the heuristic is as follows:
> 1) Add one NUMA node per Proximity Domain described in SRAT
> 2) If the SRAT describes all memory described by all CFMWS
>    - do not create nodes for CFMWS
> 3) If SRAT does not describe all memory described by CFMWS
>    - create a node for that CFMWS
>
> Generally speaking, you will see one NUMA node per Host bridge, unless
> inter-host-bridge interleave is in use (see Section 4 - Interleave).
>
>
> ============
> Memory Tiers
> ============
> The `abstract distance` of a node dictates what tier it lands in (and
> therefore, what tiers are created). This is calculated based on the
> following heuristic, using HMAT data:
>
> ```
> int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
> {
> 	...
> 	/*
> 	 * The abstract distance of a memory node is in direct proportion to
> 	 * its memory latency (read + write) and inversely proportional to its
> 	 * memory bandwidth (read + write). The abstract distance, memory
> 	 * latency, and memory bandwidth of the default DRAM nodes are used as
> 	 * the base.
> 	 */
> 	*adist = MEMTIER_ADISTANCE_DRAM *
> 		(perf->read_latency + perf->write_latency) /
> 		(default_dram_perf.read_latency + default_dram_perf.write_latency) *
> 		(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
> 		(perf->read_bandwidth + perf->write_bandwidth);
> 	return 0;
> }
> ```
>
> Debugging hint: If you have DRAM and CXL memory in separate NUMA nodes
> but only find 1 memory tier, validate the HMAT!
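As an aside, plugging the SLLBI entries quoted in the HMAT section
into this formula shows why DRAM and CXL land in different tiers in
the example. Below is a toy userspace mirror of the math -- assuming
read == write for both latency and bandwidth, and assuming
MEMTIER_ADISTANCE_DRAM is 512 (4 * MEMTIER_CHUNK_SIZE in the kernels
I checked):

```
#include <stdio.h>

#define MEMTIER_ADISTANCE_DRAM	512	/* assumed: 4 * MEMTIER_CHUNK_SIZE */

int main(void)
{
	/* SLLBI entries from the example HMAT above */
	int dram_lat = 0x0080, cxl_lat = 0x0100;	/* latency */
	int dram_bw  = 0x1200, cxl_bw  = 0x0200;	/* bandwidth */

	/* Same shape as mt_perf_to_adistance(), read == write assumed */
	int adist = MEMTIER_ADISTANCE_DRAM *
		(cxl_lat + cxl_lat) / (dram_lat + dram_lat) *
		(dram_bw + dram_bw) / (cxl_bw + cxl_bw);

	/*
	 * 512 * 2 * 9 = 9216, well above the DRAM base of 512, so the
	 * CXL node is placed in a lower (slower) memory tier.
	 */
	printf("cxl adist = %d\n", adist);
	return 0;
}
```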
>
>
> ============================
> Memory Tier Demotion Targets
> ============================
> When `demotion` is enabled (see Section 5 - Allocation), the reclaim
> system may opportunistically demote a page from one memory tier to
> another. The selection of a `demotion target` is partially based on
> Abstract Distance and Performance Data.
>
> ```
> An example of demotion targets from memory-tiers.c
> /* Example 1:
>  *
>  * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
>  *
>  * node distances:
>  * node    0    1    2    3
>  *    0   10   20   30   40
>  *    1   20   10   40   30
>  *    2   30   40   10   40
>  *    3   40   30   40   10
>  *
>  * memory_tiers0 = 0-1
>  * memory_tiers1 = 2-3
>  *
>  * node_demotion[0].preferred = 2
>  * node_demotion[1].preferred = 3
>  * node_demotion[2].preferred = <empty>
>  * node_demotion[3].preferred = <empty>
>  */
> ```
>
> =============================
> Mempolicy Weighted Interleave
> =============================
> The `weighted interleave` functionality of `mempolicy` utilizes
> weights to distribute memory across NUMA nodes according to some set
> weight. There is a proposal to auto-configure these weights based on
> HMAT data.
>
> https://lore.kernel.org/linux-mm/20250305200506.2529583-1-joshua.hahnjy@xxxxxxxxx/T/#u
>
> See Section 4 - Interleave, for more information on weighted
> interleave.
>
>
>
> --------------
> Build Options.
> --------------
> We can add these build configurations to our complexity picture.
>
> CONFIG_NUMA      -- required for ACPI NUMA, mempolicy, and memory tiers
> CONFIG_ACPI_NUMA -- enables SRAT and CEDT parsing
> CONFIG_ACPI_HMAT -- enables HMAT parsing
>
>
> ~Gregory