On Tue, Mar 11, 2025 at 08:09:02PM -0400, Gregory Price wrote:
> 
> -----------------------------
> Intra-Host-Bridge Interleave.
> -----------------------------
> Now let's consider a system where we've placed 2 CXL devices on the same
> Host Bridge. Maybe each CXL device is only capable of x8 PCIe, and we
> want to make full use of a single x16 link.
> 
> This setup only requires the BIOS to create a CEDT CFMWS which reports
> the entire capacity of all devices under the host bridge, but does not
> need to set up any interleaving.
> 
> In the following case, the BIOS has configured a single 4GB memory region
> which only targets the single host bridge, but maps the entire memory
> capacity of both devices (2GB).
> 
> ```
> CFMWS:
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000300000000 <- Memory Region
>   Window size              : 0000000080000000 <- 2GB

I think this should be "Window size : 0000000100000000 <- 4GB" here,
to match the 4GB range programmed into the decoders below.

>   Interleave Members (2^n) : 00               <- No host bridge interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge _UID
> ```
> 
> Assuming no other CEDT or SRAT entries exist, this will result in Linux
> creating the following NUMA topology, where all CXL memory is in Node 1.
> 
> ```
> NUMA Structure:
> 
>   ---------     --------   |   ----------
>   | cpu0  |-----| DRAM |---|---| Node 0 |
>   ---------     --------   |   ----------
>       |                    |
>   -------                  |   ----------
>   | HB0 |------------------|---| Node 1 |
>   -------                  |   ----------
>    /   \                   |
>  CXL Dev  CXL Dev          |
> ```
> 
> In this scenario, we program the decoders like so:
> ```
> Decoders:
> 
>                        CXL Root
>                           |
>                       decoder0.0
>                      IW:1 IG:256
>               [0x300000000, 0x3FFFFFFFF]
>                           |
>                      Host Bridge
>                           |
>                       decoder1.0
>                      IW:2 IG:256
>               [0x300000000, 0x3FFFFFFFF]
>                      /           \
>              Endpoint 0         Endpoint 1
>                  |                  |
>              decoder2.0         decoder3.0
>             IW:2 IG:256        IW:2 IG:256
>   [0x300000000, 0x3FFFFFFFF]  [0x300000000, 0x3FFFFFFFF]
> ```
> 
> The root decoder in this scenario does not participate in interleave;
> it simply forwards all accesses in this range to the host bridge.
> 
> The host bridge then applies the interleave across its connected devices,
> and the decoders apply translation accordingly.
> 

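Just to double check my understanding of the decoder programming above:
for Interleave Arithmetic 00 (standard modulo) and power-of-2 ways, I
picture target selection roughly as in the sketch below. The helper name
and the little walk-through are mine and purely illustrative - please
correct me if the math is off.

```
/* Rough sketch of standard-modulo target selection (power-of-2 ways only).
 * For an HPA that hits a decoder programmed with interleave ways IW and
 * granularity IG, I read the target port selection as: (offset / IG) % IW.
 * This is my reading of the example above, not driver code.
 */
#include <stdint.h>
#include <stdio.h>

static unsigned int select_target(uint64_t hpa, uint64_t base,
                                  unsigned int ways, unsigned int gran)
{
    uint64_t offset = hpa - base;      /* offset into the memory window */
    return (offset / gran) % ways;     /* which downstream target       */
}

int main(void)
{
    /* decoder1.0 above: IW:2 IG:256 over [0x300000000, 0x3FFFFFFFF] */
    uint64_t base = 0x300000000ULL;

    for (uint64_t hpa = base; hpa < base + 0x800; hpa += 0x100)
        printf("hpa 0x%llx -> endpoint %u\n",
               (unsigned long long)hpa, select_target(hpa, base, 2, 256));
    return 0;
}
```

Running this prints endpoints 0,1,0,1,... alternating every 256 bytes,
which is what I expect the host bridge decode to do here.
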
> -----------------------
> Combination Interleave.
> -----------------------
> Let's consider now a system where 2 Host Bridges have 2 CXL devices each,
> and we want to interleave the entire set. This requires us to make use
> of both inter and intra host bridge interleave.
> 
> First, we can interleave this with a single CEDT entry, the same as
> the first inter-host-bridge CEDT (now assuming 1GB per device).
> 
> ```
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000300000000 <- Memory Region
>   Window size              : 0000000100000000 <- 4GB
>   Interleave Members (2^n) : 01               <- 2-way interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge _UID
>   Next Target              : 00000006         <- Host Bridge _UID
> ```
> 
> This gives us a NUMA structure as follows:
> ```
> NUMA Structure:
> 
>       ---------     --------   |   ----------
>       | cpu0  |-----| DRAM |---|---| Node 0 |
>       ---------     --------   |   ----------
>        /      \                |
>   -------    -------           |   ----------
>   | HB0 |----| HB1 |-----------|---| Node 1 |
>   -------    -------           |   ----------
>    /   \      /   \            |
>  CXL0  CXL1  CXL2  CXL3        |
> ```
> 
> And the respective decoder programming looks as follows:
> ```
> Decoders:
> 
>                        CXL Root
>                           |
>                       decoder0.0
>                      IW:2 IG:256
>               [0x300000000, 0x3FFFFFFFF]
>                    /             \
>           Host Bridge 7       Host Bridge 6
>                 |                   |
>             decoder1.0          decoder2.0
>            IW:2 IG:512         IW:2 IG:512
>   [0x300000000, 0x3FFFFFFFF]  [0x300000000, 0x3FFFFFFFF]
>          /       \               /       \
>    endpoint0   endpoint1   endpoint2   endpoint3
>        |           |           |           |
>    decoder3.0  decoder4.0  decoder5.0  decoder6.0
>   IW:4 IG:256 IW:4 IG:256 IW:4 IG:256 IW:4 IG:256
>   [0x300000000, 0x3FFFFFFFF]  [0x300000000, 0x3FFFFFFFF]
> ```
> 
> Notice at both the root and the host bridge, the Interleave Ways is 2.
> There are two targets at each level. The host bridge has a granularity
> of 512 to capture its parent's ways and granularity (`2*256`).
> 
> Each decoder is programmed with the total number of targets (4) and the
> overall granularity (256B).

Is there a general relationship between the endpoints' decoder setup
(IW & IG) and the decoders above them? I've put my current reading in
the sketch at the end of this section.

> 
> We might use this setup if each CXL device is capable of x8 PCIe, and
> we have 2 Host Bridges capable of full x16 - utilizing all bandwidth
> available.
> 

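Here is that sketch. It just encodes the pattern I infer from the example
values above (host bridge IG = parent IG * parent IW, endpoint IW = the
product of the ways above it, endpoint IG = the window's overall
granularity). The struct and helper names are mine, purely illustrative:

```
/* Encodes my reading of the IW/IG relationship in the example above:
 *   - host bridge IG = root IG * root IW        (512 = 256 * 2)
 *   - endpoint   IW  = root IW * host bridge IW (4   = 2 * 2)
 *   - endpoint   IG  = root IG                  (256)
 * Not the driver's logic - just a consistency check of the example.
 */
#include <stdio.h>

struct decoder { unsigned int iw, ig; };

static int check_levels(struct decoder root, struct decoder hb,
                        struct decoder ep)
{
    if (hb.ig != root.ig * root.iw)
        return 0;
    if (ep.iw != root.iw * hb.iw)
        return 0;
    if (ep.ig != root.ig)
        return 0;
    return 1;
}

int main(void)
{
    struct decoder root = { .iw = 2, .ig = 256 };   /* decoder0.0        */
    struct decoder hb   = { .iw = 2, .ig = 512 };   /* decoder1.0 / 2.0  */
    struct decoder ep   = { .iw = 4, .ig = 256 };   /* decoder3.0 - 6.0  */

    printf("consistent: %s\n", check_levels(root, hb, ep) ? "yes" : "no");
    return 0;
}
```
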
> ---------------------------------------------
> Nuance: Hardware Interleave and Memory Holes.
> ---------------------------------------------
> You may encounter a system which cannot place the entire memory capacity
> into a single contiguous System Physical Address range. That's ok,
> because we can just use multiple decoders to capture this nuance.
> 
> Most CXL devices allow for multiple decoders.
> 
> This may require an SRAT entry to keep these regions on the same node.
> (Obviously this relies on your platform vendor's BIOS.)
> 
> ```
> CFMWS:
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000300000000 <- Memory Region
>   Window size              : 0000000080000000 <- 2GB
>   Interleave Members (2^n) : 00               <- No host bridge interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge 7
> 
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000400000000 <- Memory Region
>   Window size              : 0000000080000000 <- 2GB
>   Interleave Members (2^n) : 00               <- No host bridge interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge 7
> 
> SRAT:
>   Subtable Type            : 01 [Memory Affinity]
>   Length                   : 28
>   Proximity Domain         : 00000001         <- NUMA Node 1
>   Reserved1                : 0000
>   Base Address             : 0000000300000000 <- Physical Memory Region
>   Address Length           : 0000000080000000 <- first 2GB
> 
>   Subtable Type            : 01 [Memory Affinity]
>   Length                   : 28
>   Proximity Domain         : 00000001         <- NUMA Node 1
>   Reserved1                : 0000
>   Base Address             : 0000000400000000 <- Physical Memory Region
>   Address Length           : 0000000080000000 <- second 2GB
> ```
> 
> The SRAT entries allow us to keep the regions attached to the same node.
> 
> ```
> NUMA Structure:
> 
>   ---------     --------   |   ----------
>   | cpu0  |-----| DRAM |---|---| Node 0 |
>   ---------     --------   |   ----------
>       |                    |
>   -------                  |   ----------
>   | HB0 |------------------|---| Node 1 |
>   -------                  |   ----------
>    /   \                   |
>  CXL Dev  CXL Dev          |
> ```

Hi Gregory,

Seeing this, I have an assumption I'd like to discuss. Suppose the same
system instead uses tables like the ones below:

```
CFMWS:
  Subtable Type            : 01 [CXL Fixed Memory Window Structure]
  Reserved                 : 00
  Length                   : 002C
  Reserved                 : 00000000
  Window base address      : 0000000300000000 <- Memory Region
  Window size              : 0000000080000000 <- 2GB
  Interleave Members (2^n) : 00               <- No host bridge interleave
  Interleave Arithmetic    : 00
  Reserved                 : 0000
  Granularity              : 00000000
  Restrictions             : 0006             <- Bit(2) - Volatile
  QtgId                    : 0001
  First Target             : 00000007         <- Host Bridge 7

  Subtable Type            : 01 [CXL Fixed Memory Window Structure]
  Reserved                 : 00
  Length                   : 002C
  Reserved                 : 00000000
  Window base address      : 0000000400000000 <- Memory Region
  Window size              : 0000000080000000 <- 2GB
  Interleave Members (2^n) : 00               <- No host bridge interleave
  Interleave Arithmetic    : 00
  Reserved                 : 0000
  Granularity              : 00000000
  Restrictions             : 0006             <- Bit(2) - Volatile
  QtgId                    : 0001
  First Target             : 00000007         <- Host Bridge 7

SRAT:
  Subtable Type            : 01 [Memory Affinity]
  Length                   : 28
  Proximity Domain         : 00000000         <- NUMA Node 0
  Reserved1                : 0000
  Base Address             : 0000000300000000 <- Physical Memory Region
  Address Length           : 0000000080000000 <- first 2GB

  Subtable Type            : 01 [Memory Affinity]
  Length                   : 28
  Proximity Domain         : 00000001         <- NUMA Node 1
  Reserved1                : 0000
  Base Address             : 0000000400000000 <- Physical Memory Region
  Address Length           : 0000000080000000 <- second 2GB
```

The first 2GB CXL memory region would then be located in node 0,
alongside DRAM:

```
NUMA Structure:

  ---------     --------   |             ----------
  | cpu0  |-----| DRAM |---|-------------| Node 0 |
  ---------     --------   |           / ----------
      |                    |          /
      |                    |         / first 2GB
  -------                  |        /    ----------
  | HB0 |------------------|-------------| Node 1 |
  -------                  | second 2GB  ----------
   /   \                   |
 CXL Dev  CXL Dev          |
```

Is the above configuration and structure valid?

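To make my assumption concrete, this is the address-to-node mapping I am
picturing - just a toy model of the two hypothetical SRAT entries above,
not how the kernel actually parses SRAT; the names are mine:

```
/* Toy model of the hypothetical SRAT layout above: two 2GB windows,
 * the first bound to proximity domain 0 (with DRAM), the second to
 * proximity domain 1.  Illustrative only - not kernel behavior.
 */
#include <stdint.h>
#include <stdio.h>

struct mem_affinity { uint64_t base, size; int node; };

static const struct mem_affinity srat[] = {
    { 0x300000000ULL, 0x80000000ULL, 0 },   /* first 2GB  -> node 0 */
    { 0x400000000ULL, 0x80000000ULL, 1 },   /* second 2GB -> node 1 */
};

static int spa_to_node(uint64_t spa)
{
    for (unsigned int i = 0; i < sizeof(srat) / sizeof(srat[0]); i++)
        if (spa >= srat[i].base && spa < srat[i].base + srat[i].size)
            return srat[i].node;
    return -1;   /* not covered by these entries */
}

int main(void)
{
    printf("0x300000000 -> node %d\n", spa_to_node(0x300000000ULL));
    printf("0x400000000 -> node %d\n", spa_to_node(0x400000000ULL));
    return 0;
}
```
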
Yuquan

> And the decoder programming would look like so:
> ```
> Decoders:
> 
>                         CXL Root
>                        /        \
>              decoder0.0          decoder0.1
>             IW:1 IG:256         IW:1 IG:256
>   [0x300000000, 0x37FFFFFFF]  [0x400000000, 0x47FFFFFFF]
>                        \        /
>                       Host Bridge
>                        /        \
>              decoder1.0          decoder1.1
>             IW:2 IG:256         IW:2 IG:256
>   [0x300000000, 0x37FFFFFFF]  [0x400000000, 0x47FFFFFFF]
>          /       \               /       \
>   Endpoint 0  Endpoint 1  Endpoint 0  Endpoint 1
>       |           |           |           |
>   decoder2.0  decoder3.0  decoder2.1  decoder3.1
>  IW:2 IG:256 IW:2 IG:256 IW:2 IG:256 IW:2 IG:256
>   [0x300000000, 0x37FFFFFFF]  [0x400000000, 0x47FFFFFFF]
> ```
> 
> Linux manages decoders in relation to the associated component, so
> decoders are N.M where N is the component and M is the decoder number.
> 
> If you look, you'll see each side of this tree looks individually
> equivalent to the intra-host-bridge interleave example, just with one
> half of the total memory each (matching the CFMWS ranges).
> 
> Each of the root decoders still has an interleave width of 1 because
> they both only target one host bridge (despite it being the same one).
> 
> 
> --------------------------------
> Software Interleave (Mempolicy).
> --------------------------------
> Linux provides a software mechanism to allow a task to interleave its
> memory across NUMA nodes - which may have different performance
> characteristics. This component is called `mempolicy`, and is primarily
> operated on with the `set_mempolicy()` and `mbind()` syscalls.
> 
> These syscalls take a nodemask (a bitmask representing NUMA node IDs) as
> an argument to describe the intended allocation policy of the task.
> 
> The following policies are presently supported (as of v6.13):
> ```
> enum {
>         MPOL_DEFAULT,
>         MPOL_PREFERRED,
>         MPOL_BIND,
>         MPOL_INTERLEAVE,
>         MPOL_LOCAL,
>         MPOL_PREFERRED_MANY,
>         MPOL_WEIGHTED_INTERLEAVE,
> };
> ```
> 
> Let's look at `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE`.
> 
> To quote the man page:
> ```
> MPOL_INTERLEAVE
>     This mode interleaves page allocations across the nodes specified
>     in nodemask in numeric node ID order. This optimizes for bandwidth
>     instead of latency by spreading out pages and memory accesses to
>     those pages across multiple nodes. However, accesses to a single
>     page will still be limited to the memory bandwidth of a single node.
> 
> MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
>     This mode interleaves page allocations across the nodes specified in
>     nodemask according to the weights in
>     /sys/kernel/mm/mempolicy/weighted_interleave
>     For example, if bits 0, 2, and 5 are set in nodemask and the contents
>     of
>       /sys/kernel/mm/mempolicy/weighted_interleave/node0
>       /sys/ ... /node2
>       /sys/ ... /node5
>     are 4, 7, and 9, respectively, then pages in this region will be
>     allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
> ```
> 
> To put it simply, MPOL_INTERLEAVE will interleave allocations at a page
> granularity (4KB, 2MB, etc) across nodes in a 1:1 ratio, while
> MPOL_WEIGHTED_INTERLEAVE takes into account weights - which presumably
> map to the bandwidth of each respective node.
> 
> Or more concretely:
> 
> MPOL_INTERLEAVE
>   1:1 interleave between two nodes.
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     ... and so on ...
> 
> MPOL_WEIGHTED_INTERLEAVE
>   2:1 interleave between two nodes.
>     malloc(4096) -> node0
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     malloc(4096) -> node0
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     ... and so on ...
> 

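As a side note for anyone following along, this is roughly how I would
exercise both policies from userspace - a minimal sketch assuming
libnuma's numaif.h (build with -lnuma), a two-node system like the
examples above, and a kernel new enough for weighted interleave (6.9+);
error handling is kept deliberately minimal:

```
/* Sketch: task-wide 1:1 interleave via set_mempolicy(), plus a per-VMA
 * weighted interleave via mbind().  Node numbers match the two-node
 * examples above; adjust the nodemask for a real system.
 */
#include <numaif.h>       /* set_mempolicy(), mbind(), MPOL_*      */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6   /* 7th entry of the enum quoted above */
#endif

int main(void)
{
    unsigned long nodemask = (1UL << 0) | (1UL << 1);   /* nodes 0 and 1 */
    size_t len = 64UL << 20;                            /* 64MB mapping  */

    /* Task-wide policy: subsequent page faults interleave 1:1. */
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8))
        perror("set_mempolicy");

    /* Per-VMA policy: weighted interleave for one mapping only, using
     * the weights in /sys/kernel/mm/mempolicy/weighted_interleave. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0))
        perror("mbind");

    memset(buf, 0, len);   /* fault the pages in so placement happens */
    return 0;
}
```
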
> 
> This is the preferred mechanism for *heterogeneous interleave* on Linux,
> as it allows for predictable performance based on the explicit (and
> visible) placement of memory.
> 
> It also allows for memory ZONE restrictions to enable better performance
> predictability (e.g. keeping kernel locks out of CXL while allowing
> workloads to leverage it for expansion or bandwidth).
> 
> ======================
> Mempolicy Limitations.
> ======================
> Mempolicy is a *per-task* allocation policy that is inherited by
> child tasks on clone/fork. It can only be changed by the task itself,
> though cgroups may affect the effective nodemask via cpusets.
> 
> This means that once a task has been launched, an external actor cannot
> change the policy of the running task - except possibly by migrating that
> task between cgroups or changing the cpusets.mems value of the cgroup
> the task lives in.
> 
> Additionally, if capacity on a given node is not available, allocations
> will fall back to another node in the nodemask - which may cause
> interleave to become unbalanced.
> 
> ================================
> Hardware Interleave Limitations.
> ================================
> Granularities:
>   Granularities are limited by the hardware
>   (typically 256B up to 16KB, in powers of 2).
> 
> Ways:
>   Ways are limited by the CXL configuration to:
>   2, 4, 8, 16, 3, 6, 12
> 
> Balance:
>   Linux does not allow imbalanced interleave configurations
>   (e.g. 3-way interleave where 2 targets are on 1 HB and 1 on another).
> 
> Depending on your platform vendor and type of interleave, you may not
> be able to deconstruct an interleave region at all (decoders may be
> locked). In this case, you may not have the flexibility to convert
> operation from interleaved to non-interleaved via the driver interface.
> 
> In the scenario where your interleave configuration is entirely driver
> managed, you cannot adjust the size of an interleave set without
> deconstructing the entire set.
> 
> ------------------------------------------------------------------------
> 
> Next we'll discuss how memory allocations occur in a CXL-enabled system,
> which may be affected by things like Reclaim and Tiering systems.
> 
> ~Gregory

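One last note, to make sure I've read the hardware limits above correctly -
a tiny sketch encoding the ways/granularity constraints as listed (the
helper names are mine, not the driver's validation code):

```
/* My reading of the hardware interleave limits listed above:
 *   - granularity: power of 2, 256B through 16KB
 *   - ways: one of 2, 4, 8, 16, 3, 6, 12
 * Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

static bool valid_granularity(unsigned int ig)
{
    return ig >= 256 && ig <= 16384 && (ig & (ig - 1)) == 0;
}

static bool valid_ways(unsigned int iw)
{
    static const unsigned int allowed[] = { 2, 4, 8, 16, 3, 6, 12 };

    for (unsigned int i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        if (iw == allowed[i])
            return true;
    return false;
}

int main(void)
{
    printf("IG 512, IW 3 -> %s\n",
           valid_granularity(512) && valid_ways(3) ? "ok" : "invalid");
    printf("IG 384, IW 5 -> %s\n",
           valid_granularity(384) && valid_ways(5) ? "ok" : "invalid");
    return 0;
}
```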