On Tue, Mar 11, 2025 at 08:09:02PM -0400, Gregory Price wrote:
> 
> -----------------------------
> Intra-Host-Bridge Interleave.
> -----------------------------
> Now let's consider a system where we've placed 2 CXL devices on the same
> Host Bridge. Maybe each CXL device is only capable of x8 PCIe, and we
> want to make full use of a single x16 link.
> 
> This setup only requires the BIOS to create a CEDT CFMWS which reports
> the entire capacity of all devices under the host bridge, but does not
> need to set up any interleaving.
> 
> In the following case, the BIOS has configured a single 4GB memory region
> which only targets the single host bridge, but maps the entire memory
> capacity of both devices (2GB).
> 
> ```
> CFMWS:
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000300000000 <- Memory Region
>   Window size              : 0000000080000000 <- 2GB

I think this should be "Window size : 0000000100000000 <- 4GB" here,
to match the 4GB range programmed into the decoders below.

>   Interleave Members (2^n) : 00               <- No host bridge interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge _UID
> ```
> 
> Assuming no other CEDT or SRAT entries exist, this will result in Linux
> creating the following NUMA topology, where all CXL memory is in Node 1.
> 
> ```
> NUMA Structure:
> 
>   ---------     --------   |   ----------
>   | cpu0  |-----| DRAM |---|---| Node 0 |
>   ---------     --------   |   ----------
>       |                    |
>   -------                  |   ----------
>   | HB0 |------------------|---| Node 1 |
>   -------                  |   ----------
>    /   \                   |
>  CXL Dev  CXL Dev          |
> ```
> 
> In this scenario, we program the decoders like so:
> ```
> Decoders:
> 
>                        CXL Root
>                           |
>                       decoder0.0
>                      IW:1 IG:256
>               [0x300000000, 0x3FFFFFFFF]
>                           |
>                      Host Bridge
>                           |
>                       decoder1.0
>                      IW:2 IG:256
>               [0x300000000, 0x3FFFFFFFF]
>                      /           \
>              Endpoint 0         Endpoint 1
>                  |                  |
>              decoder2.0         decoder3.0
>             IW:2 IG:256        IW:2 IG:256
>   [0x300000000, 0x3FFFFFFFF]  [0x300000000, 0x3FFFFFFFF]
> ```
> 
> The root decoder in this scenario does not participate in interleave;
> it simply forwards all accesses in this range to the host bridge.
> 
> The host bridge then applies the interleave across its connected devices,
> and the decoders apply translation accordingly.
> 

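Just to double check my understanding of the decoder programming above:
for Interleave Arithmetic 00 (standard modulo) and power-of-2 ways, I
picture target selection roughly as in the sketch below. The helper name
and the little walk-through are mine and purely illustrative - please
correct me if the math is off.

```
/* Rough sketch of standard-modulo target selection (power-of-2 ways only).
 * For an HPA that hits a decoder programmed with interleave ways IW and
 * granularity IG, I read the target port selection as: (offset / IG) % IW.
 * This is my reading of the example above, not driver code.
 */
#include <stdint.h>
#include <stdio.h>

static unsigned int select_target(uint64_t hpa, uint64_t base,
                                  unsigned int ways, unsigned int gran)
{
    uint64_t offset = hpa - base;      /* offset into the memory window */
    return (offset / gran) % ways;     /* which downstream target       */
}

int main(void)
{
    /* decoder1.0 above: IW:2 IG:256 over [0x300000000, 0x3FFFFFFFF] */
    uint64_t base = 0x300000000ULL;

    for (uint64_t hpa = base; hpa < base + 0x800; hpa += 0x100)
        printf("hpa 0x%llx -> endpoint %u\n",
               (unsigned long long)hpa, select_target(hpa, base, 2, 256));
    return 0;
}
```

Running this prints endpoints 0,1,0,1,... alternating every 256 bytes,
which is what I expect the host bridge decode to do here.
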
> -----------------------
> Combination Interleave.
> -----------------------
> Let's consider now a system where 2 Host Bridges have 2 CXL devices each,
> and we want to interleave the entire set. This requires us to make use
> of both inter and intra host bridge interleave.
> 
> First, we can interleave this with a single CEDT entry, the same as
> the first inter-host-bridge CEDT (now assuming 1GB per device).
> 
> ```
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000300000000 <- Memory Region
>   Window size              : 0000000100000000 <- 4GB
>   Interleave Members (2^n) : 01               <- 2-way interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge _UID
>   Next Target              : 00000006         <- Host Bridge _UID
> ```
> 
> This gives us a NUMA structure as follows:
> ```
> NUMA Structure:
> 
>       ---------     --------   |   ----------
>       | cpu0  |-----| DRAM |---|---| Node 0 |
>       ---------     --------   |   ----------
>        /      \                |
>   -------    -------           |   ----------
>   | HB0 |----| HB1 |-----------|---| Node 1 |
>   -------    -------           |   ----------
>    /   \      /   \            |
>  CXL0  CXL1  CXL2  CXL3        |
> ```
> 
> And the respective decoder programming looks as follows:
> ```
> Decoders:
> 
>                        CXL Root
>                           |
>                       decoder0.0
>                      IW:2 IG:256
>               [0x300000000, 0x3FFFFFFFF]
>                    /             \
>           Host Bridge 7       Host Bridge 6
>                 |                   |
>             decoder1.0          decoder2.0
>            IW:2 IG:512         IW:2 IG:512
>   [0x300000000, 0x3FFFFFFFF]  [0x300000000, 0x3FFFFFFFF]
>          /       \               /       \
>    endpoint0   endpoint1   endpoint2   endpoint3
>        |           |           |           |
>    decoder3.0  decoder4.0  decoder5.0  decoder6.0
>   IW:4 IG:256 IW:4 IG:256 IW:4 IG:256 IW:4 IG:256
>   [0x300000000, 0x3FFFFFFFF]  [0x300000000, 0x3FFFFFFFF]
> ```
> 
> Notice at both the root and the host bridge, the Interleave Ways is 2.
> There are two targets at each level. The host bridge has a granularity
> of 512 to capture its parent's ways and granularity (`2*256`).
> 
> Each decoder is programmed with the total number of targets (4) and the
> overall granularity (256B).

Is there a general relationship between the endpoints' decoder setup
(IW & IG) and the decoders above them? I've put my current reading in
the sketch at the end of this section.

> 
> We might use this setup if each CXL device is capable of x8 PCIe, and
> we have 2 Host Bridges capable of full x16 - utilizing all bandwidth
> available.
> 

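Here is that sketch. It just encodes the pattern I infer from the example
values above (host bridge IG = parent IG * parent IW, endpoint IW = the
product of the ways above it, endpoint IG = the window's overall
granularity). The struct and helper names are mine, purely illustrative:

```
/* Encodes my reading of the IW/IG relationship in the example above:
 *   - host bridge IG = root IG * root IW        (512 = 256 * 2)
 *   - endpoint   IW  = root IW * host bridge IW (4   = 2 * 2)
 *   - endpoint   IG  = root IG                  (256)
 * Not the driver's logic - just a consistency check of the example.
 */
#include <stdio.h>

struct decoder { unsigned int iw, ig; };

static int check_levels(struct decoder root, struct decoder hb,
                        struct decoder ep)
{
    if (hb.ig != root.ig * root.iw)
        return 0;
    if (ep.iw != root.iw * hb.iw)
        return 0;
    if (ep.ig != root.ig)
        return 0;
    return 1;
}

int main(void)
{
    struct decoder root = { .iw = 2, .ig = 256 };   /* decoder0.0        */
    struct decoder hb   = { .iw = 2, .ig = 512 };   /* decoder1.0 / 2.0  */
    struct decoder ep   = { .iw = 4, .ig = 256 };   /* decoder3.0 - 6.0  */

    printf("consistent: %s\n", check_levels(root, hb, ep) ? "yes" : "no");
    return 0;
}
```
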
> ---------------------------------------------
> Nuance: Hardware Interleave and Memory Holes.
> ---------------------------------------------
> You may encounter a system which cannot place the entire memory capacity
> into a single contiguous System Physical Address range. That's ok,
> because we can just use multiple decoders to capture this nuance.
> 
> Most CXL devices allow for multiple decoders.
> 
> This may require an SRAT entry to keep these regions on the same node.
> (Obviously this relies on your platform vendor's BIOS.)
> 
> ```
> CFMWS:
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000300000000 <- Memory Region
>   Window size              : 0000000080000000 <- 2GB
>   Interleave Members (2^n) : 00               <- No host bridge interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge 7
> 
>   Subtable Type            : 01 [CXL Fixed Memory Window Structure]
>   Reserved                 : 00
>   Length                   : 002C
>   Reserved                 : 00000000
>   Window base address      : 0000000400000000 <- Memory Region
>   Window size              : 0000000080000000 <- 2GB
>   Interleave Members (2^n) : 00               <- No host bridge interleave
>   Interleave Arithmetic    : 00
>   Reserved                 : 0000
>   Granularity              : 00000000
>   Restrictions             : 0006             <- Bit(2) - Volatile
>   QtgId                    : 0001
>   First Target             : 00000007         <- Host Bridge 7
> 
> SRAT:
>   Subtable Type            : 01 [Memory Affinity]
>   Length                   : 28
>   Proximity Domain         : 00000001         <- NUMA Node 1
>   Reserved1                : 0000
>   Base Address             : 0000000300000000 <- Physical Memory Region
>   Address Length           : 0000000080000000 <- first 2GB
> 
>   Subtable Type            : 01 [Memory Affinity]
>   Length                   : 28
>   Proximity Domain         : 00000001         <- NUMA Node 1
>   Reserved1                : 0000
>   Base Address             : 0000000400000000 <- Physical Memory Region
>   Address Length           : 0000000080000000 <- second 2GB
> ```
> 
> The SRAT entries allow us to keep the regions attached to the same node.
> 
> ```
> NUMA Structure:
> 
>   ---------     --------   |   ----------
>   | cpu0  |-----| DRAM |---|---| Node 0 |
>   ---------     --------   |   ----------
>       |                    |
>   -------                  |   ----------
>   | HB0 |------------------|---| Node 1 |
>   -------                  |   ----------
>    /   \                   |
>  CXL Dev  CXL Dev          |
> ```

Hi Gregory,

Seeing this, I have an assumption I'd like to discuss. Suppose the same
system instead uses tables like the ones below:

```
CFMWS:
  Subtable Type            : 01 [CXL Fixed Memory Window Structure]
  Reserved                 : 00
  Length                   : 002C
  Reserved                 : 00000000
  Window base address      : 0000000300000000 <- Memory Region
  Window size              : 0000000080000000 <- 2GB
  Interleave Members (2^n) : 00               <- No host bridge interleave
  Interleave Arithmetic    : 00
  Reserved                 : 0000
  Granularity              : 00000000
  Restrictions             : 0006             <- Bit(2) - Volatile
  QtgId                    : 0001
  First Target             : 00000007         <- Host Bridge 7

  Subtable Type            : 01 [CXL Fixed Memory Window Structure]
  Reserved                 : 00
  Length                   : 002C
  Reserved                 : 00000000
  Window base address      : 0000000400000000 <- Memory Region
  Window size              : 0000000080000000 <- 2GB
  Interleave Members (2^n) : 00               <- No host bridge interleave
  Interleave Arithmetic    : 00
  Reserved                 : 0000
  Granularity              : 00000000
  Restrictions             : 0006             <- Bit(2) - Volatile
  QtgId                    : 0001
  First Target             : 00000007         <- Host Bridge 7

SRAT:
  Subtable Type            : 01 [Memory Affinity]
  Length                   : 28
  Proximity Domain         : 00000000         <- NUMA Node 0
  Reserved1                : 0000
  Base Address             : 0000000300000000 <- Physical Memory Region
  Address Length           : 0000000080000000 <- first 2GB

  Subtable Type            : 01 [Memory Affinity]
  Length                   : 28
  Proximity Domain         : 00000001         <- NUMA Node 1
  Reserved1                : 0000
  Base Address             : 0000000400000000 <- Physical Memory Region
  Address Length           : 0000000080000000 <- second 2GB
```

The first 2GB CXL memory region would then be located in node 0,
alongside DRAM:

```
NUMA Structure:

  ---------     --------   |             ----------
  | cpu0  |-----| DRAM |---|-------------| Node 0 |
  ---------     --------   |           / ----------
      |                    |          /
      |                    |         / first 2GB
  -------                  |        /    ----------
  | HB0 |------------------|-------------| Node 1 |
  -------                  | second 2GB  ----------
   /   \                   |
 CXL Dev  CXL Dev          |
```

Is the above configuration and structure valid?

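To make my assumption concrete, this is the address-to-node mapping I am
picturing - just a toy model of the two hypothetical SRAT entries above,
not how the kernel actually parses SRAT; the names are mine:

```
/* Toy model of the hypothetical SRAT layout above: two 2GB windows,
 * the first bound to proximity domain 0 (with DRAM), the second to
 * proximity domain 1.  Illustrative only - not kernel behavior.
 */
#include <stdint.h>
#include <stdio.h>

struct mem_affinity { uint64_t base, size; int node; };

static const struct mem_affinity srat[] = {
    { 0x300000000ULL, 0x80000000ULL, 0 },   /* first 2GB  -> node 0 */
    { 0x400000000ULL, 0x80000000ULL, 1 },   /* second 2GB -> node 1 */
};

static int spa_to_node(uint64_t spa)
{
    for (unsigned int i = 0; i < sizeof(srat) / sizeof(srat[0]); i++)
        if (spa >= srat[i].base && spa < srat[i].base + srat[i].size)
            return srat[i].node;
    return -1;   /* not covered by these entries */
}

int main(void)
{
    printf("0x300000000 -> node %d\n", spa_to_node(0x300000000ULL));
    printf("0x400000000 -> node %d\n", spa_to_node(0x400000000ULL));
    return 0;
}
```
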
Yuquan

> And the decoder programming would look like so:
> ```
> Decoders:
> 
>                         CXL Root
>                        /        \
>              decoder0.0          decoder0.1
>             IW:1 IG:256         IW:1 IG:256
>   [0x300000000, 0x37FFFFFFF]  [0x400000000, 0x47FFFFFFF]
>                        \        /
>                       Host Bridge
>                        /        \
>              decoder1.0          decoder1.1
>             IW:2 IG:256         IW:2 IG:256
>   [0x300000000, 0x37FFFFFFF]  [0x400000000, 0x47FFFFFFF]
>          /       \               /       \
>   Endpoint 0  Endpoint 1  Endpoint 0  Endpoint 1
>       |           |           |           |
>   decoder2.0  decoder3.0  decoder2.1  decoder3.1
>  IW:2 IG:256 IW:2 IG:256 IW:2 IG:256 IW:2 IG:256
>   [0x300000000, 0x37FFFFFFF]  [0x400000000, 0x47FFFFFFF]
> ```
> 
> Linux manages decoders in relation to the associated component, so
> decoders are N.M where N is the component and M is the decoder number.
> 
> If you look, you'll see each side of this tree looks individually
> equivalent to the intra-host-bridge interleave example, just with one
> half of the total memory each (matching the CFMWS ranges).
> 
> Each of the root decoders still has an interleave width of 1 because
> they both only target one host bridge (despite it being the same one).
> 
> 
> --------------------------------
> Software Interleave (Mempolicy).
> --------------------------------
> Linux provides a software mechanism to allow a task to interleave its
> memory across NUMA nodes - which may have different performance
> characteristics. This component is called `mempolicy`, and is primarily
> operated on with the `set_mempolicy()` and `mbind()` syscalls.
> 
> These syscalls take a nodemask (a bitmask representing NUMA node IDs) as
> an argument to describe the intended allocation policy of the task.
> 
> The following policies are presently supported (as of v6.13):
> ```
> enum {
>         MPOL_DEFAULT,
>         MPOL_PREFERRED,
>         MPOL_BIND,
>         MPOL_INTERLEAVE,
>         MPOL_LOCAL,
>         MPOL_PREFERRED_MANY,
>         MPOL_WEIGHTED_INTERLEAVE,
> };
> ```
> 
> Let's look at `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE`.
> 
> To quote the man page:
> ```
> MPOL_INTERLEAVE
>     This mode interleaves page allocations across the nodes specified
>     in nodemask in numeric node ID order. This optimizes for bandwidth
>     instead of latency by spreading out pages and memory accesses to
>     those pages across multiple nodes. However, accesses to a single
>     page will still be limited to the memory bandwidth of a single node.
> 
> MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
>     This mode interleaves page allocations across the nodes specified in
>     nodemask according to the weights in
>     /sys/kernel/mm/mempolicy/weighted_interleave
>     For example, if bits 0, 2, and 5 are set in nodemask and the contents
>     of
>       /sys/kernel/mm/mempolicy/weighted_interleave/node0
>       /sys/ ... /node2
>       /sys/ ... /node5
>     are 4, 7, and 9, respectively, then pages in this region will be
>     allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
> ```
> 
> To put it simply, MPOL_INTERLEAVE will interleave allocations at a page
> granularity (4KB, 2MB, etc) across nodes in a 1:1 ratio, while
> MPOL_WEIGHTED_INTERLEAVE takes into account weights - which presumably
> map to the bandwidth of each respective node.
> 
> Or more concretely:
> 
> MPOL_INTERLEAVE
>   1:1 interleave between two nodes.
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     ... and so on ...
> 
> MPOL_WEIGHTED_INTERLEAVE
>   2:1 interleave between two nodes.
>     malloc(4096) -> node0
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     malloc(4096) -> node0
>     malloc(4096) -> node0
>     malloc(4096) -> node1
>     ... and so on ...
> 

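As a side note for anyone following along, this is roughly how I would
exercise both policies from userspace - a minimal sketch assuming
libnuma's numaif.h (build with -lnuma), a two-node system like the
examples above, and a kernel new enough for weighted interleave (6.9+);
error handling is kept deliberately minimal:

```
/* Sketch: task-wide 1:1 interleave via set_mempolicy(), plus a per-VMA
 * weighted interleave via mbind().  Node numbers match the two-node
 * examples above; adjust the nodemask for a real system.
 */
#include <numaif.h>       /* set_mempolicy(), mbind(), MPOL_*      */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6   /* 7th entry of the enum quoted above */
#endif

int main(void)
{
    unsigned long nodemask = (1UL << 0) | (1UL << 1);   /* nodes 0 and 1 */
    size_t len = 64UL << 20;                            /* 64MB mapping  */

    /* Task-wide policy: subsequent page faults interleave 1:1. */
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8))
        perror("set_mempolicy");

    /* Per-VMA policy: weighted interleave for one mapping only, using
     * the weights in /sys/kernel/mm/mempolicy/weighted_interleave. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0))
        perror("mbind");

    memset(buf, 0, len);   /* fault the pages in so placement happens */
    return 0;
}
```
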
> 
> This is the preferred mechanism for *heterogeneous interleave* on Linux,
> as it allows for predictable performance based on the explicit (and
> visible) placement of memory.
> 
> It also allows for memory ZONE restrictions to enable better performance
> predictability (e.g. keeping kernel locks out of CXL while allowing
> workloads to leverage it for expansion or bandwidth).
> 
> ======================
> Mempolicy Limitations.
> ======================
> Mempolicy is a *per-task* allocation policy that is inherited by
> child tasks on clone/fork. It can only be changed by the task itself,
> though cgroups may affect the effective nodemask via cpusets.
> 
> This means that once a task has been launched, an external actor cannot
> change the policy of the running task - except possibly by migrating that
> task between cgroups or changing the cpusets.mems value of the cgroup
> the task lives in.
> 
> Additionally, if capacity on a given node is not available, allocations
> will fall back to another node in the nodemask - which may cause
> interleave to become unbalanced.
> 
> ================================
> Hardware Interleave Limitations.
> ================================
> Granularities:
>   Granularities are limited by the hardware
>   (typically 256B up to 16KB, in powers of 2).
> 
> Ways:
>   Ways are limited by the CXL configuration to:
>   2, 4, 8, 16, 3, 6, 12
> 
> Balance:
>   Linux does not allow imbalanced interleave configurations
>   (e.g. 3-way interleave where 2 targets are on 1 HB and 1 on another).
> 
> Depending on your platform vendor and type of interleave, you may not
> be able to deconstruct an interleave region at all (decoders may be
> locked). In this case, you may not have the flexibility to convert
> operation from interleaved to non-interleaved via the driver interface.
> 
> In the scenario where your interleave configuration is entirely driver
> managed, you cannot adjust the size of an interleave set without
> deconstructing the entire set.
> 
> ------------------------------------------------------------------------
> 
> Next we'll discuss how memory allocations occur in a CXL-enabled system,
> which may be affected by things like Reclaim and Tiering systems.
> 
> ~Gregory

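One last note, to make sure I've read the hardware limits above correctly -
a tiny sketch encoding the ways/granularity constraints as listed (the
helper names are mine, not the driver's validation code):

```
/* My reading of the hardware interleave limits listed above:
 *   - granularity: power of 2, 256B through 16KB
 *   - ways: one of 2, 4, 8, 16, 3, 6, 12
 * Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

static bool valid_granularity(unsigned int ig)
{
    return ig >= 256 && ig <= 16384 && (ig & (ig - 1)) == 0;
}

static bool valid_ways(unsigned int iw)
{
    static const unsigned int allowed[] = { 2, 4, 8, 16, 3, 6, 12 };

    for (unsigned int i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        if (iw == allowed[i])
            return true;
    return false;
}

int main(void)
{
    printf("IG 512, IW 3 -> %s\n",
           valid_granularity(512) && valid_ways(3) ? "ok" : "invalid");
    printf("IG 384, IW 5 -> %s\n",
           valid_granularity(384) && valid_ways(5) ? "ok" : "invalid");
    return 0;
}
```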