[LSF/MM] CXL Boot to Bash - Section 4: Interleave

Gregory Price <gourry@xxxxxxxxxx> · Tue, 11 Mar 2025 20:09:02 -0400

In this section we'll cover a few different interleave mechanisms, some
of which require CXL decoder programming.  We'll discuss some of the
platform implications of hardware-interleave and how that affects driver
support, as well as software-based interleave.

Terminology
- Interleave Ways (IW):  Number of downstream targets in the interleave

- Interleave Granularity (IG):  The size of the interleaved data
                                (typically 256B-16KB, or 1 page)

- Hardware Interleave: Interleave done in CXL decoders

- Software Interleave: Interleave done by Linux (mempolicy, libnuma).

--------------------------------
Hardware Vs Software Interleave.
--------------------------------
CXL Hardware interleave is a memory interleave mechanism which utilizes
hardware decoders to spread System Physical Address accesses across a
number of devices transparent to the operating system.  A similar
technique is used on DRAM attached to a single socket.

We imagine physical memory as a linear construct, where physical address
implies the use of a specific piece of hardware.  In reality, hardware
interleave spreads access (typically at some multiple of cache line
granularity) across many devices to make better use bandwidth.

Imagine a system with 4GB of RAM in the address range 0x0-0xFFFFFFFF.
We often think of this memory linearly, where the first 2GB might be
the first DIMM and the second 2GB belong to the next.  But in reality,
when hardware interleave is in use, it may spread cache lines per-dimm.

     Simple Model                    Reality
    ---------------              ---------------
    |    DIMM0    |  0x00000000  |    DIMM0    |
    |    DIMM0    |              |    DIMM1    |
    |    DIMM0    |     ...      |    DIMM0    |
    |    DIMM0    |              |    DIMM1    |
    |    DIMM1    |  0x80000000  |    DIMM0    |
    |    DIMM1    |              |    DIMM1    |
    |    DIMM1    |              |    DIMM0    |
    |    DIMM1    |              |    DIMM1    |
    ---------------              ---------------

Software interleave, by contrast, concerns itself with managing
interleave among multiple NUMA nodes - where each node has different
performance characteristics.  This is typically done on a page-boundary
and is enforced by the kernel allocation and mempolicy system.

You can visualize this as a series of allocation calls returning pages
on different nodes.  In reality this occurs on fault (first access)
instead of malloc, but this is an easier way to think about it.

    1:1 Interleave between two nodes.
    malloc(4096)  ->  node0
    malloc(4096)  ->  node1
    malloc(4096)  ->  node0
    malloc(4096)  ->  node1
    ... and so on ...

These techniques are not mutually exclusive, and the granularity/ways
of interleave may differ between hardware and software interleave.

-----------------------------
Inter-Host-Bridge Interleave.
-----------------------------
Imagine we have a system configuration where we've placed 2 CXL devices
on their own dedicated Host Bridge.  Maybe each CXL device is capable of
a full x16 PCIE link, and we want to aggregate the bandwidth of these
devices by interleave across host bridges.

This setup will require the BIOS to create a CEDT CFMWS which reports the
intent to interleave across host bridges.  This is typically because the
chipset memory controller needs to be made aware of how to route accesses
to host bridge, which is platform specific.

In the follow case, the BIOS has configured as single 4GB memory region
which interleaves capacity across two Host Bridges (7 and 6).

```
CFMWS:
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000300000000   <- Memory Region
             Window size : 0000000100000000   <- 4GB
Interleave Members (2^n) : 01                 <- 2-way interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID
             Next Target : 00000006           <- Host Bridge _UID
```

Assuming no other CEDT or SRAT entries exist, this will result in Linux
creating the following NUMA topology, where all CXL memory is in Node 1.

```
NUMA Structure:

         ----------     --------   |    ----------
         |  cpu0  |-----| DRAM |---|----| Node 0 |
         ----------     --------   |    ----------
        /         \                |
    -------     -------            |    ----------
    | HB0 |-----| HB1 |------------|----| Node 1 |
    -------     -------            |    ----------
       |           |               |
    CXL Dev     CXL Dev            |
```

In this scenario, we program the decoders like so:

```
Decoders:
                             CXL  Root
                                 |
                             decoder0.0
                            IW:2   IG:256
                     [0x300000000, 0x3FFFFFFFF]
                             /         \
                Host Bridge 7           Host Bridge 6
                /                                    \
           decoder1.0                             decoder2.0
          IW:1   IG:512                          IW:1   IG:512
  [0x300000000, 0x3FFFFFFFFF]             [0x300000000, 0x3FFFFFFFF]
               |                                      |
           Endpoint 0                             Endpoint 1
               |                                      |
           decoder3.0                             decoder4.0
          IW:2   IG:256                          IW:2   IG:256
  [0x300000000, 0x3FFFFFFFF]              [0x300000000, 0x3FFFFFFFF]
```

Notice the Host Bridge ways and granularity differ from the root and
endpoints. In the fabric (root through everything but endpoints),
Interleave ways are *target-count per-leg* and the granularity is the
parent's (IW * IG).

Host Bridge Decoder:
   IW = 1 = number of targets
   IG = 512 = Parent IW * Parent IG  (2 * 256)

-----------------------------
Intra-Host-Bridge Interleave.
-----------------------------
Now lets consider a system where we've placed 2 CXL devices on the same
Host Bridge.  Maybe each CXL device is only capable of x8 PCIE, and we
want to make full use of a single x16 link.

This setup only requires the BIOS to create a CEDT CFMWS which reports
the entire capacity of all devices under the host bridge, but does not
need to set up any interleaving.

In the follow case, the BIOS has configured as single 4GB memory region
which only targets the single host bridge, but maps the entire memory
capacity of both devices (2GB).

```
CFMWS:
          Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000300000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00                 <- No host bridge interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID
```

Assuming no other CEDT or SRAT entries exist, this will result in linux
creating the following NUMA topology, where all CXL memory is in Node 1.

```
NUMA Structure:
        ---------     --------   |    ----------
        | cpu0  |-----| DRAM |---|----| Node 0 |
        ---------     --------   |    ----------
            |                    |
         -------                 |    ----------
         | HB0 |-----------------|----| Node 1 |
         -------                 |    ----------
        /       \                |
   CXL Dev     CXL Dev           |
```

In this scenario, we program the decoders like so:
```
Decoders
                           CXL Root
                              |
                          decoder0.0
                         IW:1  IG:256
                  [0x300000000, 0x3FFFFFFFF]
                              |
                          Host Bridge
                              |
                          decoder1.0
                         IW:2  IG:256
                   [0x300000000, 0x3FFFFFFFF]
                             /   \
                   Endpoint 0     Endpoint 1
                       |              |
                   decoder2.0     decoder3.0
                 IW:2  IG:256     IW:2  IG:256
    [0x300000000, 0x3FFFFFFFF]    [0x300000000, 0x3FFFFFFFF]
```

The root decoder in this scenario does not participate in interleave,
it simply forwards all accesses in this range to the host bridge.

The host bridge then applies the interleave across its connected devices
and the decodes apply translation accordingly.

-----------------------
Combination Interleave.
-----------------------
Lets consider now a system where 2 Host Bridges have 2 CXL devices each,
and we want to interleave the entire set.  This requires us to make use
of both inter and intra host bridge interleave.

First, we can interleave this with the a single CEDT entry, the same as
the first inter-host-bridge CEDT (now assuming 1GB per device).

```
           Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000300000000   <- Memory Region
             Window size : 0000000100000000   <- 4GB
Interleave Members (2^n) : 01                 <- 2-way interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge _UID
             Next Target : 00000006           <- Host Bridge _UID
```

This gives us a NUMA structure as follows:
```
NUMA Structure:

         ----------     --------    |   ----------
         |  cpu0  |-----| DRAM |----|---| Node 0 |
         ----------     --------    |   ----------
        /         \                 |
    -------     -------             |   ----------
    | HB0 |-----| HB1 |-------------|---| Node 1 |
    -------     -------             |   ----------
      / \         / \               |
  CXL0   CXL1  CXL2  CXL3           |
```

And the respective decoder programming looks as follows
```
Decoders:
                             CXL  Root
                                 |
                             decoder0.0
                            IW:2   IG:256
                      [0x300000000, 0x3FFFFFFFF]
                             /         \
                Host Bridge 7           Host Bridge 6
                /                                    \
           decoder1.0                             decoder2.0
          IW:2   IG:512                          IW:2   IG:512
  [0x300000000, 0x3FFFFFFFFF]             [0x300000000, 0x3FFFFFFFF]
            /    \                                  /    \
   endpoint0      endpoint1                endpoint2      endpoint3
      |               |                       |               |
  decoder3.0      decoder4.0              decoder5.0      decoder6.0
          IW:4  IG:256                            IW:4  IG:256
  [0x300000000, 0x3FFFFFFFF]              [0x300000000, 0x3FFFFFFFF]
```

Notice at both the root and the host bridge, the Interleave Ways is 2.
There are two targets at each level.  The host bridge has a granularity
of 512 to capture its parent's ways and granularity (`2*256`).

Each decoder is programmed with the total number of targets (4) and the
overall granularity (256B).

We might use this setup if each CXL device is capable of x8 PCIE, and
we have 2 Host Bridges capable of full x16 - utilizing all bandwidth
available.

---------------------------------------------
Nuance: Hardware Interleave and Memory Holes.
---------------------------------------------
You may encounter a system which cannot place the entire memory capacity
into a single contiguous System Physical Address range.  That's ok,
because we can just use multiple decoders to capture this nuance.

Most CXL devices allow for multiple decoders.

This may require an SRAT entry to keep these regions on the same node.
(Obviously the relies on your platform vendor's BIOS)

```
CFMWS:
         Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000300000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00                 <- No host bridge interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge 7

         Subtable Type : 01 [CXL Fixed Memory Window Structure]
                Reserved : 00
                  Length : 002C
                Reserved : 00000000
     Window base address : 0000000400000000   <- Memory Region
             Window size : 0000000080000000   <- 2GB
Interleave Members (2^n) : 00                 <- No host bridge interleave
   Interleave Arithmetic : 00
                Reserved : 0000
             Granularity : 00000000
            Restrictions : 0006               <- Bit(2) - Volatile
                   QtgId : 0001
            First Target : 00000007           <- Host Bridge 7

SRAT:
        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000001          <- NUMA Node 1
            Reserved1 : 0000
         Base Address : 0000000300000000  <- Physical Memory Region
       Address Length : 0000000080000000  <- first 2GB

        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000001          <- NUMA Node 1
            Reserved1 : 0000
         Base Address : 0000000400000000  <- Physical Memory Region
       Address Length : 0000000080000000  <- second 2GB
```

The SRAT entries allow us to keep the regions attached to the same node.
```

NUMA Structure:
        ---------     --------   |    ----------
        | cpu0  |-----| DRAM |---|----| Node 0 |
        ---------     --------   |    ----------
            |                    |
         -------                 |    ----------
         | HB0 |-----------------|----| Node 1 |
         -------                 |    ----------
        /       \                |
   CXL Dev     CXL Dev           |
```

And the decoder programming would look like so
```
Decoders:
                               CXL  Root
                             /           \
                    decoder0.0           decoder0.1
                  IW:1  IG:256           IW:1  IG:256
    [0x300000000, 0x37FFFFFFF]           [0x400000000, 0x47FFFFFFF]
                             \           /
                              Host Bridge
                             /           \
                    decoder1.0           decoder1.1
                  IW:2  IG:256           IW:2  IG:256
    [0x300000000, 0x37FFFFFFF]           [0x400000000, 0x47FFFFFFF]
              /   \                                /   \
    Endpoint 0     Endpoint 1            Endpoint 0     Endpoint 1
        |              |                     |              |
    decoder2.0     decoder3.0            decoder2.1     decoder3.1
            IW:2 IG:256                          IW:2 IG:256
    [0x300000000, 0x37FFFFFFF]           [0x400000000, 0x47FFFFFFF]
```

Linux manages decoders in relation to the associated component, so
decoders are N.M where N is the component and M is the decoder number.

If you look, you'll see each side of this tree looks individually
equivalent to the intra-host-bridge interleave example, just with one
half of the total memory each (matching the CFMWS ranges).

Each of the root decoders still has an interleave width of 1 because
they both only target one host bridge (despite it being the same one).

--------------------------------
Software Interleave (Mempolicy).
--------------------------------
Linux provides a software mechanism to allow tasks to to interleave its
memory across NUMA nodes - which may have different performance
characteristics.  This component is called `mempolicy`, and is primarily
operated on with the `set_mempolicy()` and `mbind()` syscalls.

These syscalls take a nodemask (bitmask representing NUMA node ids) as
an argument to describe the intended allocation policy of the task.

The following policies are presently supported (as of v6.13)
```
enum {
        MPOL_DEFAULT,
        MPOL_PREFERRED,
        MPOL_BIND,
        MPOL_INTERLEAVE,
        MPOL_LOCAL,
        MPOL_PREFERRED_MANY,
        MPOL_WEIGHTED_INTERLEAVE,
};
```

Let's look at `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE`.

To quote the man page:
```
MPOL_INTERLEAVE
    This  mode  interleaves  page  allocations  across the nodes specified
    in nodemask in numeric node ID order.  This optimizes for bandwidth
    instead of latency by spreading out pages and memory accesses to those
    pages across multiple nodes.  However, accesses to a single page will
    still be limited to the memory bandwidth of a single node.

MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
    This mode interleaves page allocations across the nodes specified in
    nodemask according to the weights in
        /sys/kernel/mm/mempolicy/weighted_interleave
    For example, if bits 0, 2, and 5 are set in nodemask and the contents of
        /sys/kernel/mm/mempolicy/weighted_interleave/node0
        /sys/ ... /node2
        /sys/ ... /node5
    are 4, 7, and 9, respectively, then pages in this region will be
    allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
```

To put it simply, MPOL_INTERLEAVE will interleave allocations at a page
granularity (4KB, 2MB, etc) across nodes in a 1:1 ratio, while
MPOL_WEIGHTED_INTERLEAVE takes into account weights - which presumably
map to the bandwidth of each respective node.

Or more concretely:

MPOL_INTERLEAVE
    1:1 Interleave between two nodes.
    malloc(4096)  ->  node0
    malloc(4096)  ->  node1
    malloc(4096)  ->  node0
    malloc(4096)  ->  node1
    ... and so on ...

MPOL_WEIGHTED_INTERLEAVE
    2:1 Interleave between two nodes.
    malloc(4096)  ->  node0
    malloc(4096)  ->  node0
    malloc(4096)  ->  node1
    malloc(4096)  ->  node0
    malloc(4096)  ->  node0
    malloc(4096)  ->  node1
    ... and so on ...

This is the preferred mechanism for *heterogeneous interleave* on Linux,
as it allows for predictable performance based on the explicit (and
visible) placement of memory.

It also allows for memory ZONE restrictions to enable better performance
predictability (e.g. keeping kernel locks out of CXL while allowing
workloads to leverage it for expansion or bandwidth).

======================
Mempolicy Limitations.
======================
Mempolicy is a *per-task* allocation policy that is inherited by
child-tasks on clone/fork. It can only be changed by the task itself,
though cgroups may affect the effective nodemask via cpusets.

This means once a task has been launched, and external actor cannot
change the policy of a running task - except possibly by migrating that
task between cgroups or changing the cpusets.mems value of the cgroup
the task lives in.

Additionally, If capacity on a given node is not available, allocations
will fall back to another node in the node mask - which may cause
interleave to become unbalanced.

================================
Hardware Interleave Limitations.
================================
Granularities:
   granularities are limited on hardware
   (typically 256B up to 16KB by power of 2)

Ways:
   Ways are limited by the CXL configuration to:
   2,4,8,16,3,6,12

Balance:
   Linux does not allow imbalanced interleave configurations
   (e.g. 3-way interleave where 2 targets are on 1 HB and 1 on another)

Depending on your platform vendor and type of interleave, you may not
be able to deconstruct an interleave region at all (decoders may be
locked).  In this case, you may not have the flexiblity to convert
operation from interleaved to non-interleave via the driver interface.

In the scenario where your interleave configuration is entirely driver
managed, you cannot adjust the size of an interleave set without
deconstructing the entire set.

------------------------------------------------------------------------

Next we'll discuss how memory allocations occur in a CXL-enabled system,
which may be affected by things like Reclaim and Tiering systems.

~Gregory