On Fri, 7 Mar 2025 22:23:05 -0500 Gregory Price <gourry@xxxxxxxxxx> wrote:

> In the last section we discussed how the CEDT CFMWS and SRAT Memory
> Affinity structures are used by Linux to "create" NUMA nodes (or at
> least mark them as possible). However, the examples I used suggested
> that there was a 1-to-1 relationship between CFMWS and devices or
> host bridges.
>
> This is not true - in fact, a CFMWS is simply a carve-out of System
> Physical Address space which may be used to map any number of endpoint
> devices behind the associated Host Bridge(s).
>
> The limiting factor is what your platform vendor BIOS supports.
>
> This section describes a handful of *possible* configurations, what NUMA
> structure they will create, and what flexibility this provides.
>
> All of these CFMWS configurations are made up, and may or may not exist
> in real machines. They are a conceptual teaching tool, not a roadmap.
>
> (When discussing interleave in this section, please note that I am
> intentionally omitting details about decoder programming, as this
> will be covered later.)
>
>
> -------------------------------
> One 2GB Device, Multiple CFMWS.
> -------------------------------
> Let's imagine we have one 2GB device attached to a host bridge.
>
> In this example, the device hosts 2GB of persistent memory - but we
> might want the flexibility to map capacity as volatile or persistent.

Fairly sure we block persistent in a volatile CFMWS in the kernel.
Does any BIOS actually do this?  You might have a variable-partition
device, but I thought in the kernel at least we decided that no one
was building that crazy?

Maybe a QoS split is a better example to motivate one range, two places?

>
> The platform vendor may decide that they want to reserve two entirely
> separate system physical address ranges to represent the capacity.
>
> ```
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000100000000    <- Memory Region
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000200000000    <- Memory Region
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 000A                <- Bit(3) - Persistent
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> You might have a CEDT with two CFMWS as above, where the base addresses
> are `0x100000000` and `0x200000000` respectively, but whose window sizes
> cover the entire 2GB capacity of the device. This affords the user
> flexibility in where the memory is mapped depending on whether it is
> mapped as volatile or persistent, while keeping the two SPA ranges
> separate.
>
> This is allowed because the endpoint decoders commit device physical
> address space *in order*, meaning no region of device physical
> address space can be mapped to more than one system physical address.
>
> i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)
>
> (See Section 2a - decoder programming).
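As an aside for anyone following along: a rough, standalone sketch of
the "one possible NUMA node per CFMWS" effect described above.  The
struct layout, field names and node numbering are made up for the
sketch (this is not the kernel's actual CEDT parsing); it just models
the two windows from the example.

```
/*
 * Illustrative only: model the two windows from the example above and
 * show how each CFMWS maps to its own "possible" NUMA node.  Field
 * names and node numbering are made up for this sketch.
 */
#include <stdio.h>
#include <stdint.h>

struct cfmws {
	uint64_t base;		/* Window base address (SPA) */
	uint64_t size;		/* Window size */
	uint16_t restrictions;	/* Bit(2) volatile, Bit(3) persistent */
};

int main(void)
{
	struct cfmws cedt[] = {
		{ 0x100000000ULL, 0x80000000ULL, 0x0006 },	/* volatile window */
		{ 0x200000000ULL, 0x80000000ULL, 0x000A },	/* persistent window */
	};
	int nid = 1;	/* pretend node 0 is the ordinary CPU/DRAM node */

	for (size_t i = 0; i < sizeof(cedt) / sizeof(cedt[0]); i++, nid++)
		printf("CFMWS %zu: SPA %#llx-%#llx -> node %d possible (%s)\n",
		       i,
		       (unsigned long long)cedt[i].base,
		       (unsigned long long)(cedt[i].base + cedt[i].size - 1),
		       nid,
		       (cedt[i].restrictions & (1 << 2)) ? "volatile" : "pmem");
	return 0;
}
```

The CFMWS only marks the node possible; whether it ever gets memory
depends on what regions are created against that window later.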
>
> -------------------------------------------------------------
> Two Devices On One Host Bridge - With and Without Interleave.
> -------------------------------------------------------------
> What if we wanted some capacity on each endpoint hosted on its own NUMA
> node, and wanted to interleave a portion of each device capacity?

If anyone hits the lock on commit (i.e. annoying BIOS), the ordering
checks on HPA kick in here and restrict flexibility a lot (assuming I
understand them correctly, that is).

This is a good illustration of why we should at some point revisit
multiple NUMA nodes per CFMWS.  We have to burn SPA space just to get
nodes.  From a spec point of view, all that is needed here is a single
CFMWS.

>
> We could produce the following CFMWS configuration.
> ```
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000100000000    <- Memory Region 1
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000200000000    <- Memory Region 2
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000300000000    <- Memory Region 3
> Window size               : 0000000100000000    <- 4GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> In this configuration, we could still do what we did with the prior
> configuration (2 CFMWS), but we could also use the third root decoder
> to simplify decoder programming of interleave.
>
> Since the third region has sufficient capacity (4GB) to cover both
> devices (2GB each), we can actually associate the entire capacity of
> both devices with that region.
>
> We'll discuss this decoder structure in-depth in Section 4.
>
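To give readers a rough preview of what that third window buys us
(the details belong in the decoder sections), here is a standalone
sketch of standard 2-way modulo interleave arithmetic across the two
endpoints.  The 256-byte granularity and the sample offsets are
assumptions for illustration only, not values from any real decoder
programming.

```
/*
 * Illustrative only: roughly how standard modulo interleave plays out
 * for the third (4GB) window with two endpoints behind the host bridge.
 * The 256B granularity is an assumed value for this sketch.
 */
#include <stdio.h>
#include <stdint.h>

#define WINDOW_BASE  0x300000000ULL  /* Memory Region 3 from the example */
#define GRANULARITY  256ULL          /* assumed interleave granularity   */
#define WAYS         2ULL            /* two endpoint devices             */

int main(void)
{
	uint64_t offsets[] = { 0x0, 0x100, 0x200, 0x300 };

	for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
		uint64_t hpa = WINDOW_BASE + offsets[i];
		uint64_t off = hpa - WINDOW_BASE;
		/* Which endpoint services this address... */
		unsigned target = (unsigned)((off / GRANULARITY) % WAYS);
		/* ...and where it lands in that endpoint's DPA space. */
		uint64_t dpa = (off / (GRANULARITY * WAYS)) * GRANULARITY
			       + (off % GRANULARITY);

		printf("HPA %#llx -> endpoint %u, DPA %#llx\n",
		       (unsigned long long)hpa, target,
		       (unsigned long long)dpa);
	}
	return 0;
}
```

Run against the four sample offsets it alternates endpoints every
256 bytes, which is why the third window needs enough SPA to cover the
combined capacity of both devices.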