On Fri, 7 Mar 2025 22:23:05 -0500 Gregory Price <gourry@xxxxxxxxxx> wrote:

> In the last section we discussed how the CEDT CFMWS and SRAT Memory
> Affinity structures are used by Linux to "create" NUMA nodes (or at
> least mark them as possible). However, the examples I used suggested
> that there was a 1-to-1 relationship between CFMWS and devices or
> host bridges.
>
> This is not true - in fact, a CFMWS is simply a carve-out of System
> Physical Address space which may be used to map any number of endpoint
> devices behind the associated Host Bridge(s).
>
> The limiting factor is what your platform vendor BIOS supports.
>
> This section describes a handful of *possible* configurations, what NUMA
> structure they will create, and what flexibility this provides.
>
> All of these CFMWS configurations are made up, and may or may not exist
> in real machines. They are a conceptual teaching tool, not a roadmap.
>
> (When discussing interleave in this section, please note that I am
> intentionally omitting details about decoder programming, as this
> will be covered later.)
>
>
> -------------------------------
> One 2GB Device, Multiple CFMWS.
> -------------------------------
> Let's imagine we have one 2GB device attached to a host bridge.
>
> In this example, the device hosts 2GB of persistent memory - but we
> might want the flexibility to map capacity as volatile or persistent.

Fairly sure we block persistent in a volatile CFMWS in the kernel.
Does any BIOS actually do this?  You might have a variable-partition
device, but I thought in the kernel at least we decided that no one
was building that crazy?

Maybe a QoS split is a better example to motivate one range, two places?

>
> The platform vendor may decide that they want to reserve two entirely
> separate system physical address ranges to represent the capacity.
>
> ```
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000100000000    <- Memory Region
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000200000000    <- Memory Region
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 000A                <- Bit(3) - Persistent
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> You might have a CEDT with two CFMWS as above, where the base addresses
> are `0x100000000` and `0x200000000` respectively, but whose window sizes
> cover the entire 2GB capacity of the device. This affords the user
> flexibility in where the memory is mapped depending on whether it is
> mapped as volatile or persistent, while keeping the two SPA ranges
> separate.
>
> This is allowed because the endpoint decoders commit device physical
> address space *in order*, meaning no region of device physical
> address space can be mapped to more than one system physical address.
>
> i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)
>
> (See Section 2a - decoder programming).
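As an aside for anyone following along: a rough, standalone sketch of
the "one possible NUMA node per CFMWS" effect described above.  The
struct layout, field names and node numbering are made up for the
sketch (this is not the kernel's actual CEDT parsing); it just models
the two windows from the example.

```
/*
 * Illustrative only: model the two windows from the example above and
 * show how each CFMWS maps to its own "possible" NUMA node.  Field
 * names and node numbering are made up for this sketch.
 */
#include <stdio.h>
#include <stdint.h>

struct cfmws {
	uint64_t base;		/* Window base address (SPA) */
	uint64_t size;		/* Window size */
	uint16_t restrictions;	/* Bit(2) volatile, Bit(3) persistent */
};

int main(void)
{
	struct cfmws cedt[] = {
		{ 0x100000000ULL, 0x80000000ULL, 0x0006 },	/* volatile window */
		{ 0x200000000ULL, 0x80000000ULL, 0x000A },	/* persistent window */
	};
	int nid = 1;	/* pretend node 0 is the ordinary CPU/DRAM node */

	for (size_t i = 0; i < sizeof(cedt) / sizeof(cedt[0]); i++, nid++)
		printf("CFMWS %zu: SPA %#llx-%#llx -> node %d possible (%s)\n",
		       i,
		       (unsigned long long)cedt[i].base,
		       (unsigned long long)(cedt[i].base + cedt[i].size - 1),
		       nid,
		       (cedt[i].restrictions & (1 << 2)) ? "volatile" : "pmem");
	return 0;
}
```

The CFMWS only marks the node possible; whether it ever gets memory
depends on what regions are created against that window later.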
>
> -------------------------------------------------------------
> Two Devices On One Host Bridge - With and Without Interleave.
> -------------------------------------------------------------
> What if we wanted some capacity on each endpoint hosted on its own NUMA
> node, and wanted to interleave a portion of each device capacity?

If anyone hits the lock on commit (i.e. annoying BIOS), the ordering
checks on HPA kick in here and restrict flexibility a lot (assuming I
understand them correctly, that is).

This is a good illustration of why we should at some point revisit
multiple NUMA nodes per CFMWS.  We have to burn SPA space just to get
nodes.  From a spec point of view, all that is needed here is a single
CFMWS.

>
> We could produce the following CFMWS configuration.
> ```
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000100000000    <- Memory Region 1
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000200000000    <- Memory Region 2
> Window size               : 0000000080000000    <- 2GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> Subtable Type             : 01 [CXL Fixed Memory Window Structure]
> Reserved                  : 00
> Length                    : 002C
> Reserved                  : 00000000
> Window base address       : 0000000300000000    <- Memory Region 3
> Window size               : 0000000100000000    <- 4GB
> Interleave Members (2^n)  : 00
> Interleave Arithmetic     : 00
> Reserved                  : 0000
> Granularity               : 00000000
> Restrictions              : 0006                <- Bit(2) - Volatile
> QtgId                     : 0001
> First Target              : 00000007            <- Host Bridge _UID
>
> NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> In this configuration, we could still do what we did with the prior
> configuration (2 CFMWS), but we could also use the third root decoder
> to simplify decoder programming of interleave.
>
> Since the third region has sufficient capacity (4GB) to cover both
> devices (2GB each), we can actually associate the entire capacity of
> both devices with that region.
>
> We'll discuss this decoder structure in-depth in Section 4.
>
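To give readers a rough preview of what that third window buys us
(the details belong in the decoder sections), here is a standalone
sketch of standard 2-way modulo interleave arithmetic across the two
endpoints.  The 256-byte granularity and the sample offsets are
assumptions for illustration only, not values from any real decoder
programming.

```
/*
 * Illustrative only: roughly how standard modulo interleave plays out
 * for the third (4GB) window with two endpoints behind the host bridge.
 * The 256B granularity is an assumed value for this sketch.
 */
#include <stdio.h>
#include <stdint.h>

#define WINDOW_BASE  0x300000000ULL  /* Memory Region 3 from the example */
#define GRANULARITY  256ULL          /* assumed interleave granularity   */
#define WAYS         2ULL            /* two endpoint devices             */

int main(void)
{
	uint64_t offsets[] = { 0x0, 0x100, 0x200, 0x300 };

	for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
		uint64_t hpa = WINDOW_BASE + offsets[i];
		uint64_t off = hpa - WINDOW_BASE;
		/* Which endpoint services this address... */
		unsigned target = (unsigned)((off / GRANULARITY) % WAYS);
		/* ...and where it lands in that endpoint's DPA space. */
		uint64_t dpa = (off / (GRANULARITY * WAYS)) * GRANULARITY
			       + (off % GRANULARITY);

		printf("HPA %#llx -> endpoint %u, DPA %#llx\n",
		       (unsigned long long)hpa, target,
		       (unsigned long long)dpa);
	}
	return 0;
}
```

Run against the four sample offsets it alternates endpoints every
256 bytes, which is why the third window needs enough SPA to cover the
combined capacity of both devices.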