Hey Gregory,

Thank you so much for your detailed introduction to the entire CXL software
ecosystem, which I have read thoroughly. You are truly excellent.

On 07/03/2025 07:56, Gregory Price wrote:
> I decided to dig into decoder programming as an addendum to the
> Driver section - where I said I *wouldn't* do this. It's important
> though, when discussing interleave. So alas, we should at least have
> some base understanding of what the heck decoders are actually doing.
>
> This is not a regurgitation of the spec, you can think of it closer to
> a "Theory of Operation" or whatever. I will show discrete examples of
> how ACPI tables, system memory map, and decoders relate.
>
> ----------------------------------------
> Definitions: Addresses and HDM Decoders.
> ----------------------------------------
>
> An HDM Decoder can be thought of, in shorthand, as a "routing"
> mechanism, where a Physical Address is used to determine one of:
>
> 1) Fabric routing (i.e. which pipe to send a request down)
> 2) Address translation (Host to Device Physical Address)
>
> In section 2, I referenced a simple device-to-decoder mapping:
>
>   root      --- decoder0.0 -- Root Port Decoder
>    |               |
>   port1     --- decoder1.0 -- Host Bridge Decoder
>    |               |
>   endpoint0 --- decoder2.0 -- Endpoint Decoder

Here, I noticed something that differs slightly from my understanding:
"root --- decoder0.0 -- Root Port Decoder."

From the perspective of the Linux driver, decoder0.0 usually refers to an
associated CFMWS. Moreover, according to Spec r3.1 Table 8-22 CXL HDM
Decoder Capability, the CXL Root Port (also known as R in the table) is
not permitted to implement the HDM decoder.

If I have misunderstood something, please let me know.

Thanks
Zhijian

>
> Barring any special innovations (cough) - endpoint decoders should
> be the only decoders that actually "translate" addresses - at least
> for basic volatile memory devices.
>
> All other decoders (Root, Host Bridge, Switch, etc) should simply
> forward DMA requests with the original Physical Address intact to
> the correct downstream device.
>
> For extra confusion, there are now 3 "Physical Address" domains
>
> System Physical Address (SPA)
>    The physical address of some location according to Linux.
>    This is the address you see in the system memory map.
>
> Host Physical Address (HPA)
>    An abstract address used by decoders (I'll explain later)
>
> Device Physical Address (DPA)
>    A device-local physical address (e.g. if a device has 1TB of
>    memory, its DPA range might be 0-0x10000000000)
>
>
> ----------------------------
> DMA Routing (No Interleave).
> ----------------------------
> Ok, we have some decoders and confusing physical address definitions,
> how does a DMA actually go from processor to DRAM via these decoders?
>
> Let's consider our simple fabric with 256MB of memory at SPA base 4GB.
>
> Let's assume this was all set up statically by BIOS. We'd have the
> following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.
>
> ```
> CEDT
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000   <- Memory Region
>              Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>
> Memory Map:
>   [mem 0x0000000100000000-0x0000000110000000] usable  <- SPA
>
> Decoders
>   root      --- decoder0.0 -- range=[0x100000000, 0x110000000]
>    |               |
>   port1     --- decoder1.0 -- range=[0x100000000, 0x110000000]
>    |               |
>   endpoint0 --- decoder2.0 -- range=[0x100000000, 0x110000000]
> ```
>
> When the CPU accesses an address in this range, the memory controller
> will send the request down the CXL fabric.
> The following steps occur:
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
>    to host bridge associated with that address (port 1)
>
> 2) host bridge decoder identifies HPA(0x101234567) is valid and
>    forwards to endpoint associated with that address (endpoint0)
>
> 3) endpoint decoder identifies HPA(0x101234567) is valid and
>    translates that address to DPA(0x01234567).
>
> 4) The endpoint device uses DPA(0x01234567) to fulfill the request.
>
> In this scenario, our endpoint has a DPA range of (0, 0x10000000),
> but technically DPA address space is device-defined and may be sparse.
>
> As you can see, the root and host bridge decoders simply "route" the
> access to the next appropriate hop, while the endpoint decoder actually
> does the translation work.
>
>
> What if instead, we had two 256MB endpoints on the same host bridge?
>
> ```
> CEDT
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000   <- Memory Region
>              Window size : 0000000020000000   <- 512MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>
> Memory Map:
>   [mem 0x0000000100000000-0x0000000120000000] usable  <- SPA
>
> Decoders
>                          decoder0.0
>                range=[0x100000000, 0x120000000]
>                              |
>                          decoder1.0
>                range=[0x100000000, 0x120000000]
>                   /                         \
>           decoder2.0                         decoder3.0
> range=[0x100000000, 0x110000000]  range=[0x110000000, 0x120000000]
> ```
>
> We still only have a single root port and host bridge decoder that
> covers the entire 512MB range, but there are now 2 differently
> programmed endpoint decoders.
>
> This makes the routing a little more obvious. The root and host bridge
> decoders cover the entire SPA space (512MB), while the endpoint decoders
> only cover their own address space (256MB).
>
> The host bridge in this case is responsible for routing the request to
> the correct endpoint.
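
The routing steps above can be sketched in a few lines of Python. This is
only an illustration of the two-endpoint example, not driver code: the
Decoder class and route() helper are hypothetical names, SPA=HPA is
assumed, there is no interleave, and each endpoint is assumed to map its
range to a DPA range starting at 0.

```python
from dataclasses import dataclass, field

MB = 1 << 20


@dataclass
class Decoder:
    """Hypothetical model of one HDM decoder (not a kernel structure)."""
    name: str
    base: int   # first HPA covered by this decoder
    size: int   # number of bytes covered
    targets: list = field(default_factory=list)  # empty for an endpoint

    def claims(self, hpa):
        return self.base <= hpa < self.base + self.size


def route(decoder, hpa):
    """Follow an HPA down the decoder tree; return (endpoint name, DPA)."""
    if not decoder.claims(hpa):
        raise ValueError(f"{hpa:#x} is not covered by {decoder.name}")
    # Root, host bridge and switch decoders only pick the next hop;
    # the address itself is forwarded unchanged.
    while decoder.targets:
        nxt = next((t for t in decoder.targets if t.claims(hpa)), None)
        if nxt is None:
            raise ValueError(f"no target of {decoder.name} covers {hpa:#x}")
        decoder = nxt
    # Only the endpoint translates. Assumption: its DPA range starts at 0.
    return decoder.name, hpa - decoder.base


# Two 256MB endpoints behind one host bridge, as in the example above.
ep0 = Decoder("decoder2.0", 0x100000000, 256 * MB)
ep1 = Decoder("decoder3.0", 0x110000000, 256 * MB)
hb = Decoder("decoder1.0", 0x100000000, 512 * MB, [ep0, ep1])
root = Decoder("decoder0.0", 0x100000000, 512 * MB, [hb])

name, dpa = route(root, 0x101234567)
print(name, hex(dpa))
```

With these ranges, an access to 0x101234567 lands on decoder2.0 and an
access to 0x111234567 lands on decoder3.0; in both cases the resulting
DPA is 0x1234567, because each endpoint subtracts its own base.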
>
>
> What if we had 2 endpoints, each attached to their own host bridges?
> In this case we'd have 2 root ports and host bridge decoders.
>
> ```
> CEDT
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000   <- Memory Region 1
>              Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>             First Target : 00000007           <- Host Bridge _UID
>
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000110000000   <- Memory Region 2
>              Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>             First Target : 00000006           <- Host Bridge _UID
>
> Memory Map - this may or may not be collapsed depending on Linux arch
>   [mem 0x0000000100000000-0x0000000110000000] usable  <- System Phys Address
>   [mem 0x0000000110000000-0x0000000120000000] usable  <- System Phys Address
>
> Decoders
>         decoder0.0                   decoder1.0          - roots
> [0x100000000, 0x110000000]   [0x110000000, 0x120000000]
>             |                            |
>         decoder2.0                   decoder3.0          - host bridges
> [0x100000000, 0x110000000]   [0x110000000, 0x120000000]
>             |                            |
>         decoder4.0                   decoder5.0          - endpoints
> [0x100000000, 0x110000000]   [0x110000000, 0x120000000]
> ```
>
> This scenario looks functionally the same as the first - with two
> distinct, non-overlapping sets of decoders (any given SPA may only be
> serviced by one device). The platform memory controller is responsible
> for routing the address to the correct root decoder.
>
> In Section 4 (Interleave) we'll discuss a bit how the interleave is
> accomplished - as this depends on whether you are interleaving across
> host bridges (aggregation) or within a host bridge (bifurcation).
>
>
>
> ---------------------------------------------
> Nuance: Host Physical Address... translation?
> ---------------------------------------------
>
> You might have noticed that all the addresses in the examples I showed
> are direct subsets of their parent decoder address ranges. The root is
> assigned a System Physical Address according to the system memory map,
> and all decoders under it are a subset of that range.
>
> You may have even noticed routing steps suddenly change from SPA to HPA
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
>    to host bridge associated with that address (port 1)
>
> So what the heck is a "Host Physical Address"?
> Why isn't everything just described as a "System Physical Address"?
>
> CXL HDM decoders *definitionally* handle HPA to DPA translations.
>
> That's it, that's the definition of an HPA.
>
> On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
> so all the decoders will appear to be programmed with SPA. The platform
> MAY perform translation before a request is routed to the decoder
> complex.
>
> I will cover an example of this in-depth in an interleave addendum.
>
> So the answer is that some ambiguity exists regarding whether platforms
> can/should do translation prior to HDM decoders even being utilized. So
> for the sake of making everything more complicated and confusing for very
> little value:
>
> 1) decoders definitionally do "HPA to DPA" translation
> 2) most of the time "SPA=HPA"
> 3) so decoders mostly do "SPA to DPA" translation
>
> If you're confused, that's ok, I was too - and still am. But hopefully
> between this section and Section 4 (Interleave) we can be marginally
> less confused together.
>
>
> -----------------------------------------------
> Nuance: Memory Holes and Hotplug Memory Blocks!
> -----------------------------------------------
> Help, BIOS split my memory device across non-contiguous memory regions!
>
> ```
> CEDT
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000100000000   <- Memory Region 1
>              Window size : 0000000008000000   <- 128MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>             First Target : 00000007           <- Host Bridge _UID
>
> CEDT
>            Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                 Reserved : 00
>                   Length : 002C
>                 Reserved : 00000000
>      Window base address : 0000000110000000   <- Memory Region 2
>              Window size : 0000000008000000   <- 128MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>             First Target : 00000007           <- Host Bridge _UID
>
> Memory Map
>   [mem 0x0000000100000000-0x0000000107FFFFFF] usable  <- SPA
>   [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
>   [mem 0x0000000110000000-0x0000000117FFFFFF] usable  <- SPA
> ```
>
> Take a breath. Everything will be ok.
>
> You can have multiple decoders at each point in the decoder complex!
> (Most devices should implement multiple decoders).
>
> ```
> Decoders
>                        Root Port 0
>                  /                   \
>         decoder0.0                   decoder0.1
> [0x100000000, 0x108000000]   [0x110000000, 0x118000000]
>                  \                   /
>                       Host Bridge 7
>                  /                   \
>         decoder1.0                   decoder1.1
> [0x100000000, 0x108000000]   [0x110000000, 0x118000000]
>                  \                   /
>                        Endpoint 0
>                  /                   \
>         decoder2.0                   decoder2.1
> [0x100000000, 0x108000000]   [0x110000000, 0x118000000]
> ```
>
> If your BIOS adds a memory hole, it better also use multiple decoders.
>
> Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
> having size and alignment issues!
>
> If your BIOS adds a memory hole, it better also do it on Linux hotplug
> memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
> block of capacity per CFMWS.
>
> Oi, talk about some rough edges, right? :[
>
> ---------------------------------------
> Nuance: BIOS vs OS Programmed Decoders.
> ---------------------------------------
> The driver can (and does) program these decoders.
> However, it's entirely normal for BIOS/EFI to program decoders prior
> to OS init.
>
> Earlier in section 2 I said:
>    Most associations built by the driver are done by validating decoders
>
> What I meant by this is the driver does one of two things with decoders:
>
> 1) Detects BIOS programmed decoders and sanity checks them.
>    If an unexpected configuration is found, it bails out.
>    This memory is not accessible if EFI_MEMORY_SP is set.
>
> 2) Provides an interface for user policy configuration of the decoders
>
> For the most part, the mechanism is the same. This carve-out is to tell
> you that if something isn't working, you should check whether the
> BIOS/EFI or driver programmed the decoders. It will help debug the
> issue quicker.
>
> In my experience, it's USUALLY a bad ACPI table.
>
> This distinction will be more important in Section 4 (Interleave) when
> we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.
>
> ~Gregory
>
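
As a toy illustration of what "sanity checking" pre-programmed decoders
could look like, the sketch below tests one property stated earlier in
the thread: every decoder range should be a subset of its parent's range.
This is not what the Linux driver actually implements; find_problems()
and the dictionary layout are hypothetical, and ranges are written as
[base, end) with an exclusive end.

```python
def find_problems(decoders):
    """Return a list of human-readable range problems.

    decoders maps a decoder name to (base, end, parent), where end is
    exclusive and parent is another decoder name or None for a root.
    """
    problems = []
    for name, (base, end, parent) in decoders.items():
        if base >= end:
            problems.append(f"{name}: empty or inverted range")
        if parent is not None:
            pbase, pend, _ = decoders[parent]
            # A child must sit entirely inside its parent's range.
            if base < pbase or end > pend:
                problems.append(f"{name}: not contained in {parent}")
    return problems


# The two-endpoint example from the thread: all ranges nest correctly.
good = {
    "decoder0.0": (0x100000000, 0x120000000, None),
    "decoder1.0": (0x100000000, 0x120000000, "decoder0.0"),
    "decoder2.0": (0x100000000, 0x110000000, "decoder1.0"),
    "decoder3.0": (0x110000000, 0x120000000, "decoder1.0"),
}

# Same topology, but the second endpoint is shifted past its parent.
bad = dict(good, **{"decoder3.0": (0x118000000, 0x128000000, "decoder1.0")})

print(find_problems(good))
print(find_problems(bad))
```

A check like this only flags inconsistent ranges; it cannot tell whether
BIOS/EFI or the driver programmed them, which is the thing to look at
first when debugging.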