I decided to dig into decoder programming as an addendum to the Driver section - where I said I *wouldn't* do this. It's important, though, when discussing interleave. So alas, we should at least have some base understanding of what the heck decoders are actually doing. This is not a regurgitation of the spec; you can think of it closer to a "Theory of Operation" or whatever. I will show discrete examples of how ACPI tables, the system memory map, and decoders relate.

----------------------------------------
Definitions: Addresses and HDM Decoders.
----------------------------------------
An HDM Decoder can be thought of, shorthand, as a "routing" mechanism, where a Physical Address is used to determine one of:

1) Fabric routing (i.e. which pipe to send a request down)
2) Address translation (Host to Device Physical Address)

In section 2, I referenced a simple device-to-decoder mapping:

```
root --- decoder0.0           -- Root Port Decoder
  |
port1 --- decoder1.0          -- Host Bridge Decoder
  |
endpoint0 --- decoder2.0      -- Endpoint Decoder
```

Barring any special innovations (cough) - endpoint decoders should be the only decoders that actually *translate* addresses - at least for basic volatile memory devices. All other decoders (Root, Host Bridge, Switch, etc) should simply forward DMA requests, with the original Physical Address intact, to the correct downstream device.

For extra confusion, there are now 3 "Physical Address" domains:

System Physical Address (SPA)
    The physical address of some location according to Linux. This is the
    address you see in the system memory map.

Host Physical Address (HPA)
    An abstract address used by decoders (I'll explain later).

Device Physical Address (DPA)
    A device-local physical address (e.g. if a device has 1TB of memory,
    its DPA range might be 0-0x10000000000).

----------------------------
DMA Routing (No Interleave).
----------------------------
Ok, we have some decoders and confusing physical address definitions - how does a DMA actually go from processor to DRAM via these decoders?

Let's consider our simple fabric with 256MB of memory at SPA base 4GB, and let's assume this was all set up statically by BIOS. We'd have the following CEDT CFMWS (see Section 0 - ACPI) and decoder programming.

```
CEDT
    Subtable Type : 01 [CXL Fixed Memory Window Structure]
    Reserved : 00
    Length : 002C
    Reserved : 00000000
    Window base address : 0000000100000000    <- Memory Region
    Window size : 0000000010000000            <- 256MB
    Interleave Members (2^n) : 00             <- Not interleaved

Memory Map:
    [mem 0x0000000100000000-0x0000000110000000] usable    <- SPA

Decoders
    root --- decoder0.0           -- range=[0x100000000, 0x110000000]
      |
    port1 --- decoder1.0          -- range=[0x100000000, 0x110000000]
      |
    endpoint0 --- decoder2.0      -- range=[0x100000000, 0x110000000]
```

When the CPU accesses an address in this range, the memory controller will send the request down the CXL fabric. The following steps occur:

0) CPU accesses SPA(0x101234567)
1) The root decoder identifies HPA(0x101234567) as valid and forwards the request to the host bridge associated with that address (port1)
2) The host bridge decoder identifies HPA(0x101234567) as valid and forwards the request to the endpoint associated with that address (endpoint0)
3) The endpoint decoder identifies HPA(0x101234567) as valid and translates that address to DPA(0x01234567)
4) The endpoint device uses DPA(0x01234567) to fulfill the request

In this scenario, our endpoint has a DPA range of (0, 0x10000000), but technically DPA address space is device-defined and may be sparse.
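If it helps to see those steps as code, here's a minimal sketch (illustrative C - these are not the Linux driver's types, and real decoders are hardware registers, not structs):

```
#include <stdint.h>
#include <stdbool.h>

/* Illustrative only - not the driver's types or the spec's registers. */
struct hdm_decoder {
        uint64_t base;  /* HPA base of the programmed range */
        uint64_t size;  /* length of the programmed range   */
};

/* Steps 1-2: root and host bridge decoders only range-check and
 * forward - the address is never modified on the way down. */
static bool decoder_claims(const struct hdm_decoder *d, uint64_t hpa)
{
        return hpa >= d->base && hpa < d->base + d->size;
}

/* Step 3: the endpoint decoder translates HPA to DPA. With no
 * interleave this is a simple offset (assuming a DPA base of 0). */
static uint64_t endpoint_hpa_to_dpa(const struct hdm_decoder *d, uint64_t hpa)
{
        return hpa - d->base;
}
```

With decoder2.0 programmed as base=0x100000000, size=0x10000000: decoder_claims() returns true for HPA(0x101234567), and endpoint_hpa_to_dpa() yields DPA(0x01234567) - exactly step 3 above.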
As you can see, the root and host bridge decoders simply "route" the access to the next appropriate hop, while the endpoint decoder actually does the translation work.

What if, instead, we had two 256MB endpoints on the same host bridge?

```
CEDT
    Subtable Type : 01 [CXL Fixed Memory Window Structure]
    Reserved : 00
    Length : 002C
    Reserved : 00000000
    Window base address : 0000000100000000    <- Memory Region
    Window size : 0000000020000000            <- 512MB
    Interleave Members (2^n) : 00             <- Not interleaved

Memory Map:
    [mem 0x0000000100000000-0x0000000120000000] usable    <- SPA

Decoders
                   decoder0.0
        range=[0x100000000, 0x120000000]
                       |
                   decoder1.0
        range=[0x100000000, 0x120000000]
                 /            \
        decoder2.0             decoder3.0
        range=[0x100000000,    range=[0x110000000,
               0x110000000]           0x120000000]
```

We still only have a single root port and host bridge decoder covering the entire 512MB range, but there are now 2 differently programmed endpoint decoders. This makes the routing a little more obvious: the root and host bridge decoders cover the entire SPA space (512MB), while each endpoint decoder only covers its own address space (256MB). The host bridge in this case is responsible for routing the request to the correct endpoint.

What if we had 2 endpoints, each attached to their own host bridge? In this case we'd have 2 root ports and 2 host bridge decoders.

```
CEDT
    Subtable Type : 01 [CXL Fixed Memory Window Structure]
    Reserved : 00
    Length : 002C
    Reserved : 00000000
    Window base address : 0000000100000000    <- Memory Region 1
    Window size : 0000000010000000            <- 256MB
    Interleave Members (2^n) : 00             <- Not interleaved
    First Target : 00000007                   <- Host Bridge _UID

    Subtable Type : 01 [CXL Fixed Memory Window Structure]
    Reserved : 00
    Length : 002C
    Reserved : 00000000
    Window base address : 0000000110000000    <- Memory Region 2
    Window size : 0000000010000000            <- 256MB
    Interleave Members (2^n) : 00             <- Not interleaved
    First Target : 00000006                   <- Host Bridge _UID

Memory Map - this may or may not be collapsed depending on Linux arch
    [mem 0x0000000100000000-0x0000000110000000] usable    <- System Phys Address
    [mem 0x0000000110000000-0x0000000120000000] usable    <- System Phys Address

Decoders
    decoder0.0                      decoder1.0                  - roots
    [0x100000000, 0x110000000]      [0x110000000, 0x120000000]
         |                               |
    decoder2.0                      decoder3.0                  - host bridges
    [0x100000000, 0x110000000]      [0x110000000, 0x120000000]
         |                               |
    decoder4.0                      decoder5.0                  - endpoints
    [0x100000000, 0x110000000]      [0x110000000, 0x120000000]
```

This scenario looks functionally the same as the first - with two distinct, non-overlapping sets of decoders (any given SPA may only be serviced by one device). The platform memory controller is responsible for routing the address to the correct root decoder.

In Section 4 (Interleave) we'll discuss a bit how interleave is accomplished - as this depends on whether you are interleaving across host bridges (aggregation) or within a host bridge (bifurcation).

---------------------------------------------
Nuance: Host Physical Address... translation?
---------------------------------------------
You might have noticed that all the addresses in the examples I showed are direct subsets of their parent decoder address ranges. The root is assigned a System Physical Address according to the system memory map, and all decoders under it are a subset of that range.
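That subset rule is what makes pure routing possible. Here's a rough sketch (illustrative C, not the actual driver or hardware logic) of what a routing decoder - root or host bridge - does with an incoming address:

```
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Illustrative only - these ranges are hardware registers in reality. */
struct hdm_decoder {
        uint64_t base, size;
};

static bool contains(const struct hdm_decoder *d, uint64_t addr)
{
        return addr >= d->base && addr < d->base + d->size;
}

/* Pick the downstream decoder whose range contains the address.
 * Every child range must be a subset of the parent's range, so a
 * valid address matches at most one child - and the address itself
 * is forwarded unmodified. */
static int route(const struct hdm_decoder *parent,
                 const struct hdm_decoder *children, size_t nr,
                 uint64_t hpa)
{
        if (!contains(parent, hpa))
                return -1;      /* not our window - don't forward  */
        for (size_t i = 0; i < nr; i++)
                if (contains(&children[i], hpa))
                        return (int)i;
        return -1;              /* a hole in the window, no target */
}
```

In the two-endpoint example, decoder1.0 is the parent and decoder2.0/decoder3.0 are the children; HPA(0x115000000) would route to index 1 (decoder3.0).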
You may have even noticed the routing steps suddenly change from SPA to HPA:

0) CPU accesses SPA(0x101234567)
1) The root decoder identifies HPA(0x101234567) as valid and forwards to the host bridge associated with that address (port1)

So what the heck is a "Host Physical Address"? Why isn't everything just described as a "System Physical Address"?

CXL HDM decoders *definitionally* handle HPA to DPA translations. That's it; that's the definition of an HPA. On MOST systems, what you see in the memory map is an SPA, and SPA=HPA, so all the decoders will appear to be programmed with SPAs. The platform MAY, however, perform translation before a request is routed to the decoder complex. I will cover an example of this in depth in an interleave addendum.

So the answer is that some ambiguity exists regarding whether platforms can/should do translation prior to HDM decoders even being utilized. So, for the sake of making everything more complicated and confusing for very little value:

1) decoders definitionally do "HPA to DPA" translation
2) most of the time "SPA=HPA"
3) so decoders mostly do "SPA to DPA" translation

If you're confused, that's ok - I was too, and still am. But hopefully between this section and Section 4 (Interleave) we can be marginally less confused together.

-----------------------------------------------
Nuance: Memory Holes and Hotplug Memory Blocks!
-----------------------------------------------
Help, BIOS split my memory device across non-contiguous memory regions!

```
CEDT
    Subtable Type : 01 [CXL Fixed Memory Window Structure]
    Reserved : 00
    Length : 002C
    Reserved : 00000000
    Window base address : 0000000100000000    <- Memory Region 1
    Window size : 0000000008000000            <- 128MB
    Interleave Members (2^n) : 00             <- Not interleaved
    First Target : 00000007                   <- Host Bridge _UID

    Subtable Type : 01 [CXL Fixed Memory Window Structure]
    Reserved : 00
    Length : 002C
    Reserved : 00000000
    Window base address : 0000000110000000    <- Memory Region 2
    Window size : 0000000008000000            <- 128MB
    Interleave Members (2^n) : 00             <- Not interleaved
    First Target : 00000007                   <- Host Bridge _UID

Memory Map
    [mem 0x0000000100000000-0x0000000107FFFFFF] usable    <- SPA
    [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
    [mem 0x0000000110000000-0x0000000117FFFFFF] usable    <- SPA
```

Take a breath. Everything will be ok. You can have multiple decoders at each point in the decoder complex! (Most devices should implement multiple decoders.)

```
Decoders
                      Root Port 0
                     /           \
         decoder0.0               decoder0.1
[0x100000000, 0x108000000]    [0x110000000, 0x118000000]
                     \           /
                     Host Bridge 7
                     /           \
         decoder1.0               decoder1.1
[0x100000000, 0x108000000]    [0x110000000, 0x118000000]
                     \           /
                      Endpoint 0
                     /           \
         decoder2.0               decoder2.1
[0x100000000, 0x108000000]    [0x110000000, 0x118000000]
```

(On the endpoint side, decoder2.0 and decoder2.1 map to DPA [0x0, 0x8000000) and [0x8000000, 0x10000000) respectively - endpoint decoders consume device DPA space cumulatively, in decoder order - so the hole never appears in DPA space.)

If your BIOS adds a memory hole, it better also use multiple decoders.

Oh wait - Sections 2 and 3 allude to hotplug memory blocks having size and alignment issues! If your BIOS adds a memory hole, it better also do it on Linux hotplug memory block alignment (2GB on x86), or you'll lose 1 hotplug memory block of capacity per CFMWS.

Oi, talk about some rough edges, right? :[

---------------------------------------
Nuance: BIOS vs OS Programmed Decoders.
---------------------------------------
The driver can (and does) program these decoders. However, it's entirely normal for BIOS/EFI to program decoders prior to OS init.
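Since the driver has to cope with exactly the kind of BIOS-programmed layouts shown above, here's a rough sketch (illustrative C - not the Linux driver's actual logic or structures) of translation across multiple committed decoders on one endpoint, with DPA consumed cumulatively in decoder order:

```
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Illustrative only. Assumes the device's DPA space starts at 0 and
 * decoders consume DPA contiguously in order (i.e. no DPA skip). */
struct hdm_decoder {
        uint64_t hpa_base, size;
        bool committed;         /* programmed (e.g. by BIOS/EFI) */
};

/* HPA -> DPA for an endpoint with multiple decoders. */
static int64_t hpa_to_dpa(const struct hdm_decoder *decs, size_t nr,
                          uint64_t hpa)
{
        uint64_t dpa_base = 0;

        for (size_t i = 0; i < nr; i++) {
                if (!decs[i].committed)
                        break;
                if (hpa >= decs[i].hpa_base &&
                    hpa < decs[i].hpa_base + decs[i].size)
                        return dpa_base + (hpa - decs[i].hpa_base);
                dpa_base += decs[i].size;  /* DPA stacks in order */
        }
        return -1;      /* in the hole, or decoder not committed */
}
```

With decoder2.0 and decoder2.1 from the example, HPA(0x110000004) misses decoder2.0, accumulates its 128MB into dpa_base, and lands in decoder2.1 at DPA(0x8000004). An HPA inside the reserved hole matches neither decoder and returns -1 - which is also the flavor of sanity checking the driver does when it validates decoders, as described next.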
Earlier in section 2 I said:

    "Most associations built by the driver are done by validating decoders"

What I meant by this is that the driver does one of two things with decoders:

1) Detects BIOS-programmed decoders and sanity checks them. If an unexpected configuration is found, it bails out. (If EFI_MEMORY_SP is set, this memory is then not accessible.)
2) Provides an interface for user-policy configuration of the decoders.

For the most part, the mechanism is the same. This carve-out is to tell you that if something isn't working, you should check whether the BIOS/EFI or the driver programmed the decoders. It will help you debug the issue quicker. In my experience, it's USUALLY a bad ACPI table.

This distinction will be more important in Section 4 (Interleave) when we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.

~Gregory