Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming

"Zhijian Li (Fujitsu)" <lizhijian@xxxxxxxxxxx> · Fri, 7 Mar 2025 00:57:18 +0000

Hey Gregory,

Thank you so much for your detailed introduction to the entire CXL
software ecosystem, which I have thoroughly read. You are truly excellent.

On 07/03/2025 07:56, Gregory Price wrote:
> I decided to dig into decoder programming as an addendum to the
> Driver section - where I said I *wouldn't* do this. It's important
> though, when discussing interleave. So alas, we should at least have
> some base understanding of what the heck decoders are actually doing.
> 
> This is not a regurgitation of the spec, you can think of it closer to
> a "Theory of Operation" or whatever.  I will show discrete examples of
> how ACPI tables, system memory map, and decoders relate.
> 
> ----------------------------------------
> Definitions: Addresses and HDM Decoders.
> ----------------------------------------
> 
> An HDM Decoder can be thought of, in shorthand, as a "routing" mechanism,
> where a Physical Address is used to determine one of:
> 
>    1) Fabric routing (i.e. which pipe to send a request down)
>    2) Address translation (Host to Device Physical Address)
> 
> In section 2, I referenced a simple device-to-decoder mapping:
> 
>      root    ---  decoder0.0   -- Root Port Decoder
>       |               |
>     port1    ---  decoder1.0   -- Host Bridge Decoder
>       |               |
>    endpoint0 ---  decoder2.0   -- Endpoint Decoder

Here, I noticed something that differs slightly from my understanding:
"root --- decoder0.0 -- Root Port Decoder."

From the perspective of the Linux driver, decoder0.0 usually refers to
an associated CFMWS. Moreover, according to Spec r3.1, Table 8-22 (CXL HDM
Decoder Capability), the CXL Root Port (labeled R in the table) is not
permitted to implement an HDM decoder.

If I have misunderstood something, please let me know.

Thanks
Zhijian

> 
> Barring any special innovations (cough) - endpoint decoders should
> be the only decoders that actually "translate" addresses - at least
> for basic volatile memory devices.
> 
> All other decoders (Root, Host Bridge, Switch, etc) should simply
> forward DMA requests with the original Physical Address intact to
> the correct downstream device.
> 
> For extra confusion, there are now 3 "Physical Address" domains
> 
> System Physical Address (SPA)
>    The physical address of some location according to linux.
>    This is the address you see in the system memory map.
> 
> Host Physical Address   (HPA)
>    An abstract address used by decoders (I'll explain later)
> 
> Device Physical Address (DPA)
>    A device-local physical address (e.g. if a device has 1TB of
>    memory, its DPA range might be 0-0x10000000000)
> 
> 
> ----------------------------
> DMA Routing (No Interleave).
> ----------------------------
> Ok, we have some decoders and confusing physical address definitions,
> how does a DMA actually go from processor to DRAM via these decoders?
> 
> Let's consider our simple fabric with 256MB of memory at SPA base 4GB.
> 
> Let's assume this was all set up statically by BIOS.  We'd have the
> following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region
>               Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
> 
> Memory Map:
>    [mem 0x0000000100000000-0x0000000110000000] usable    <- SPA
> 
> Decoders
>    root    ---  decoder0.0   -- range=[0x100000000, 0x110000000]
>     |               |
>   port1    ---  decoder1.0   -- range=[0x100000000, 0x110000000]
>     |               |
> endpoint0 ---  decoder2.0   -- range=[0x100000000, 0x110000000]
> ```
> 
> When the CPU accesses an address in this range, the memory controller
> will send the request down the CXL fabric. The following steps occur:
> 
>    0) CPU accesses SPA(0x101234567)
> 
>    1) root decoder identifies HPA(0x101234567) is valid and forwards
>       to host bridge associated with that address (port 1)
> 
>    2) host bridge decoder identifies HPA(0x101234567) is valid and
>       forwards to endpoint associated with that address (endpoint0)
> 
>    3) endpoint decoder identifies HPA(0x101234567) is valid and
>       translates that address to DPA(0x01234567).
> 
>    4) The endpoint device uses DPA(0x01234567) to fulfill the request.
> 
> In this scenario, our endpoint has a DPA range of (0, 0x10000000),
> but technically DPA address space is device-defined and may be sparse.
> 
> As you can see, the root and host bridge decoders simply "route" the
> access to the next appropriate hop, while the endpoint decoder actually
> does the translation work.
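
The routing steps above can be sketched in a few lines of Python. This is only a toy model, not driver code: the `Decoder` class and `route` function are hypothetical names, ranges are treated as half-open `[base, base + size)`, and the endpoint's DPA range is assumed to start at 0 as in the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Decoder:
    """Hypothetical model of one HDM decoder: a name plus the HPA range it claims."""
    name: str
    base: int  # first HPA covered
    size: int  # number of bytes covered

    def contains(self, hpa: int) -> bool:
        return self.base <= hpa < self.base + self.size


def route(hpa: int, path: List[Decoder]) -> int:
    """Walk root -> host bridge -> endpoint and return the DPA.

    Every hop only checks that the HPA is in range and forwards it
    unchanged; the last decoder (the endpoint) does the translation.
    """
    for dec in path:
        if not dec.contains(hpa):
            raise ValueError(f"{dec.name}: HPA {hpa:#x} is outside its range")
    endpoint = path[-1]
    return hpa - endpoint.base  # assumes the endpoint's DPA range starts at 0


BASE, SIZE = 0x100000000, 0x10000000  # 256MB window at 4GB, from the example
path = [
    Decoder("decoder0.0", BASE, SIZE),  # root
    Decoder("decoder1.0", BASE, SIZE),  # host bridge
    Decoder("decoder2.0", BASE, SIZE),  # endpoint
]

print(hex(route(0x101234567, path)))  # 0x1234567
```

An address outside the window, such as 0x110000000, raises `ValueError` at the first hop, which mirrors the "is valid" check in step 1.
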
> 
> 
> What if instead, we had two 256MB endpoints on the same host bridge?
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region
>               Window size : 0000000020000000   <- 512MB
> Interleave Members (2^n) : 00                 <- Not interleaved
> 
> Memory Map:
>    [mem 0x0000000100000000-0x0000000120000000] usable  <- SPA
> 
> Decoders
>                              decoder0.0
>                    range=[0x100000000, 0x120000000]
>                                  |
>                              decoder1.0
>                    range=[0x100000000, 0x120000000]
>                    /                              \
>              decoder2.0                        decoder3.0
>    range=[0x100000000, 0x110000000]   range=[0x110000000, 0x120000000]
> ```
> 
> We still only have a single root port and host bridge decoder that
> covers the entire 512MB range, but there are now 2 differently
> programmed endpoint decoders.
> 
> This makes the routing a little more obvious.  The root and host bridge
> decoders cover the entire SPA space (512MB), while the endpoint decoders
> only cover their own address space (256MB).
> 
> The host bridge in this case is responsible for routing the request to
> the correct endpoint.
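
A rough sketch of that outcome, in Python: given the two endpoint decoder ranges from the diagram, find which endpoint claims an address and what DPA it ends up with. The `pick_endpoint` helper is hypothetical, ranges are assumed half-open, and each endpoint's DPA range is assumed to start at 0. It models only the result, not how host bridge hardware actually selects a downstream port.

```python
from typing import List, Tuple

# (name, HPA base, size) for the two endpoint decoders in the diagram.
ENDPOINTS: List[Tuple[str, int, int]] = [
    ("decoder2.0", 0x100000000, 0x10000000),
    ("decoder3.0", 0x110000000, 0x10000000),
]


def pick_endpoint(hpa: int, endpoints=ENDPOINTS) -> Tuple[str, int]:
    """Return (endpoint decoder name, DPA) for an HPA, or raise if unclaimed."""
    for name, base, size in endpoints:
        if base <= hpa < base + size:
            return name, hpa - base  # assumes each device's DPA starts at 0
    raise LookupError(f"no endpoint decoder claims HPA {hpa:#x}")


print(pick_endpoint(0x101234567))  # ('decoder2.0', 19088743)
print(pick_endpoint(0x111234567))  # ('decoder3.0', 19088743)
```

Both addresses land on the same DPA (0x1234567) on different devices, which is why the endpoint ranges must not overlap.
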
> 
> 
> What if we had 2 endpoints, each attached to their own host bridges?
> In this case we'd have 2 root ports and host bridge decoders.
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region 1
>               Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000007           <- Host Bridge _UID
> 
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000110000000   <- Memory Region 2
>               Window size : 0000000010000000   <- 256MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000006           <- Host Bridge _UID
> 
> Memory Map - this may or may not be collapsed depending on Linux arch
>    [mem 0x0000000100000000-0x0000000110000000] usable  <- System Phys Address
>    [mem 0x0000000110000000-0x0000000120000000] usable  <- System Phys Address
> 
> Decoders
>              decoder0.0                     decoder1.0   - roots
>      [0x100000000, 0x110000000]     [0x110000000, 0x120000000]
>                  |                              |
>              decoder2.0                     decoder3.0   - host bridges
>      [0x100000000, 0x110000000]     [0x110000000, 0x120000000]
>                  |                              |
>              decoder4.0                     decoder5.0   - endpoints
>      [0x100000000, 0x110000000]     [0x110000000, 0x120000000]
> ```
> 
> This scenario looks functionally the same as the first - with two distinct,
> non-overlapping sets of decoders (any given SPA may only be serviced by
> one device).  The platform memory controller is responsible for routing
> the address to the correct root decoder.
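
As an illustration of that lookup, here is a small Python sketch that takes the two CFMWS entries from the example and answers "which host bridge _UID owns this SPA?". The `Cfmws` tuple and both helpers are hypothetical names for this sketch; it does not parse a real CEDT, and windows are treated as half-open ranges.

```python
from typing import List, NamedTuple, Optional


class Cfmws(NamedTuple):
    """The three CFMWS fields used in the example (not the full ACPI structure)."""
    base: int
    size: int
    first_target: int  # host bridge _UID


WINDOWS = [
    Cfmws(0x100000000, 0x10000000, 7),
    Cfmws(0x110000000, 0x10000000, 6),
]


def host_bridge_for(spa: int, windows: List[Cfmws] = WINDOWS) -> Optional[int]:
    """Return the _UID of the host bridge whose window covers spa, if any."""
    for w in windows:
        if w.base <= spa < w.base + w.size:
            return w.first_target
    return None


def windows_overlap(windows: List[Cfmws]) -> bool:
    """True if any two windows share an address (an SPA must have one owner)."""
    ordered = sorted(windows, key=lambda w: w.base)
    return any(a.base + a.size > b.base for a, b in zip(ordered, ordered[1:]))


print(host_bridge_for(0x101234567))  # 7
print(host_bridge_for(0x111234567))  # 6
print(windows_overlap(WINDOWS))      # False
```
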
> 
> In Section 4 (Interleave) we'll discuss a bit how the interleave is
> accomplished - as this depends whether you are interleaving across
> host bridges (aggregation) or within a host bridge (bifurcation).
> 
> 
> 
> ---------------------------------------------
> Nuance: Host Physical Address... translation?
> ---------------------------------------------
> 
> You might have noticed that all the addresses in the examples I showed
> are direct subsets of their parent decoder address ranges.  The root is
> assigned a System Physical Address according to the system memory map,
> and all decoders under it are a subset of that range.
> 
> You may have even noticed routing steps suddenly change from SPA to HPA
> 
>    0) CPU accesses SPA(0x101234567)
> 
>    1) root decoder identifies HPA(0x101234567) is valid and forwards
>       to host bridge associated with that address (port 1)
> 
> So what the heck is a "Host Physical Address"?
> Why isn't everything just described as a "System Physical Address"?
> 
> CXL HDM decoders *definitionally* handle HPA to DPA translations.
> 
> That's it, that's the definition of an HPA.
> 
> On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
> so all the decoders will appear to be programmed with SPA.  The platform
> MAY perform translation before a request is routed to the decoder complex.
> 
> I will cover an example of this in-depth in an interleave addendum.
> 
> So the answer is that some ambiguity exists regarding whether platforms
> can/should do translation prior to HDM decoders even being utilized.  So
> for the sake of making everything more complicated and confusing for very
> little value:
> 
> 1) decoders definitionally do "HPA to DPA" translation
> 2) most of the time "SPA=HPA"
> 3) so decoders mostly do "SPA to DPA" translation
> 
> If you're confused, that's ok, I was too - and still am.  But hopefully
> between this section and Section 4 (Interleave) we can be marginally
> less confused together.
> 
> 
> -----------------------------------------------
> Nuance: Memory Holes and Hotplug Memory Blocks!
> -----------------------------------------------
> Help, BIOS split my memory device across non-contiguous memory regions!
> 
> ```
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000100000000   <- Memory Region 1
>               Window size : 0000000008000000   <- 128MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000007           <- Host Bridge _UID
> 
> CEDT
>             Subtable Type : 01 [CXL Fixed Memory Window Structure]
>                  Reserved : 00
>                    Length : 002C
>                  Reserved : 00000000
>       Window base address : 0000000110000000   <- Memory Region 2
>               Window size : 0000000008000000   <- 128MB
> Interleave Members (2^n) : 00                 <- Not interleaved
>              First Target : 00000007           <- Host Bridge _UID
> 
> Memory Map
>    [mem 0x0000000100000000-0x0000000107FFFFFF] usable  <- SPA
>    [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
>    [mem 0x0000000110000000-0x0000000117FFFFFF] usable  <- SPA
> ```
> 
> Take a breath. Everything will be ok.
> 
> You can have multiple decoders at each point in the decoder complex!
> (Most devices should implement multiple decoders).
> 
> ```
> Decoders
>                                Root Port 0
>                               /          \
>                      decoder0.0          decoder0.1
>      [0x100000000, 0x108000000]          [0x110000000, 0x118000000]
>                               \          /
>                              Host Bridge 7
>                               /          \
>                      decoder1.0          decoder1.1
>      [0x100000000, 0x108000000]          [0x110000000, 0x118000000]
>                               \          /
>                                Endpoint 0
>                               /          \
>                      decoder2.0          decoder2.1
>      [0x100000000, 0x108000000]          [0x110000000, 0x118000000]
> ```
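
To see how two endpoint decoders cover a device split around a hole, here is a minimal Python sketch of the endpoint's HPA to DPA lookup. It rests on one assumption that the email does not state: each decoder's DPA range starts where the previous decoder's ended (no DPA skip), so decoder2.1 maps to the second 128MB of the device. The `hpa_to_dpa` helper is a hypothetical name and ranges are treated as half-open.

```python
from typing import List, Optional, Tuple

# (HPA base, size) per endpoint decoder, in decoder order: 2.0, then 2.1.
ENDPOINT_DECODERS: List[Tuple[int, int]] = [
    (0x100000000, 0x8000000),  # decoder2.0, first 128MB
    (0x110000000, 0x8000000),  # decoder2.1, second 128MB
]


def hpa_to_dpa(hpa: int, decoders=ENDPOINT_DECODERS) -> Optional[int]:
    """Translate an HPA to a DPA, or return None if no decoder claims it."""
    dpa_base = 0  # assumption: DPA is packed contiguously in decoder order
    for hpa_base, size in decoders:
        if hpa_base <= hpa < hpa_base + size:
            return dpa_base + (hpa - hpa_base)
        dpa_base += size
    return None


print(hex(hpa_to_dpa(0x100001000)))  # 0x1000
print(hex(hpa_to_dpa(0x110001000)))  # 0x8001000
print(hpa_to_dpa(0x108000000))       # None (address falls in the hole)
```
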
> 
> If your BIOS adds a memory hole, it better also use multiple decoders.
> 
> Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
> having size and alignment issues!
> 
> If your BIOS adds a memory hole, it better also do it on Linux hotplug
> memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
> block of capacity per CFMWS.
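
The alignment cost can be worked out with a little arithmetic. The Python sketch below assumes that only whole memory blocks lying entirely inside a window can be onlined, and uses the 2GB block size mentioned above. The two 4GB windows are made-up inputs chosen for illustration, not values from the tables in this email, and `usable_blocks` is a hypothetical helper.

```python
BLOCK = 0x80000000  # 2GB hotplug memory block size, per the text above


def usable_blocks(base: int, size: int, block: int = BLOCK) -> int:
    """Count whole memory blocks that fit entirely inside [base, base + size)."""
    start = -(-base // block) * block       # round base up to a block boundary
    end = (base + size) // block * block    # round the end down to a boundary
    return max(0, (end - start) // block)


def lost_bytes(base: int, size: int, block: int = BLOCK) -> int:
    """Capacity in the window that cannot be onlined under the assumption above."""
    return size - usable_blocks(base, size, block) * block


# Hypothetical 4GB window starting on a 2GB boundary: nothing is lost.
print(usable_blocks(0x100000000, 0x100000000), hex(lost_bytes(0x100000000, 0x100000000)))
# 2 0x0

# Same size, but shifted by 256MB: one full block of capacity is lost.
print(usable_blocks(0x110000000, 0x100000000), hex(lost_bytes(0x110000000, 0x100000000)))
# 1 0x80000000
```

Under the same assumption, a window smaller than one block, such as the 128MB windows in the example, yields zero usable blocks.
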
> 
> Oi, talk about some rough edges, right? :[
> 
> ---------------------------------------
> Nuance: BIOS vs OS Programmed Decoders.
> ---------------------------------------
> The driver can (and does) program these decoders.  However, it's
> entirely normal for BIOS/EFI to program decoders prior to OS init.
> 
> Earlier in section 2 I said:
>    Most associations built by the driver are done by validating decoders
> 
> What I meant by this is the driver does one of two things with decoders:
> 
>     1) Detects BIOS programmed decoders and sanity checks them.
>        If an unexpected configuration is found, it bails out.
>        This memory is not accessible if EFI_MEMORY_SP is set.
> 
>     2) Provide an interface for user policy configuration of the decoders
> 
> For the most part, the mechanism is the same.  This carve-out is to tell
> you that if something isn't working, you should check whether the BIOS/EFI
> or the driver programmed the decoders. It will help you debug the issue faster.
> 
>          In my experience, it's USUALLY a bad ACPI table.
> 
> This distinction will be more important in Section 4 (Interleave) when
> we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.
> 
> ~Gregory
>