(background reading as we build up complexity)

Driver Management - Decoders, HPA/SPA, DAX, and RAS.

The Drivers
===========
----------------------
The Story Up 'til Now.
----------------------

When we left the Platform arena, assuming we've configured with special
purpose memory, we are left with an entry in the memory map like so:

BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
/proc/iomem: c050000000-fcefffffff : Soft Reserved

This resource (see mm/resource.c) is left unused until a driver comes
along to actually surface it to allocators (or some other interface).

In our case, the drivers involved (or at least the ones we'll reference)
are:

drivers/base/      : device probing, memory (block) hotplug
drivers/acpi/      : device hotplug
drivers/acpi/numa  : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
drivers/pci/       : PCI device probing
drivers/cxl/       : CXL device probing
drivers/dax/       : cxl device to memory resource association

We don't necessarily care about the specifics of each driver; we'll
focus on just the aspects that ultimately affect memory management.

-------------------------------
Step 4: Basic build complexity.
-------------------------------
To make a long story short:

CXL Build Configurations:
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION

DAX Build Configurations:
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM

Without all of these enabled, your journey will be cut short because
some piece of the probe process will stop progressing.

The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
being enabled. You end up with memory regions without dax devices.

[/sys/bus/cxl/devices]# ls
dax_region0  decoder0.0  decoder1.0  decoder2.0  .....
dax_region1  decoder0.1  decoder1.1  decoder3.0  .....

^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
    surface as dax devices, which can then be converted to system ram.

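As a quick sanity check for the build options listed in Step 4, here is
a minimal sketch that scans kernel config text and reports which of
those options are not enabled. The helper names (`enabled_options`,
`missing_options`) and the inline `sample` are hypothetical, made up for
illustration. On many distributions the real config text can be found
in /boot/config-$(uname -r) or /proc/config.gz, but that depends on how
the kernel was built and packaged.

```python
# Options taken from the Step 4 list above.
REQUIRED = (
    "CONFIG_CXL_ACPI",
    "CONFIG_CXL_BUS",
    "CONFIG_CXL_MEM",
    "CONFIG_CXL_PCI",
    "CONFIG_CXL_PORT",
    "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX",
    "CONFIG_DEV_DAX_CXL",
    "CONFIG_DEV_DAX_KMEM",
)


def enabled_options(config_text):
    """Return the set of options set to 'y' (built-in) or 'm' (module)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value in ("y", "m"):
            enabled.add(name)
    return enabled


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not enabled, in list order."""
    enabled = enabled_options(config_text)
    return [opt for opt in required if opt not in enabled]


# Hypothetical config fragment reproducing the common misconfiguration.
sample = """\
CONFIG_CXL_ACPI=y
CONFIG_CXL_BUS=y
CONFIG_CXL_MEM=m
CONFIG_CXL_PCI=y
CONFIG_CXL_PORT=y
CONFIG_CXL_REGION=y
CONFIG_DEV_DAX=m
# CONFIG_DEV_DAX_CXL is not set
CONFIG_DEV_DAX_KMEM=m
"""

print(missing_options(sample))  # ['CONFIG_DEV_DAX_CXL']
```

An empty result only says the options are present in the config text;
it does not prove that every driver actually probed successfully.
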
---------------------------------------------------------------
Step 5: The CXL driver associating devices and iomem resources.
---------------------------------------------------------------

The CXL driver wires up the following devices:
  root : CXL root
  portN : An intermediate or endpoint destination for accesses
  memN : memory devices

Each device in the hierarchy may have one or more decoders
  decoderN.M : Address routing and translation devices

The driver will also create additional objects and associations
  regionN : device-to-iomem resource mapping
  dax_regionN : region-to-dax device mapping

Most associations built by the driver are done by validating decoders
against each other at each point in the hierarchy.

Root decoders describe memory regions and route DMA to ports.
Intermediate decoders route DMA through CXL fabric.
Endpoint decoders translate addresses (Host to device).

A Root port has 1 decoder per associated CFMW in the CEDT
  decoder0.0 -> `c050000000-fcefffffff : Soft Reserved`

A region (iomem resource mapping) can be created for these decoders
  [/sys/bus/cxl/devices/region0]# cat resource size target0
  0xc050000000
  0x3ca0000000
  decoder5.0

A dax_region surfaces these regions as a dax device
  [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
  0xc050000000

So in a simple environment with 1 device, we end up with a mapping
that looks something like this.

     root    ---  decoder0.0  ---  region0  --  dax_region0  --  dax0
      |              |                |
    port1    ---  decoder1.0          |
      |              |                |
  endpoint0  ---  decoder3.0--------/

Much of the complexity in region creation stems from validating decoder
programming and associating regions with targets (endpoint decoders).

The take-away from this section is the existence of "decoders", of which
there may be an arbitrary number between the root and endpoint.

This will be relevant when we talk about RAS (Poison) and Interleave.

---------------------------------------------------------------
Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
---------------------------------------------------------------

The last step in surfacing memory to allocators is to convert a dax
device into memory blocks. On most default kernel builds, dax devices
are not automatically converted to SystemRAM.

Policy Choices
  userland policy: daxctl
  default-online : CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
                   or
                   CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
                   or
                   memhp_default_state=*

To convert a dax device to SystemRAM utilizing daxctl:

  daxctl online-memory dax0.0 [--no-movable]

  By default the memory will online into ZONE_MOVABLE
  The --no-movable option will online the memory in ZONE_NORMAL

Alternatively, this can be done at Build or Boot time using
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_*     (v6.14 or above)
  memhp_default_state=*                (boot param predating cxl)

I will save the discussion of ZONE selection for the next section,
which will cover more memory-hotplug specifics.

At this point, the memory blocks are exposed to the kernel mm allocators
and may be used as normal System RAM.

---------------------------------------------------------
Second bit of nuanced complexity: Memory Block Alignment.
---------------------------------------------------------
In section 1, we introduced CEDT / CFMW and how they map to iomem
resources. In this section we discussed how we surface memory blocks
to the kernel allocators.

However, at no time did platform, arch code, and driver communicate
about the expected size of a memory block. In most cases, the size
of a memory block is defined by the architecture - unaware of CXL.

On x86, for example, the heuristic for memory block size is:
  1) user boot-arg value
  2) Maximize size (up to 2GB) if operating on bare metal
  3) Use smallest value that aligns with the end of memory

The problem is that [SOFT RESERVED] memory is not considered in the
alignment calculation - and not all [SOFT RESERVED] memory *should*
be considered for alignment.

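To make the alignment arithmetic concrete, here is a rough sketch using
the soft reserved range from the memory map at the top of this section.
It assumes the range is trimmed inward to whole memory blocks (start
rounded up, end rounded down), and it treats the block size as a free
parameter rather than guessing what a particular platform picked. The
function name `alignment_loss` is hypothetical. Figures from a real
system can differ depending on the block size actually chosen and on
how the driver trims the range.

```python
MIB = 1 << 20
GIB = 1 << 30


def alignment_loss(base, size, block_size):
    """Return (front, back) bytes cut off when [base, base + size) is
    trimmed inward to block_size boundaries."""
    limit = base + size
    start = (base + block_size - 1) // block_size * block_size  # round up
    end = limit // block_size * block_size                      # round down
    if end <= start:
        # Not even one whole block fits, so the entire range is lost.
        return size, 0
    return start - base, limit - end


# Range taken from the e820 / iomem entry shown earlier.
WINDOW_BASE = 0xC050000000
WINDOW_SIZE = 0xFCEFFFFFFF + 1 - WINDOW_BASE

# Assumed candidate block sizes, purely for comparison.
for block in (256 * MIB, 512 * MIB, 1 * GIB, 2 * GIB):
    front, back = alignment_loss(WINDOW_BASE, WINDOW_SIZE, block)
    print(f"block={block // MIB:5d}MB  "
          f"front={front // MIB:5d}MB  back={back // MIB:5d}MB")
```

The point of looping over block sizes is that the same window can lose
nothing or lose a large chunk depending on a value the platform and the
driver never negotiated.
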
In the case of our working example (real system, btw):

         Subtable Type : 01 [CXL Fixed Memory Window Structure]
   Window base address : 000000C050000000
           Window size : 0000003CA0000000

The base is 256MB aligned (the minimum for the CXL Spec), and the
window size is 512MB aligned. This results in a loss of almost a full
memory block worth of memory (~1280MB on the front, and ~512MB on the
back).

This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).

A patch set [1] has been proposed to allow drivers (specifically ACPI)
to advise the memory hotplug system on the suggested alignment, and for
arch code to choose how to utilize this advisement.

[1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@xxxxxxxxxx/

--------------------------------------------------------------------
The Complexity story up til now (what's likely to show up in slides)
--------------------------------------------------------------------
Platform and BIOS:
  May configure all the devices prior to kernel hand-off.
  May or may not support reconfiguring / hotplug.

BIOS and EFI:
  EFI_MEMORY_SP - used to defer management to drivers

Kernel Build and Boot:
  CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
  nosoftreserve             - Will always result in CXL as SystemRAM
  kexec                     - SystemRAM configs carry over to target

Driver Build Options Required
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM

User Policy
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
  memhp_default_state                  (boot param)
  daxctl online-memory daxN.Y          (userland)

Nuances
  Early-boot resource re-use
  Memory Block Alignment

--------------------------------------------------------------------
Next Up:
  Memory (Block) Hotplug - Zones and Kernel Use of CXL
  RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
  Interleave - RAS and Region Management (Hotplug-ability)

~Gregory