Gregory Price wrote:
> (background reading as we build up complexity)

Thanks for this taxonomy!

>
> Driver Management - Decoders, HPA/SPA, DAX, and RAS.
>
> The Drivers
> ===========
> ----------------------
> The Story Up 'til Now.
> ----------------------
>
> When we left the Platform arena, assuming we've configured with special
> purpose memory, we are left with an entry in the memory map like so:
>
>   BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
>   /proc/iomem: c050000000-fcefffffff : Soft Reserved
>
> This resource (see mm/resource.c) is left unused until a driver comes
> along to actually surface it to allocators (or some other interface).
>
> In our case, the drivers involved (or at least the ones we'll reference)
>
>   drivers/base/     : device probing, memory (block) hotplug
>   drivers/acpi/     : device hotplug
>   drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
>   drivers/pci/      : PCI device probing
>   drivers/cxl/      : CXL device probing
>   drivers/dax/      : cxl device to memory resource association
>
> We don't necessarily care about the specifics of each driver, we'll
> focus on just the aspects that ultimately affect memory management.
>
> -------------------------------
> Step 4: Basic build complexity.
> -------------------------------
> To make a long story short:
>
> CXL Build Configurations:
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
>
> DAX Build Configurations:
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
>
> Without all of these enabled, your journey will end up cut short because
> some piece of the probe process will stop progressing.
>
> The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
> being enabled. You end up with memory regions without dax devices.
>
>   [/sys/bus/cxl/devices]# ls
>   dax_region0 decoder0.0 decoder1.0 decoder2.0 .....
>   dax_region1 decoder0.1 decoder1.1 decoder3.0 .....
>
> ^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
> surface as dax devices, which can then be converted to system ram.

At least for this problem the plan is to fall back to
CONFIG_DEV_DAX_HMEM [1], which skips all of the RAS and device
enumeration benefits and just shunts EFI_MEMORY_SP over to device_dax.

There is also the panic button of efi=nosoftreserve, which is the flag
of surrender if the kernel fails to parse the CXL configuration.

I am otherwise open to suggestions about a better model for how to
handle a type of memory capacity that elicits diverging opinions on
whether it should be treated as System RAM, dedicated application
memory, or some kind of cold-memory swap target.

[1]: http://lore.kernel.org/cover.1737046620.git.nathan.fontenot@xxxxxxx

> ---------------------------------------------------------------
> Step 5: The CXL driver associating devices and iomem resources.
> ---------------------------------------------------------------
>
> The CXL driver wires up the following devices:
>   root  : CXL root
>   portN : An intermediate or endpoint destination for accesses
>   memN  : memory devices
>
>
> Each device in the hierarchy may have one or more decoders
>   decoderN.M : Address routing and translation devices
>
>
> The driver will also create additional objects and associations
>   regionN     : device-to-iomem resource mapping
>   dax_regionN : region-to-dax device mapping
>
>
> Most associations built by the driver are done by validating decoders
> against each other at each point in the hierarchy.
>
> Root decoders describe memory regions and route DMA to ports.
> Intermediate decoders route DMA through CXL fabric.
> Endpoint decoders translate addresses (Host to device).
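
As a rough way to spot the "regions without dax devices" symptom, here is a minimal Python sketch that groups the names under /sys/bus/cxl/devices by object type and lists regions with no same-numbered dax_region. The helper names are made up for illustration, and pairing regionN with dax_regionN by number is an assumption taken from the listings in this thread, not a documented guarantee.

```python
import os
import re
from collections import defaultdict

# Name prefixes taken from the sysfs listings in this thread.
_KIND_RE = re.compile(
    r"^(dax_region|decoder|endpoint|region|root|port|mem)(\d+)(?:\.(\d+))?$"
)


def group_cxl_devices(names):
    """Group CXL bus device names by object type (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        match = _KIND_RE.match(name)
        kind = match.group(1) if match else "other"
        groups[kind].append(name)
    return {kind: sorted(members) for kind, members in groups.items()}


def regions_missing_dax(names):
    """Return regions that have no dax_region with the same number.

    Assumption: regionN pairs with dax_regionN, as in the example listing.
    """
    groups = group_cxl_devices(names)
    dax_ids = {n[len("dax_region"):] for n in groups.get("dax_region", [])}
    return [
        r for r in groups.get("region", [])
        if r[len("region"):] not in dax_ids
    ]


if __name__ == "__main__":
    sysfs_dir = "/sys/bus/cxl/devices"
    if os.path.isdir(sysfs_dir):
        names = os.listdir(sysfs_dir)
        for kind, members in sorted(group_cxl_devices(names).items()):
            print(f"{kind}: {' '.join(members)}")
        for region in regions_missing_dax(names):
            print(f"{region}: no dax_region found, check CONFIG_DEV_DAX_CXL")
    else:
        print(f"{sysfs_dir} not present (CXL bus not built or no devices)")
```

An empty result from `regions_missing_dax` only says every region has a same-numbered dax_region; it says nothing about whether the dax device was later onlined as memory.
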
>
>
> A Root port has 1 decoder per associated CFMW in the CEDT
>   decoder0.0 -> `c050000000-fcefffffff : Soft Reserved`
>
>
> A region (iomem resource mapping) can be created for these decoders
>   [/sys/bus/cxl/devices/region0]# cat resource size target0
>   0xc050000000 0x3ca0000000 decoder5.0
>
>
> A dax_region surfaces these regions as a dax device
>   [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
>   0xc050000000
>
>
> So in a simple environment with 1 device, we end up with a mapping
> that looks something like this.
>
>   root      --- decoder0.0 --- region0 -- dax_region0 -- dax0
>    |               |              |
>   port1     --- decoder1.0        |
>    |               |              |
>   endpoint0 --- decoder3.0--------/
>
>
> Much of the complexity in region creation stems from validating decoder
> programming and associating regions with targets (endpoint decoders).
>
> The take-away from this section is the existence of "decoders", of which
> there may be an arbitrary number between the root and endpoint.
>
> This will be relevant when we talk about RAS (Poison) and Interleave.

Good summary. I often look at this pile of objects and wonder "why so
complex", but then I look at the heroics of drivers/edac/. Compared to
that wide range of implementation-specific quirks of various memory
controllers, the CXL object hierarchy does not look that bad.

> ---------------------------------------------------------------
> Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
> ---------------------------------------------------------------
>
> The last step in surfacing memory to allocators is to convert a dax
> device into memory blocks. On most default kernel builds, dax devices
> are not automatically converted to SystemRAM.

I thought most distributions are shipping with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or the default online udev rule?

For example Fedora is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and RHEL is
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, but with the udev hotplug rule.
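
As a rough way to see which kernel-side default a given system ended up with, here is a minimal Python sketch. It assumes the `auto_online_blocks` sysfs knob described in the kernel's memory-hotplug admin guide is present. The helper names are made up for illustration, the descriptions are paraphrases, and it does not detect a udev rule.

```python
from pathlib import Path

# Sysfs knob from the kernel's memory-hotplug admin guide.
AUTO_ONLINE_PATH = Path("/sys/devices/system/memory/auto_online_blocks")

# Paraphrased meanings of the values the knob can hold.
POLICIES = {
    "offline": "new blocks stay offline until userspace (udev, daxctl) onlines them",
    "online": "kernel onlines new blocks and selects the zone itself",
    "online_kernel": "kernel onlines new blocks into a kernel zone",
    "online_movable": "kernel onlines new blocks into ZONE_MOVABLE",
}


def describe_auto_online(value):
    """Map a raw auto_online_blocks value to a short description."""
    key = value.strip()
    return POLICIES.get(key, f"unrecognized policy {key!r}")


def read_auto_online(path=AUTO_ONLINE_PATH):
    """Return the current policy string, or None if the knob is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    policy = read_auto_online()
    if policy is None:
        print("auto_online_blocks not available (memory hotplug not built in?)")
    else:
        print(f"{policy}: {describe_auto_online(policy)}")
```

A result of "offline" here does not mean memory stays offline in practice, since a distribution udev rule may still online each block as it appears.
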
> Policy Choices
>   userland policy: daxctl
>   default-online : CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
>                    or
>                    CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
>                    or
>                    memhp_default_state=*
>
> To convert a dax device to SystemRAM utilizing daxctl:
>
>   daxctl online-memory dax0.0 [--no-movable]

On RHEL at least it finds that udev already took care of it.

>
> By default the memory will online into ZONE_MOVABLE
> The --no-movable option will online the memory in ZONE_NORMAL
>
>
> Alternatively, this can be done at Build or Boot time using
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_*     (v6.14 or above)
>   memhp_default_state=*                (boot param predating cxl)

Oh, TIL the new CONFIG_MHP_DEFAULT_ONLINE_TYPE_* option.

>
> I will save the discussion of ZONE selection to the next section,
> which will cover more memory-hotplug specifics.
>
> At this point, the memory blocks are exposed to the kernel mm allocators
> and may be used as normal System RAM.
>
>
> ---------------------------------------------------------
> Second bit of nuanced complexity: Memory Block Alignment.
> ---------------------------------------------------------
> In section 1, we introduced CEDT / CFMW and how they map to iomem
> resources. In this section we discussed how we surface memory blocks
> to the kernel allocators.
>
> However, at no time did platform, arch code, and driver communicate
> about the expected size of a memory block. In most cases, the size
> of a memory block is defined by the architecture - unaware of CXL.
>
> On x86, for example, the heuristic for memory block size is:
>   1) user boot-arg value
>   2) Maximize size (up to 2GB) if operating on bare metal
>   3) Use smallest value that aligns with the end of memory
>
> The problem is that [SOFT RESERVED] memory is not considered in the
> alignment calculation - and not all [SOFT RESERVED] memory *should*
> be considered for alignment.
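
To make the alignment effect concrete, here is a minimal Python sketch. It assumes the unaligned head and tail of a range are simply dropped when the range is carved into memory blocks, and it uses a made-up window with a 2GB block size rather than values from any real system. `Trimmed` and `trim_to_blocks` are hypothetical names for illustration.

```python
from typing import NamedTuple

MiB = 1 << 20
GiB = 1 << 30


class Trimmed(NamedTuple):
    start: int  # first usable byte after aligning up
    size: int   # usable bytes (a whole number of memory blocks)
    lost: int   # bytes dropped from the head and tail combined


def align_up(value, alignment):
    return (value + alignment - 1) // alignment * alignment


def align_down(value, alignment):
    return value // alignment * alignment


def trim_to_blocks(base, size, block_size):
    """Trim [base, base + size) to whole memory blocks.

    Assumption: partial blocks at either end cannot be onlined and are lost.
    """
    start = align_up(base, block_size)
    end = align_down(base + size, block_size)
    if end <= start:
        # The window does not contain a single fully aligned block.
        return Trimmed(start=start, size=0, lost=size)
    usable = end - start
    return Trimmed(start=start, size=usable, lost=size - usable)


if __name__ == "__main__":
    # Hypothetical window: 256MB-aligned base, 4GB long, 2GB memory blocks.
    result = trim_to_blocks(0x1050000000, 4 * GiB, 2 * GiB)
    print(f"usable start: {result.start:#x}")
    print(f"usable size : {result.size // MiB} MiB")
    print(f"lost        : {result.lost // MiB} MiB")
```

In this made-up case only one 2GB block fits between the aligned boundaries, so half of the 4GB window is lost. Larger windows lose a smaller fraction, but the absolute loss can still approach two blocks.
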
>
> In the case of our working example (real system, btw):
>
>   Subtable Type       : 01 [CXL Fixed Memory Window Structure]
>   Window base address : 000000C050000000
>   Window size         : 0000003CA0000000
>
> The base is 256MB aligned (the minimum for the CXL Spec), and the
> window size is 512MB aligned. This results in a loss of almost a full
> memory block worth of memory (~1280MB on the front, and ~512MB on the
> back).
>
> This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).

This feels like an example of "hey platform vendors, I understand
that spec grants you the freedom to misalign, please refrain from taking
advantage of that freedom".

>
> [1] has been proposed to allow for drivers (specifically ACPI) to advise
> the memory hotplug system on the suggested alignment, and for arch code
> to choose how to utilize this advisement.
>
> [1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@xxxxxxxxxx/
>
>
> --------------------------------------------------------------------
> The Complexity story up til now (what's likely to show up in slides)
> --------------------------------------------------------------------
> Platform and BIOS:
>   May configure all the devices prior to kernel hand-off.
>   May or may not support reconfiguring / hotplug.
> BIOS and EFI:
>   EFI_MEMORY_SP - used to defer management to drivers
> Kernel Build and Boot:
>   CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
>   nosoftreserve             - Will always result in CXL as SystemRAM
>   kexec                     - SystemRAM configs carry over to target
> Driver Build Options Required
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
> User Policy
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
>   memhp_default_state                  (boot param)
>   daxctl online-memory daxN.Y          (userland)

    memory hotplug udev rule             (userland)

> Nuances
>   Early-boot resource re-use
>   Memory Block Alignment
>
> --------------------------------------------------------------------
> Next Up:
>   Memory (Block) Hotplug - Zones and Kernel Use of CXL
>   RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
>   Interleave - RAS and Region Management (Hotplug-ability)

Really appreciate you organizing all of this information.