Re: CXL Boot to Bash - Section 2: The Drivers

Gregory Price wrote:
> (background reading as we build up complexity)

Thanks for this taxonomy!

> 
> Driver Management - Decoders, HPA/SPA, DAX, and RAS.
> 
> The Drivers
> ===========
> ----------------------
> The Story Up 'til Now.
> ----------------------
> 
> When we left the Platform arena, assuming we've configured with special
> purpose memory, we are left with an entry in the memory map like so:
> 
> BIOS-e820:   [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
> /proc/iomem: c050000000-fcefffffff : Soft Reserved
> 
> This resource (see mm/resource.c) is left unused until a driver comes
> along to actually surface it to allocators (or some other interface).
> 
> In our case, the drivers involved (or at least the ones we'll reference) are:
> 
> drivers/base/     : device probing, memory (block) hotplug
> drivers/acpi/     : device hotplug
> drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
> drivers/pci/      : PCI device probing
> drivers/cxl/      : CXL device probing
> drivers/dax/      : CXL device to memory resource association
> 
> We don't necessarily care about the specifics of each driver; we'll
> focus on just the aspects that ultimately affect memory management.
> 
> -------------------------------
> Step 4: Basic build complexity.
> -------------------------------
> To make a long story short:
> 
> CXL Build Configurations:
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
> 
> DAX Build Configurations:
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
> 
> Without all of these enabled, your journey will end up cut short because
> some piece of the probe process will stop progressing.
> 
> The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
> being enabled. You end up with memory regions without dax devices.
> 
> [/sys/bus/cxl/devices]# ls
> dax_region0  decoder0.0  decoder1.0  decoder2.0 .....
> dax_region1  decoder0.1  decoder1.1  decoder3.0 .....
> 
> ^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
> surface as dax devices, which can then be converted to system ram.
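
A minimal sketch of checking a kernel config for the options listed above. The option names come from the list in this thread; the helper name `missing_options` and the idea of feeding it the text of a file such as /boot/config-<version> are illustrative assumptions, not part of any kernel tooling.

```python
# Options named in the "CXL Build Configurations" and
# "DAX Build Configurations" lists above.
REQUIRED = [
    "CONFIG_CXL_ACPI",
    "CONFIG_CXL_BUS",
    "CONFIG_CXL_MEM",
    "CONFIG_CXL_PCI",
    "CONFIG_CXL_PORT",
    "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX",
    "CONFIG_DEV_DAX_CXL",
    "CONFIG_DEV_DAX_KMEM",
]


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' or 'm'.

    config_text is the content of a kernel .config style file.
    (Hypothetical helper, for illustration only.)
    """
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


if __name__ == "__main__":
    sample = "CONFIG_CXL_BUS=y\nCONFIG_CXL_ACPI=m\n# CONFIG_DEV_DAX_CXL is not set\n"
    for opt in missing_options(sample):
        print("missing:", opt)
```

Built-in (=y) and modular (=m) are both treated as enabled here; whether a module actually gets loaded at boot is a separate question this sketch does not answer.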

At least for this problem the plan is to fall back to
CONFIG_DEV_DAX_HMEM [1] which skips all of the RAS and device
enumeration benefits and just shunts EFI_MEMORY_SP over to device_dax.

There is also the panic button of efi=nosoftreserve which is the flag of
surrender if the kernel fails to parse the CXL configuration.

I am otherwise open to suggestions about a better model for how to
handle a type of memory capacity that elicits diverging opinions on
whether it should be treated as System RAM, dedicated application
memory, or some kind of cold-memory swap target.

[1]: http://lore.kernel.org/cover.1737046620.git.nathan.fontenot@xxxxxxx

> ---------------------------------------------------------------
> Step 5: The CXL driver associating devices and iomem resources.
> ---------------------------------------------------------------
> 
> The CXL driver wires up the following devices:
>    root        :  CXL root
>    portN       :  An intermediate or endpoint destination for accesses
>    memN        :  memory devices
> 
> 
> Each device in the hierarchy may have one or more decoders
>    decoderN.M  :  Address routing and translation devices
> 
> 
> The driver will also create additional objects and associations
>    regionN     :  device-to-iomem resource mapping
>    dax_regionN :  region-to-dax device mapping
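
A rough sketch of sorting the names seen under /sys/bus/cxl/devices into the object kinds listed above. The name patterns are inferred from the listings in this thread (for example decoder0.0 and dax_region0); the `classify` helper and its pattern table are assumptions made for illustration.

```python
import re

# Patterns inferred from the object names quoted in this thread.
# Order matters only for readability; fullmatch keeps them exclusive.
KINDS = [
    ("dax_region", r"dax_region\d+"),
    ("decoder", r"decoder\d+\.\d+"),
    ("region", r"region\d+"),
    ("root", r"root\d*"),
    ("port", r"port\d+"),
    ("endpoint", r"endpoint\d+"),
    ("mem", r"mem\d+"),
]


def classify(name):
    """Map a device name to its kind, or 'unknown' if nothing matches."""
    for kind, pattern in KINDS:
        if re.fullmatch(pattern, name):
            return kind
    return "unknown"


if __name__ == "__main__":
    for name in ["dax_region0", "decoder0.0", "decoder1.1", "region0", "mem0"]:
        print(name, "->", classify(name))
```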
> 
> 
> Most associations built by the driver are done by validating decoders
> against each other at each point in the hierarchy.
> 
>   Root decoders describe memory regions and route DMA to ports.
>   Intermediate decoders route DMA through CXL fabric.
>   Endpoint decoders translate addresses (Host to device).
> 
> 
> A Root port has 1 decoder per associated CFMW in the CEDT
>    decoder0.0  ->  `c050000000-fcefffffff   : Soft Reserved`
> 
> 
> A region (iomem resource mapping) can be created for these decoders
>    [/sys/bus/cxl/devices/region0]# cat resource size target0
>       0xc050000000   0x3ca0000000   decoder5.0
> 
> 
> A dax_region surfaces these regions as a dax device
>    [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
>       0xc050000000
> 
> 
> So in a simple environment with 1 device, we end up with a mapping
> that looks something like this.
> 
>      root      ---   decoder0.0  --- region0 -- dax_region0 -- dax0
>        |                |              |
>      port1     ---   decoder1.0        |
>        |                |              |
>      endpoint0 ---   decoder3.0--------/
> 
> 
> Much of the complexity in region creation stems from validating decoder
> programming and associating regions with targets (endpoint decoders).
> 
> The take-away from this section is the existence of "decoders", of which
> there may be an arbitrary number between the root and endpoint.
> 
> This will be relevant when we talk about RAS (Poison) and Interleave.
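
As a small worked check, the region0 `resource` and `size` values quoted above can be turned back into the /proc/iomem style range from the start of the section. The function name `iomem_range` is made up for this sketch.

```python
def iomem_range(resource, size):
    """Format a start address and size (hex strings) as 'start-end'.

    The end is inclusive, matching how /proc/iomem prints ranges.
    (Hypothetical helper, for illustration only.)
    """
    start = int(resource, 16)
    end = start + int(size, 16) - 1
    return f"{start:x}-{end:x}"


if __name__ == "__main__":
    # Values as shown by `cat resource size` for region0 above.
    print(iomem_range("0xc050000000", "0x3ca0000000"))
```

For the quoted values this yields c050000000-fcefffffff, the same span as the Soft Reserved entry.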

Good summary. I often look at this pile of objects and wonder "why so
complex", but then I look at the heroics of drivers/edac/. Compared to
that wide range of implementation-specific quirks of various memory
controllers, the CXL object hierarchy does not look that bad.

> ---------------------------------------------------------------
> Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
> ---------------------------------------------------------------
> 
> The last step in surfacing memory to allocators is to convert a dax
> device into memory blocks. On most default kernel builds, dax devices
> are not automatically converted to SystemRAM.

I thought most distributions are shipping with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or the default online udev rule?
For example Fedora is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and RHEL is
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, but with the udev hotplug rule.

> Policy Choices
>    userland policy:  daxctl
>    default-online :  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
>                      or
>                      CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
>                      or
>                      memhp_default_state=*
> 
> To convert a dax device to SystemRAM utilizing daxctl:
> 
>   daxctl online-memory dax0.0 [--no-movable]

On RHEL at least it finds that udev already took care of it.

> 
>   By default the memory will online into ZONE_MOVABLE
>   The --no-movable option will online the memory in ZONE_NORMAL
> 
> 
> Alternatively, this can be done at Build or Boot time using
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE   (v6.13 or below)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_*       (v6.14 or above)
>   memhp_default_state=*                  (boot param predating cxl)
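
A minimal sketch of the version split in the list above: given a kernel version string, return which build-time option name applies. The v6.13 / v6.14 boundary is taken from the quoted text; the function name `default_online_option` is an assumption for illustration.

```python
import re


def default_online_option(version):
    """Pick the build-time default-online option for a kernel version.

    Follows the split quoted above: v6.13 or below versus v6.14 or above.
    (Hypothetical helper, for illustration only.)
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (6, 14):
        return "CONFIG_MHP_DEFAULT_ONLINE_TYPE_*"
    return "CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE"


if __name__ == "__main__":
    for v in ["6.13", "6.14.2", "6.8.0-rc1"]:
        print(v, "->", default_online_option(v))
```

The memhp_default_state= boot parameter is deliberately left out of the function, since the quoted text describes it as available on both sides of the split.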

Oh, TIL the new CONFIG_MHP_DEFAULT_ONLINE_TYPE_* option.

> 
> I will save the discussion of ZONE selection to the next section,
> which will cover more memory-hotplug specifics.
> 
> At this point, the memory blocks are exposed to the kernel mm allocators
> and may be used as normal System RAM.
> 
> 
> ---------------------------------------------------------
> Second bit of nuanced complexity: Memory Block Alignment.
> ---------------------------------------------------------
> In section 1, we introduced CEDT / CFMW and how they map to iomem
> resources.  In this section we discussed how we surface memory blocks
> to the kernel allocators.
> 
> However, at no time did platform, arch code, and driver communicate
> about the expected size of a memory block. In most cases, the size
> of a memory block is defined by the architecture - unaware of CXL.
> 
> On x86, for example, the heuristic for memory block size is:
>    1) user boot-arg value
>    2) Maximize size (up to 2GB) if operating on bare metal
>    3) Use smallest value that aligns with the end of memory
> 
> The problem is that [SOFT RESERVED] memory is not considered in the
> alignment calculation - and not all [SOFT RESERVED] memory *should*
> be considered for alignment.
> 
> In the case of our working example (real system, btw):
> 
>          Subtable Type : 01 [CXL Fixed Memory Window Structure]
>    Window base address : 000000C050000000
>            Window size : 0000003CA0000000
> 
> The base is 256MB aligned (the minimum for the CXL Spec), and the
> window size is 512MB aligned.  This results in a loss of almost a full
> memory block worth of memory (~1280MB on the front, and ~512MB on the back).
> 
> This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).
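
A sketch of how the trimmed head and tail of a window can be computed for a given memory block size. It assumes the window start is rounded up and the window end rounded down to a block boundary, and it assumes a 2GB block size for the example call; both are assumptions of this sketch, and the figures it produces depend on the block size the running kernel actually selects. The helper name `alignment_loss` is made up.

```python
MIB = 1 << 20
GIB = 1 << 30


def alignment_loss(base, size, block):
    """Return (head, tail) bytes that fall outside whole memory blocks.

    head: bytes between the window base and the next block boundary.
    tail: bytes between the last block boundary and the window end.
    Assumes the window spans at least one full block.
    """
    end = base + size
    first_block = (base + block - 1) // block * block  # round up
    last_block = end // block * block                  # round down
    return first_block - base, end - last_block


if __name__ == "__main__":
    # Window base and size from the CFMW dump quoted above.
    head, tail = alignment_loss(0xC050000000, 0x3CA0000000, 2 * GIB)
    print(f"head: {head // MIB} MiB, tail: {tail // MIB} MiB")
```

Re-running with a smaller block size (for example 256 * MIB) shows how the loss shrinks as the block size approaches the window's own alignment.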

This feels like an example of "hey platform vendors, I understand
that the spec grants you the freedom to misalign, please refrain from
taking advantage of that freedom".

> 
> [1] has been proposed to allow for drivers (specifically ACPI) to advise
> the memory hotplug system on the suggested alignment, and for arch code
> to choose how to utilize this advisement.
> 
> [1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@xxxxxxxxxx/
> 
> 
> --------------------------------------------------------------------
> The Complexity story up til now (what's likely to show up in slides)
> --------------------------------------------------------------------
> Platform and BIOS:
>   May configure all the devices prior to kernel hand-off.
>   May or may not support reconfiguring / hotplug.
> BIOS and EFI:
>   EFI_MEMORY_SP              - used to defer management to drivers
> Kernel Build and Boot:
>   CONFIG_EFI_SOFT_RESERVE=n  - Will always result in CXL as SystemRAM
>   nosoftreserve              - Will always result in CXL as SystemRAM
>   kexec                      - SystemRAM configs carry over to target
> Driver Build Options Required
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
> User Policy
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
>   memhp_default_state                  (boot param)
>   daxctl online-memory daxN.Y          (userland)

    memory hotplug udev rule             (userland)

> Nuances
>   Early-boot resource re-use
>   Memory Block Alignment
> 
> --------------------------------------------------------------------
> Next Up:
>    Memory (Block) Hotplug - Zones and Kernel Use of CXL
>    RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
>    Interleave - RAS and Region Management (Hotplug-ability)

Really appreciate you organizing all of this information.



