The story up to now
-------------------
When we left the driver arena, we had created a dax device - which
connects a Soft Reserved iomem resource to one or more `memory blocks`
via the kmem driver. We also discussed a bit about ZONE selection and
default online behavior.

In this section we'll discuss what actually goes into memory block
creation, how those memory blocks are exposed to kernel allocators
(tl;dr: sparsemem / memmap / struct page), and the implications of
the selected memory zones.

-------------------------------------
Step 7: Hot-(un)plug Memory (Blocks).
-------------------------------------
Memory hotplug refers to surfacing physical memory to kernel
allocators (page, slab, cache, etc) - as opposed to the action of
"physically hotplugging" a device into a system (e.g. USB).

Physical memory is exposed to allocators in the form of memory blocks.

A `memory block` is an abstraction to describe a physically
contiguous region of memory, or more explicitly a collection of
physically contiguous page frames which is described by a physically
contiguous set of `struct page` structures in the system memory-map.

The system memmap is what is used for pfn-to-page (struct) and
page(struct)-to-pfn conversions. The system memmap has `flat` and
`sparse` modes (configured at build-time). Memory hotplug requires the
use of `sparsemem`, which aptly makes the memory map sparse.

Hot *remove* (un-plug) is distinct from Hot add (plug). To hot-remove
an active memory block, the pages in-use must have their data (and
therefore mappings) migrated to another memory block. Hot-remove must
be specifically enabled, separately from hotplug.

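Memory blocks are visible from userland through sysfs, which is a
convenient way to see what the kernel has (or has not) onlined. A
minimal Python sketch, assuming the usual `/sys/devices/system/memory`
layout (`block_size_bytes` in hex, one `memoryN` directory per block
with a `state` file); the helper names are made up for illustration:

```python
import os
import re

SYSFS_MEMORY = "/sys/devices/system/memory"


def read_block_size(root=SYSFS_MEMORY):
    """Return the memory block size in bytes (sysfs reports it as hex digits)."""
    with open(os.path.join(root, "block_size_bytes")) as f:
        return int(f.read().strip(), 16)


def list_memory_blocks(root=SYSFS_MEMORY):
    """Return (block id, state) pairs for every memoryN directory, sorted by id."""
    blocks = []
    for name in os.listdir(root):
        match = re.fullmatch(r"memory(\d+)", name)
        if not match:
            continue  # skip non-block entries in the same directory
        with open(os.path.join(root, name, "state")) as f:
            blocks.append((int(match.group(1)), f.read().strip()))
    return sorted(blocks)
```

On a live system both helpers can be called with no arguments; the
`root` parameter exists only so the sketch can be pointed at a test
directory.
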
Build configurations affecting memory block hot(un)plug
  CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
  CONFIG_SPARSEMEM
  CONFIG_64BIT
  CONFIG_MEMORY_HOTPLUG
  CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
  CONFIG_MHP_MEMMAP_ON_MEMORY
  CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  CONFIG_MIGRATION
  CONFIG_MEMORY_HOTREMOVE

During early-boot, the kernel finds all SystemRAM memory regions NOT
marked "Special Purpose" and will create memory blocks for these
regions by default. These blocks are defaulted into ZONE_NORMAL
(more on zones shortly).

Memory regions present at boot marked `EFI_MEMORY_SP` have memory
blocks created and hot-plugged by drivers. The same mechanism is used
to hot-add memory physically hotplugged after system boot (i.e. not
present in the EFI Memory Map at boot time).

The DAX/KMEM driver hotplugs memory blocks via the
`add_memory_driver_managed()` function.

-------------------------------
Step 8: Page Struct allocation.
-------------------------------
A `memory block` is made up of a collection of physical memory pages,
which must have entries in the system Memory Map - which is managed by
sparsemem on systems with memory (block) hotplug. Sparsemem fills the
memory map with `struct page` for hot-plugged memory.

Here is a rough trace through the (current) stack on how page structs
are populated into the system Memory Map on hotplug.

```
add_memory_driver_managed
  add_memory_resource
    memblock_add_node
    arch_add_memory
      init_memory_mapping
      add_pages
        __add_pages
          sparse_add_section
            section_activate
              populate_section_memmap
                __populate_section_memmap
                  memmap_alloc
                    memblock_alloc_try_nid_raw
                      memblock_alloc_internal
                        memblock_alloc_range_nid
                          kzalloc_node(..., GFP_KERNEL, ...)
```

All allocatable memory requires `struct page` resources to describe
the physical page state.

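The cost of those `struct page` entries scales linearly with capacity.
As a rough sketch, assuming 64 bytes per `struct page` (the figure
used later in this post) and regular 4KB pages, the overhead can be
estimated like so. The function names are illustrative, not kernel
APIs:

```python
PAGE_SIZE = 4096        # regular 4KB pages
STRUCT_PAGE_SIZE = 64   # assumed bytes per struct page
GIB = 1 << 30


def memmap_bytes(capacity_bytes, page_size=PAGE_SIZE,
                 struct_page_size=STRUCT_PAGE_SIZE):
    """Bytes of memmap needed to describe capacity_bytes of memory."""
    nr_pages = capacity_bytes // page_size
    return nr_pages * struct_page_size


def memmap_fraction(page_size=PAGE_SIZE, struct_page_size=STRUCT_PAGE_SIZE):
    """Fraction of any capacity that its own memmap consumes."""
    return struct_page_size / page_size


if __name__ == "__main__":
    capacity = 256 * GIB
    print(f"memmap for 256GB: {memmap_bytes(capacity) // GIB} GiB")
    print(f"overhead: {memmap_fraction():.4%}")
```

The ratio is independent of capacity: it is simply the `struct page`
size divided by the page size.
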
On a system with regular 4KB pages and 256GB of memory - 4GB is
required just to describe/manage the memory. This is ~1.5% of the new
capacity to just surface it (4/256).

This becomes an issue if the memory is not intended for kernel-use,
as `struct page` memory must be allocated in non-movable, kernel
memory `zones`. If hot-plugged capacity is designated for a non-kernel
zone (ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.

Matthew Wilcox has a plan to reduce this cost, some details of his
plan:
https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@xxxxxxxxxxxxxxxxxxxx/

---------------------
Step 9: Memory Zones.
---------------------
We've alluded to "Memory Zones" in prior sections, with really the
only detail about these concepts being that there are
"Kernel-allocation compatible" and "Movable" zones, as well as some
relationship between memory blocks and memory zones.

The two zones we really care about are `ZONE_NORMAL` and
`ZONE_MOVABLE`. For the purpose of this reading we'll consider two
basic use-cases:
  - memory block hot-unplug
  - kernel resource allocation

You can (for the most part) consider these cases incompatible. If the
kernel allocates `struct page` memory from a block, then that block
cannot be hot-unplugged. This memory is typically unmovable (cannot
be migrated), and its pages are unlikely to be removed from the
memory map.

There are other scenarios, such as page pinning, that can block
hot-unplug. The individual mechanisms preventing hot-unplug are less
important than their relationship to memory zones.

ZONE_NORMAL basically allows any allocations, including things like
page tables, struct pages, and pinned memory.

ZONE_MOVABLE, under normal conditions, disallows most kernel
allocations.

ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
The kernel and privileged users can cause long-term pinning to occur -
even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
hot-unplug-ability under normal conditions.

Here's the take-away:

Any capacity marked SystemRAM but not Special Purpose during early
boot will be onlined into ZONE_NORMAL by default - making it available
for kernel-use during boot. There is no guarantee of being
hot-unpluggable.

Any capacity marked Special Purpose at boot, or hot-added
(physically), will be onlined into a user-selected zone (Normal or
Movable).

There are (at least) 4 ways to select which zone memory blocks are
onlined into.

  Build Time:    CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
  Boot Time:     memhp_default_state (boot parameter)
  udev / daxctl: user policy explicitly requesting the zone
  memory sysfs:
      echo online_movable > /sys/bus/memory/devices/memoryN/state

------------------------------------------
Nuance: memmap_on_memory and ZONE_MOVABLE.
------------------------------------------
As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
will consume ZONE_NORMAL capacity for its kernel resources. This can
be problematic if vast amounts of ZONE_MOVABLE capacity are added on a
system with limited ZONE_NORMAL capacity.

For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
ZONE_MOVABLE. This wouldn't work, as the entirety of ZONE_NORMAL would
be consumed to allocate `struct page` resources for the ZONE_MOVABLE
capacity - leaving no working memory for the rest of the kernel.

The `memmap_on_memory` configuration option allows for hotplugged
memory blocks to host their own `struct page` allocations... if
they're placed in ZONE_NORMAL.

To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.

Sparsemem allocation of memory map resources ultimately uses a
`kzalloc_node` call, which simply allocates memory from ZONE_NORMAL
with a *suggested* node.

```
memmap_alloc
  memblock_alloc_try_nid_raw
    memblock_alloc_internal
      memblock_alloc_range_nid
        kzalloc_node(..., GFP_KERNEL, ...)
```

The node ID passed in as an argument is a "preferred node", which
means that if insufficient space exists on that node to service the
GFP_KERNEL allocation, it will fall back to another node.

If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:

  1) A portion of the memory block is carved out to hold memmap data
     (reducing usable size by 64B*nr_pages)

  2) The memmap memory is allocated from ZONE_NORMAL on another node
     anyway.

Result: Lost capacity due to the unused carve-out area, for no value.

--------------------------------
The Complexity Story up til now.
--------------------------------
Platform and BIOS:
  May configure all the devices prior to kernel hand-off.
  May or may not support reconfiguring / hotplug.

BIOS and EFI:
  EFI_MEMORY_SP - used to defer management to drivers

Kernel Build and Boot:
  CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
  CONFIG_SPARSEMEM
  CONFIG_64BIT
  CONFIG_MEMORY_HOTPLUG
  CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
  CONFIG_MHP_MEMMAP_ON_MEMORY
  CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  CONFIG_MIGRATION
  CONFIG_MEMORY_HOTREMOVE
  CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
  nosoftreserve - Will always result in CXL as SystemRAM
  kexec - SystemRAM configs carry over to target
  memory_hotplug.memmap_on_memory

Driver Build Options Required
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM

User Policy
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
  memhp_default_state (boot param)
  daxctl online-memory daxN.Y (userland)

Nuances
  Early-boot resource re-use
  Memory Block Alignment
  memmap_on_memory + ZONE_MOVABLE

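With this many build options in play, it can help to check a kernel
config against the lists above rather than eyeball it. A small Python
sketch, assuming the config text comes from somewhere like
`/boot/config-<version>` (the location is distro-dependent, so treat
it as an assumption); `parse_kconfig` and `missing_options` are
hypothetical helper names:

```python
import re

# A few of the options from the summary above; extend as needed.
WANTED = [
    "CONFIG_SPARSEMEM",
    "CONFIG_MEMORY_HOTPLUG",
    "CONFIG_MEMORY_HOTREMOVE",
    "CONFIG_MIGRATION",
    "CONFIG_MHP_MEMMAP_ON_MEMORY",
    "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX_CXL",
    "CONFIG_DEV_DAX_KMEM",
]


def parse_kconfig(text):
    """Map CONFIG_* names to values; '# CONFIG_X is not set' maps to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_options(text, wanted=WANTED):
    """Return the wanted options that are absent or disabled in the config text."""
    options = parse_kconfig(text)
    return [name for name in wanted if options.get(name, "n") == "n"]
```

Feeding it the contents of a config file returns the options that
would need to be enabled (built-in or module) before the rest of this
series applies.
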
----------------------------------------------------
Next up:
  RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
  Interleave - RAS and Region Management

~Gregory