Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug

On Mon, Feb 17, 2025 at 12:05 PM Gregory Price <gourry@xxxxxxxxxx> wrote:
>
>
> The story up to now
> -------------------
> When we left the driver arena, we had created a dax device - which
> connects a Soft Reserved iomem resource to one or more `memory blocks`
> via the kmem driver.  We also discussed a bit about ZONE selection
> and default online behavior.
>
> In this section we'll discuss what actually goes into memory block
> creation, how those memory blocks are exposed to kernel allocators
> (tl;dr: sparsemem / memmap / struct page), and the implications of
> the selected memory zones.
>
>
> -------------------------------------
> Step 7: Hot-(un)plug Memory (Blocks).
> -------------------------------------
> Memory hotplug refers to surfacing physical memory to kernel
> allocators (page, slab, cache, etc) - as opposed to the action of
> "physically hotplugging" a device into a system (e.g. USB).
>
> Physical memory is exposed to allocators in the form of memory blocks.
>
> A `memory block` is an abstraction to describe a physically
> contiguous region of memory, or more explicitly a collection of
> physically contiguous page frames which is described by a physically
> contiguous set of `struct page` structures in the system memory-map.
>
> The system memmap is what is used for pfn-to-page (struct) and
> page(struct)-to-pfn conversions. The system memmap has `flat` and
> `sparse` modes (configured at build-time). Memory hotplug requires the
> use of `sparsemem`, which aptly makes the memory map sparse.
>
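
To make the pfn/section relationship concrete, here is a minimal Python sketch of the arithmetic sparsemem relies on. The constants assume x86_64 with 4KB pages (PAGE_SHIFT = 12) and 128MB sections (SECTION_SIZE_BITS = 27); other architectures use different values. The helper names only mirror the kernel's macros; this is a model, not kernel code.

```python
# Illustrative model of sparsemem address arithmetic (not kernel code).
# Assumed constants: x86_64, 4KB base pages, 128MB memory sections.
PAGE_SHIFT = 12
SECTION_SIZE_BITS = 27
PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT
PAGES_PER_SECTION = 1 << PFN_SECTION_SHIFT


def phys_to_pfn(phys_addr: int) -> int:
    """Page frame number that contains a physical address."""
    return phys_addr >> PAGE_SHIFT


def pfn_to_section_nr(pfn: int) -> int:
    """Index of the memory section that holds this pfn."""
    return pfn >> PFN_SECTION_SHIFT


def pfn_offset_in_section(pfn: int) -> int:
    """Index of the pfn's struct page within its section's memmap."""
    return pfn & (PAGES_PER_SECTION - 1)


if __name__ == "__main__":
    addr = 0x1_0000_0000  # 4GB, a hypothetical hot-added base address
    pfn = phys_to_pfn(addr)
    print(pfn, pfn_to_section_nr(pfn), pfn_offset_in_section(pfn))
```

Because each section carries its own memmap, a section that was never added simply has no `struct page` array, which is what lets the map be sparse and lets sections appear at hotplug time.
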
> Hot *remove* (un-plug) is distinct from Hot add (plug).  To hot-remove
> an active memory block, the pages in-use must have their data (and
> therefore mappings) migrated to another memory block. Hot-remove must
> be enabled separately from hotplug.
>
>
> Build configurations affecting memory block hot(un)plug
>   CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
>   CONFIG_SPARSEMEM
>   CONFIG_64BIT
>   CONFIG_MEMORY_HOTPLUG
>   CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
>   CONFIG_MHP_MEMMAP_ON_MEMORY
>   CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>   CONFIG_MIGRATION
>   CONFIG_MEMORY_HOTREMOVE
>
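
As a quick way to see which of these options a given kernel was built with, here is a hedged Python sketch that scans kernel config text. The option names come from the list above; the function name, the shortened option list, and the sample config are illustrative assumptions.

```python
# Sketch: report which hotplug-related options are enabled in a kernel
# config. Option names are copied from the list in the text; the function
# name and the sample config are illustrative assumptions.
HOTPLUG_OPTIONS = [
    "CONFIG_SPARSEMEM",
    "CONFIG_MEMORY_HOTPLUG",
    "CONFIG_MEMORY_HOTREMOVE",
    "CONFIG_MIGRATION",
    "CONFIG_MHP_MEMMAP_ON_MEMORY",
]


def enabled_options(config_text: str, wanted=HOTPLUG_OPTIONS) -> dict:
    """Map each wanted option to True if the config sets it to y or m."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# CONFIG_FOO is not set" counts as disabled
        name, sep, value = line.partition("=")
        if sep:
            values[name] = value
    return {opt: values.get(opt) in ("y", "m") for opt in wanted}


SAMPLE = """\
CONFIG_SPARSEMEM=y
CONFIG_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTREMOVE is not set
CONFIG_MIGRATION=y
"""

if __name__ == "__main__":
    for opt, on in enabled_options(SAMPLE).items():
        print(f"{opt}: {'enabled' if on else 'disabled'}")
```

On many distributions the running kernel's config text can be read from a file under /boot or from /proc/config.gz, but where it lives depends on how the kernel was packaged.
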
> During early-boot, the kernel finds all SystemRAM memory regions NOT
> marked "Special Purpose" and will create memory blocks for these
> regions by default.  These blocks are defaulted into ZONE_NORMAL
> (more on zones shortly).
>
> Memory regions present at boot marked `EFI_MEMORY_SP` have memory blocks
> created and hot-plugged by drivers.  The same mechanism is used to
> hot-add memory physically hotplugged after system boot (i.e. not present
> in the EFI Memory Map at boot time).
>
> The DAX/KMEM driver hotplugs memory blocks via the
>   `add_memory_driver_managed()`
> function.
>
>
> -------------------------------
> Step 8: Page Struct allocation.
> -------------------------------
> A `memory block` is made up of a collection of physical memory pages,
> which must have entries in the system Memory Map - which is managed by
> sparsemem on systems with memory (block) hotplug.  Sparsemem fills the
> memory map with `struct page` for hot-plugged memory.
>
> Here is a rough trace through the (current) stack on how page structs
> are populated into the system Memory Map on hotplug.
>
> ```
> add_memory_driver_managed
>   add_memory_resource
>     memblock_add_node
>       arch_add_memory
>         init_memory_mapping
>           add_pages
>             __add_pages
>               sparse_add_section
>                 section_activate
>                   populate_section_memmap
>                     __populate_section_memmap
>                       memmap_alloc
>                         memblock_alloc_try_nid_raw
>                           memblock_alloc_internal
>                             memblock_alloc_range_nid
>                               kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> All allocatable memory requires `struct page` resources to describe the
> physical page state.  On a system with regular 4KB pages and 256GB of
> memory, at 64 bytes per `struct page`, 4GB is required just to
> describe/manage the memory.
>
> This is ~1.5% of the new capacity just to surface it (4/256).
>
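
The arithmetic behind that figure can be sketched in a few lines of Python. It assumes a 64-byte `struct page` (typical for 64-bit builds, but configuration dependent) and 4KB base pages; the function name is illustrative.

```python
# Sketch of the memmap overhead arithmetic from the text.
# Assumption: sizeof(struct page) is 64 bytes, which is typical on 64-bit
# builds but depends on kernel configuration.
STRUCT_PAGE_BYTES = 64
PAGE_BYTES = 4096
GIB = 1 << 30


def memmap_bytes(capacity_bytes: int) -> int:
    """Bytes of struct page needed to describe capacity_bytes of memory."""
    nr_pages = capacity_bytes // PAGE_BYTES
    return nr_pages * STRUCT_PAGE_BYTES


if __name__ == "__main__":
    capacity = 256 * GIB
    overhead = memmap_bytes(capacity)
    print(f"{overhead / GIB:.1f} GiB, {100 * overhead / capacity:.4f}%")
```

The ratio is simply 64/4096, so it stays the same for any capacity as long as the page and `struct page` sizes hold.
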
> This becomes an issue if the memory is not intended for kernel-use,
> as `struct page` memory must be allocated in non-movable, kernel memory
> `zones`.  If hot-plugged capacity is designated for a non-kernel zone
> (ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
> ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.
>
> Matthew Wilcox has a plan to reduce this cost, some details of his plan:
> https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
> https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@xxxxxxxxxxxxxxxxxxxx/
>
>
> ---------------------
> Step 9: Memory Zones.
> ---------------------
> We've alluded to "Memory Zones" in prior sections, with really the only
> detail about these concepts being that there are "Kernel-allocation
> compatible" and "Movable" zones, as well as some relationship between
> memory blocks and memory zones.
>
> The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.
>
> For the purpose of this reading we'll consider two basic use-cases:
> - memory block hot-unplug
> - kernel resource allocation
>
> You can (for the most part) consider these cases incompatible.  If the
> kernel allocates `struct page` memory from a block, then that block cannot
> be hot-unplugged.  This memory is typically unmovable (cannot be migrated),
> and its pages are unlikely to be removed from the memory map.
>
> There are other scenarios, such as page pinning, that can block hot-unplug.
> The individual mechanisms preventing hot-unplug are less important than
> their relationship to memory zones.
>
> ZONE_NORMAL basically allows any allocations, including things like page
> tables, struct pages, and pinned memory.
>
> ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
>
> ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
> The kernel and privileged users can cause long-term pinning to occur -
> even in ZONE_MOVABLE.  It should be seen as a best-attempt at providing
> hot-unplug-ability under normal conditions.
>
>
> Here's the take-away:
>
> Any capacity marked SystemRAM but not Special Purpose during early boot
> will be onlined into ZONE_NORMAL by default - making it available for
> kernel-use during boot.  There is no guarantee of being hot-unpluggable.
>
> Any capacity marked Special Purpose at boot, or hot-added (physically),
> will be onlined into a user-selected zone (Normal or Movable).
>
> There are (at least) 4 ways to select which zone memory blocks are
> onlined into.
>
> Build Time:
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
> Boot Time:
>   memhp_default_state (boot parameter)
> udev / daxctl:
>   user policy explicitly requesting the zone
> memory sysfs:
>   echo online_movable > /sys/bus/memory/devices/memoryN/state
>
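
For the sysfs route, a minimal Python sketch follows. It assumes the per-block `state` attribute described in the kernel's memory-hotplug admin guide (accepting `online`, `online_kernel`, `online_movable`, `offline`) and root privileges on a real system. `online_block`, `sysfs_root`, and `demo` are illustrative names, and the demo writes to a temporary directory rather than real sysfs.

```python
# Sketch: request a zone when onlining a memory block through sysfs.
# Assumptions: the per-block `state` attribute from the kernel's
# memory-hotplug admin guide; root privileges on a real system.
# `online_block`, `sysfs_root`, and `demo` are illustrative names.
import tempfile
from pathlib import Path

VALID_STATES = {"online", "online_kernel", "online_movable", "offline"}


def online_block(block: int, state: str = "online_movable",
                 sysfs_root: str = "/sys/devices/system/memory") -> Path:
    """Write `state` to memory<block>/state and return the path written."""
    if state not in VALID_STATES:
        raise ValueError(f"unsupported state: {state}")
    path = Path(sysfs_root) / f"memory{block}" / "state"
    path.write_text(state)
    return path


def demo() -> str:
    """Exercise online_block against a throwaway directory, not real sysfs."""
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / "memory42").mkdir()
        return online_block(42, "online_movable", sysfs_root=root).read_text()


if __name__ == "__main__":
    print(demo())
```

On a real system the write can fail (for example if the block cannot be onlined into the requested zone), so a caller would want to handle `OSError` as well.
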
>
> ------------------------------------------
> Nuance: memmap_on_memory and ZONE_MOVABLE.
> ------------------------------------------
> As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
> will consume ZONE_NORMAL capacity for its kernel resources.  This can
> be problematic if a large amount of ZONE_MOVABLE capacity is added on
> a system with limited ZONE_NORMAL capacity.
>
> For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
> ZONE_MOVABLE.  This wouldn't work, as the entirety of ZONE_NORMAL would
> be consumed to allocate `struct page` resources for the ZONE_MOVABLE
> capacity - leaving no working memory for the rest of the kernel.
>
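
The example above can be checked with a small Python sketch. It is deliberately simplified: it assumes 64-byte `struct page` entries and 4KB pages, and it ignores ZONE_NORMAL's own memmap and every other boot-time consumer. The function name is illustrative.

```python
# Sketch: how much ZONE_NORMAL is left after it hosts the memmap for a
# given amount of ZONE_MOVABLE capacity. Simplified model, not kernel code.
STRUCT_PAGE_BYTES = 64   # assumed sizeof(struct page)
PAGE_BYTES = 4096        # assumed base page size
GIB = 1 << 30


def normal_left_after_memmap(normal_bytes: int, movable_bytes: int) -> int:
    """ZONE_NORMAL bytes left after hosting the memmap for movable_bytes.

    A result <= 0 means the memmap alone would exhaust ZONE_NORMAL.
    """
    memmap = (movable_bytes // PAGE_BYTES) * STRUCT_PAGE_BYTES
    return normal_bytes - memmap


if __name__ == "__main__":
    for normal_gib in (4, 16):
        left = normal_left_after_memmap(normal_gib * GIB, 256 * GIB)
        print(f"{normal_gib} GiB normal -> {left / GIB:.0f} GiB left")
```
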
> The `memmap_on_memory` configuration option allows for hotplugged memory
> blocks to host their own `struct page` allocations...
>
>                    if they're placed in ZONE_NORMAL.
>
> To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.
>
> Sparsemem allocation of memory map resources ultimately uses a
> `kzalloc_node` call, which simply allocates memory from ZONE_NORMAL with
> a *suggested* node.
>
> ```
> memmap_alloc
>   memblock_alloc_try_nid_raw
>     memblock_alloc_internal
>       memblock_alloc_range_nid
>         kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> The node ID passed in as an argument is a "preferred node", which means
> if insufficient space exists on that node to service the GFP_KERNEL
> allocation, it will fall back to another node.
>
> If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
>
>   1) A portion of the memory block is carved out to hold memmap
>      data (reducing usable size by 64b*nr_pages)
>
>   2) The memmap memory is allocated from ZONE_NORMAL on another node.

Nice write-up, thanks for putting everything together. A follow-up
question on this: do you mean the memmap memory will show up as a new
node containing only ZONE_NORMAL, alongside the other hot-plugged
memory blocks? So we would actually see two nodes hot-plugged?

Thanks,
Yang

>
> Result: Lost capacity due to the unused carve-out area for no value.
>
> --------------------------------
> The Complexity Story up til now.
> --------------------------------
> Platform and BIOS:
>   May configure all the devices prior to kernel hand-off.
>   May or may not support reconfiguring / hotplug.
>
> BIOS and EFI:
>   EFI_MEMORY_SP              - used to defer management to drivers
>
> Kernel Build and Boot:
>   CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
>   CONFIG_SPARSEMEM
>   CONFIG_64BIT
>   CONFIG_MEMORY_HOTPLUG
>   CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
>   CONFIG_MHP_MEMMAP_ON_MEMORY
>   CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>   CONFIG_MIGRATION
>   CONFIG_MEMORY_HOTREMOVE
>   CONFIG_EFI_SOFT_RESERVE=n  - Will always result in CXL as SystemRAM
>   nosoftreserve              - Will always result in CXL as SystemRAM
>   kexec                      - SystemRAM configs carry over to target
>   memory_hotplug.memmap_on_memory
>
> Driver Build Options Required
>   CONFIG_CXL_ACPI
>   CONFIG_CXL_BUS
>   CONFIG_CXL_MEM
>   CONFIG_CXL_PCI
>   CONFIG_CXL_PORT
>   CONFIG_CXL_REGION
>   CONFIG_DEV_DAX
>   CONFIG_DEV_DAX_CXL
>   CONFIG_DEV_DAX_KMEM
>
> User Policy
>   CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
>   CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
>   memhp_default_state                  (boot param)
>   daxctl online-memory daxN.Y          (userland)
>
> Nuances
>   Early-boot resource re-use
>   Memory Block Alignment
>   memmap_on_memory + ZONE_MOVABLE
>
> ----------------------------------------------------
> Next up:
>   RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
>   Interleave - RAS and Region Management
>
> ~Gregory
>




