Of course, the whole purpose of using CXL memory is to allocate it.
So let's talk about a real use case!

    char* a_page = malloc(4096);
    a_page[0] = '\x42'; /* Page fault and allocation occurs */

Congrats, you may or may not have allocated CXL memory!

Fin.
-----------------------------------------------------------------------
Ok, in all seriousness, the intent is hopefully to make this whole
thing as transparent as possible - but there's quite a bit of
complexity in how this is done while keeping things reasonably
performant.

This section won't cover the general intricacies of how page
allocation works in the kernel; for that I would recommend Lorenzo
Stoakes' book: The Linux Memory Manager. Notably, however, much of the
content of that book concerned with Nodes and Zones was written
pre-CXL.

For the sake of this section we'll focus on how additional nodes and
tiers *affect* allocations - whatever their mechanism (faults, file
access, explicit, etc). That means I expect you'll at least have a
base level understanding of virtual memory and allocation-on-fault
behavior. Most of what we're talking about here is reclaim and
migration - not page faults.

--------------------------------
Nodes, Tiers, and Zones - Oh My!
--------------------------------

==========
NUMA Nodes
==========

A NUMA node can *tacitly* be thought of as a "collection of
homogeneous resources". This is a fancy way of saying "All the memory
on a given node should have the same performance characteristics."

As we saw in Sections 0 and 1, however, nodes are constructed quite
arbitrarily. All that truly matters is how your platform vendor has
chosen to associate devices with "Proximity Domains" in the various
ACPI tables. I'll stick with my moderately sane, and moderately wrong,
definition.

Let's consider a system with 2 sockets and 1 CXL device attached to a
host bridge on each socket.

```
         socket-interconnect
                  |
DRAM -- CPU0--------------CPU1 -- DRAM
         |                 |
        CXL0              CXL1
```

The "Locality" information for these devices is built into the ACPI
SLIT (System Locality Information Table). For example (caveat - fake!):

```
Signature    : "SLIT"    [System Locality Information Table]
...
Localities   : 0000000000000004
Locality   0 : 10 20 20 30
Locality   1 : 20 10 30 20
Locality   2 : FF FF 10 FF
Locality   3 : FF FF FF 10
```

This is what shows up via the `numactl -H` command:

```
$ numactl -H
node distances:
node    0    1    2    3
  0:   16   32   32   48
  1:   32   16   48   32
  2:  255  255   16  255
  3:  255  255  255   16

^^^ 255 typically means a node can't initiate access... typically
    i.e. "has no processors"
```

These "Locality" values are "Abstract Distances" - i.e. fancy lies
from a black box piece of code that portends to describe something
useful. You may think NUMA nodes have a clean topological relationship
like so:

```
node0--------------node1
  |                  |
node2              node3
```

In reality, all Linux knows is that these are "relative distances".

```
Node 0 distance from other nodes: 0->[1,2]->3
Node 1 distance from other nodes: 1->[0,3]->2
```

Why does this matter? Let's imagine a Node 0 CPU allocates a page, but
Node 0 is out of memory. Which node should Node 0 fall back to
allocate from? In our example above, nodes `[1,2]` seem like equally
good options. In reality, the cross-socket interconnect will usually
be classified as "closer" than a CXL device.
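If you'd rather interrogate this matrix programmatically than parse
`numactl -H` output, libnuma exposes it directly. Here's a minimal
sketch (the file name is mine; assumes libnuma is installed - build
with `gcc dist.c -o dist -lnuma`):

```
/* dist.c - print the kernel's view of NUMA node distances via libnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not available here\n");
                return 1;
        }

        int max = numa_max_node();

        /* header row */
        printf("node");
        for (int j = 0; j <= max; j++)
                printf("%5d", j);
        printf("\n");

        /* one row per node, mirroring the numactl -H distance matrix */
        for (int i = 0; i <= max; i++) {
                printf("%4d", i);
                for (int j = 0; j <= max; j++)
                        printf("%5d", numa_distance(i, j));
                printf("\n");
        }
        return 0;
}
```

This prints the same matrix `numactl -H` does - and that matrix really
is the sum total of what the kernel knows about your topology.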
You can expect the following to be more realistic.

```
$ numactl -H
node distances:
node    0    1    2    3
  0:   16   32   48   64
  1:   32   16   64   48
  2:  255  255   16  255
  3:  255  255  255   16

Node 0 distance perspective: 0->1->2->3
Node 1 distance perspective: 1->0->3->2
```

Which makes sense, because typically the cross-socket interconnect
will be faster than a CXL link, and for Node0 to access Node3, it must
cross both interconnects.

`Memory Tiers`, however, are quite a bit different.

=============
Memory Tiers.
=============

Memory tiers collect all similar-performance devices into a single
"Tier". These tiers can be inspected via sysfs:

[/sys/devices/virtual/memory_tiering/]# ls
memory_tier4  memory_tier961
[/sys/devices/virtual/memory_tiering/]# cat memory_tier4/nodelist
0-1
[/sys/devices/virtual/memory_tiering/]# cat memory_tier961/nodelist
2-3

On our example 2-socket system, both sockets hosting local DRAM would
get lumped into the same tier, while both CXL devices would get lumped
into the same tier (let's assume they have the same latency/bandwidth).

```
memory_tier4--------------------
    /    \                     |
node0   node1                  |
                        memory_tier961
                            /    \
                         node2  node3
```

This relationship, ostensibly, provides a quick and easy way to
determine a rough performance-based relationship between nodes. This
is, ostensibly, useful if you want to do memory tiering (demotion
and/or promotion). More on this in a bit.

Tiers are created based on performance data - typically provided by
the HMAT or CXL CDAT data. The memory-tiers component treats
socket-attached DRAM as the baseline, and generates its own abstract
distance (different than the SLIT abstract distance!).

```
int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
{
        ... snip ...
        /*
         * The abstract distance of a memory node is in direct proportion to
         * its memory latency (read + write) and inversely proportional to its
         * memory bandwidth (read + write). The abstract distance, memory
         * latency, and memory bandwidth of the default DRAM nodes are used as
         * the base.
         */
        *adist = MEMTIER_ADISTANCE_DRAM *
                (perf->read_latency + perf->write_latency) /
                (default_dram_perf.read_latency + default_dram_perf.write_latency) *
                (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
                (perf->read_bandwidth + perf->write_bandwidth);

        return 0;
}
```
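To make that integer math concrete, here's a userspace
re-implementation of the formula with invented numbers. The DRAM/CXL
performance values below are illustrative assumptions, not real
HMAT/CDAT data; the two constants are mirrored from
include/linux/memory-tiers.h as of this writing:

```
/* adist.c - back-of-envelope memory-tiers abstract distance */
#include <stdio.h>

#define MEMTIER_CHUNK_BITS      4       /* tier id = adist >> 4 */
#define MEMTIER_ADISTANCE_DRAM  72      /* 72 >> 4 == 4, hence memory_tier4 */

struct perf { long rd_lat, wr_lat, rd_bw, wr_bw; };

/* Same operation order as mt_perf_to_adistance() - integer math matters */
static int adistance(struct perf *base, struct perf *p)
{
        return MEMTIER_ADISTANCE_DRAM *
                (p->rd_lat + p->wr_lat) / (base->rd_lat + base->wr_lat) *
                (base->rd_bw + base->wr_bw) / (p->rd_bw + p->wr_bw);
}

int main(void)
{
        struct perf dram = { 100, 100, 100000, 100000 }; /* made up: ns, MB/s */
        struct perf cxl  = { 300, 300,  30000,  30000 }; /* ~3x lat, ~1/3 bw  */

        int adist = adistance(&dram, &cxl);
        printf("CXL adistance: %d (memory_tier%d)\n",
               adist, adist >> MEMTIER_CHUNK_BITS);
        return 0;
}
```

With these made-up numbers the CXL node gets an adistance of 720 -
i.e. memory_tier45 - versus DRAM's baseline of 72 in memory_tier4.
Slower and narrower means a bigger adistance, and each tier is just a
16-wide chunk of the adistance space.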
The memory-tier component also provides a demotion-target mechanism,
which creates a recommended demotion target for a given node.

```
/**
 * next_demotion_node() - Get the next node in the demotion path
 * @node: The starting node to lookup the next node
 *
 * Return: node id for next memory node in the demotion path hierarchy
 * from @node; NUMA_NO_NODE if @node is terminal. This does not keep
 * @node online or guarantee that it *continues* to be the next demotion
 * target.
 */
int next_demotion_node(int node);
```

The node_demotion map uses... SLIT-provided node abstract distances to
determine the target!

```
/*
 * node_demotion[] examples:
 *
 * Example 1:
 *
 * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
 *
 * node distances:
 * node   0    1    2    3
 *    0  10   20   30   40
 *    1  20   10   40   30
 *    2  30   40   10   40
 *    3  40   30   40   10
 *
 * memory_tiers0 = 0-1
 * memory_tiers1 = 2-3
 *
 * node_demotion[0].preferred = 2
 * node_demotion[1].preferred = 3
 * node_demotion[2].preferred = <empty>
 * node_demotion[3].preferred = <empty>
 * ...
 */
```

As of 03/13/2025, there is no `next_promotion_node()` counterpart to
this function. As we'll probably learn in a later section:
Promotion Is Hard.(TM)

There is at least an interface to tell you whether a node is toptier:

    bool node_is_toptier(int node);

The grand total of interfaces you need to know for the remainder of
this section is exactly:

    next_demotion_node(int node)

I may consider another section on Memory Tiering in the future, but
this is sufficient for now.

< unwarranted snark >
"But Greg", you say, "it seems to me that memory-tiers as designed are
of dubious value considering hardware interleave as described in
Section 4 already combines multiple devices into a single node - and
lumping remote nodes together regardless of socket-relationship is at
best misleading and at worst may actively cause performance
regressions!"

Very astute observation. Maybe we should rethink this component a bit.
< / unwarranted snark >

=============
Memory Zones.
=============

In Section 3 (Memory Hotplug) we briefly discussed memory zones. For
the purpose of this section, all we need to know is how these zones
impact allocation. I will largely quote the Linux kernel docs here.

```
* ZONE_NORMAL is for normal memory that can be accessed by the kernel
  all the time. DMA operations can be performed on pages in this zone
  if the DMA devices support transfers to all addressable memory.
  ZONE_NORMAL is always enabled.

* ZONE_MOVABLE is for normal accessible memory, just like ZONE_NORMAL.
  The difference is that the contents of most pages in ZONE_MOVABLE are
  movable. That means that while virtual addresses of these pages do
  not change, their content may move between different physical pages.
```

note: ZONE_NORMAL allocations MAY be movable. ZONE_MOVABLE must be.
      (for some definition of `must`, suspend disbelief for now)

We generally don't want kernel resources incidentally finding
themselves on CXL memory (a highly contended lock landing on
far-memory by complete happenstance would be absolutely tragic).
However, there aren't many mechanisms to prevent this from occurring.

While the kernel may *explicitly* allocate ZONE_MOVABLE memory via
special interfaces, a typical `kmalloc()` call will utilize
ZONE_NORMAL memory, as most kernel allocations are not guaranteed to
be movable. That means most kernel allocations, should they happen to
land on a remote node, are *stuck there*.

For most use cases, then, we will want CXL memory onlined into
ZONE_MOVABLE, because we'd like the option to migrate memory off of
these devices for a variety of reasons. The most obvious mechanism to
prevent the kernel from using CXL memory is to online CXL memory in
ZONE_MOVABLE.

However, ZONE_MOVABLE isn't without drawbacks. For example...
Gigantic (1GB) pages are not allocable from ZONE_MOVABLE. Many
hypervisors utilize Gigantic pages to limit TLB pressure. That means,
for now, VM use cases must pick between incidental kernel use, and
Gigantic page use.

This is all to say: Memory zone configuration affects your
performance.

----------------
Page Allocation.
----------------

There are a variety of ways pages can be allocated; for this section
we'll just focus on a plain old allocate-on-fault interaction.

    char* page = malloc(4096);
    page[0] = '\x42'; /* Page fault, kernel allocates a page */

Assuming no special conditions (memory pressure, mempolicies, special
prefetchers - whatever), the default allocation policy in Linux is to
allocate a page from the same node as the accessing CPU. So if, in our
example, we are running on a node1 CPU and hit the above page fault,
we'll allocate from the node1 DRAM.

```
Default allocation policy:

                   access
                  /      \- allocation
DRAM -- CPU0------CPU1 -- DRAM
         |          |
         |          |
        CXL0       CXL1
```
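You can watch this policy from userspace. Here's a minimal sketch (the
file name is mine; it assumes libnuma's headers for the
get_mempolicy(2) wrapper - build with `gcc whereis.c -o whereis
-lnuma`) that faults a page in and asks the kernel which node it
landed on:

```
/* whereis.c - fault a page in, then ask the kernel where it lives */
#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* mmap rather than malloc so we get a never-touched page */
        char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        page[0] = '\x42'; /* Page fault, kernel allocates a page */

        /* MPOL_F_NODE | MPOL_F_ADDR returns the node backing `page` */
        int node = -1;
        if (get_mempolicy(&node, NULL, 0, page,
                          MPOL_F_NODE | MPOL_F_ADDR)) {
                perror("get_mempolicy");
                return 1;
        }
        printf("CPU %d faulted the page onto node %d\n",
               sched_getcpu(), node);
        return 0;
}
```

Run it as `numactl --cpunodebind=1 ./whereis` and, absent memory
pressure, the page should land on node1's DRAM.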
Simple and, dare I say, elegant - really. This is, of course, assuming
we have no special conditions - of which there are, of course, many.

================
Memory Pressure.
================

Let's assume DRAM on node1 is pressured, and there is insufficient
headroom to allocate a page on node1. What should we do? We have a few
options.

1) Fall back to another node
2) Attempt to steal a physical page from someone else. ("reclaim")

Let's assume reclaim doesn't exist for a moment. What node should we
fall back to? One might assume we would consider attempting to
allocate based on the NUMA node topology. For example:

```
 * node distances:
 * node   0    1    2    3
 *    0  10   20   30   40
 *    1  20   10   40   30
 *    2  30   40   10   40
 *    3  40   30   40   10
```

In this topology, Node1 would prefer allocating from Node0 as a
secondary source, and subsequently fall back to Node3 and Node2 as
those nodes become pressured. This is basically what happens.

But is that what we want? If a page is being allocated, it is almost
by definition "hot", and so this has led the kernel to conclude that -
generally speaking - new allocations should be local unless explicitly
requested otherwise. So instead, by default, we will start engaging
the reclaim system.

================================
Reclaim - LRU, Zones, and Nodes.
================================

In the scenario where memory is pressured and reclaim is in use, Linux
will go through a variety of phases based on watermarks (Min, Low,
High memory availability). These watermarks are used to determine when
reclaim should run and when the system should block further
allocations to ensure the kernel has sufficient headroom to make
forward progress.

An allocation may cause a kernel daemon to start moving pages through
the LRU (least-recently-used) mechanism, or it may cause the task
itself to engage in the process. The reclaim system may choose to swap
pages to disk or to demote pages from the local node to a remote node.

The key piece here is understanding the main LRU types and their
relationship to zones and nodes.

```
               _______________node_______________
              /                                  \
       ZONE_NORMAL                          ZONE_MOVABLE
       /         \                          /          \
 active LRU   inactive LRU            active LRU   inactive LRU
```

Typically reclaim is engaged when attempting an allocation and the
requested zone hits a low or min watermark. On our imaginary system,
let's assume we've set up the following structure.

```
 node0 - DRAM                       node2 - CXL
      |                                  |
 ZONE_NORMAL                        ZONE_MOVABLE
  /        \                         /        \
active_lru  inactive_lru       active_lru  inactive_lru
```

node0 (local) has no ZONE_MOVABLE, and node2 has no ZONE_NORMAL. Since
we always prefer allocations from the local node, we'll evict pages
from ZONE_NORMAL on node0 - that's the only zone we can allocate from.
Specifically, reclaim will prefer to evict pages from the inactive_lru
and "age off" pages from the active_lru to the inactive_lru.

If reclaim fails, it may then fail to allocate from the requested node
and fall back to another node to continue forward progress. (or maybe
OOM, or some other nebulous conditions - it's all really quite complex
and well documented in Lorenzo's book, highly recommended!)
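Those watermarks aren't hidden, by the way - /proc/zoneinfo reports
min/low/high (in pages) for every zone on every node. Here's a rough
sketch that pulls them out; the parsing assumes the current
/proc/zoneinfo layout, so treat it as illustrative:

```
/* watermarks.c - dump per-zone min/low/high watermarks (in pages) */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/zoneinfo", "r");
        if (!f) {
                perror("/proc/zoneinfo");
                return 1;
        }

        char line[256], node[32], zone[32];
        long val;

        while (fgets(line, sizeof(line), f)) {
                /* "Node 0, zone   Normal" headers start each zone block */
                if (sscanf(line, "Node %31[^,], zone %31s", node, zone) == 2)
                        printf("node %s %-12s", node, zone);
                /* the watermark lines are bare "min/low/high <pages>" */
                else if (sscanf(line, " min %ld", &val) == 1)
                        printf(" min=%-8ld", val);
                else if (sscanf(line, " low %ld", &val) == 1)
                        printf(" low=%-8ld", val);
                else if (sscanf(line, " high %ld", &val) == 1)
                        printf(" high=%ld\n", val);
        }
        fclose(f);
        return 0;
}
```

Rule of thumb: when a zone's free pages dip below `low`, kswapd wakes
up and starts aging the LRUs; dip below `min` and allocating tasks are
pressed into direct reclaim.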
==================
Swap vs. Demotion.
==================

By default, the reclaim system will only age pages from active to
inactive LRUs, and then move to evict pages from the inactive LRU
(possibly engaging swap or just nixing read-only page mappings).

However, reclaim can be configured to *demote* pages as well via the
sysfs option:

    $ echo 1 > /sys/kernel/mm/numa/demotion_enabled

In this scenario, rather than evict from the inactive LRU to swap, we
can demote a page from its current node to its closest demotion
target.

```
mm/vmscan.c:

static bool can_demote(int nid, struct scan_control *sc)
{
        if (!numa_demotion_enabled)
                return false;
        if (sc && sc->no_demotion)
                return false;
        if (next_demotion_node(nid) == NUMA_NO_NODE)
                return false;
        return true;
}

/*
 * Take folios on @demote_folios and attempt to demote them to another node.
 * Folios which are not demoted are left on @demote_folios.
 */
static unsigned int demote_folio_list(struct list_head *demote_folios,
                                      struct pglist_data *pgdat)
{
        ...
        /* Demotion ignores all cpuset and mempolicy settings */
        migrate_pages(demote_folios, alloc_migrate_folio, NULL,
                      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
                      &nr_succeeded);
}
```

Notice that the comment here says "Demotion ignores all cpuset and
mempolicy settings". That means if you turn this setting on, and you
require strong cpuset.mems isolation, you're in for a surprise!
Another fun nuance to trip over.

======================================
Kernel Memory Tiering - A Short Story.
======================================

The story so far:

    In the beginning [Memory Tiering] was created.
    This has made a lot of people very angry and
    been widely regarded as a bad move.

        ~ Douglas Adams, The Restaurant at the End of the [CXL Fabric]

There isn't a solid consensus on how memory tiering should be
implemented in the kernel, so I will refrain from commenting on the
various proposals for now. This likely deserves its own section which
tumbles over 6 or 7 different RFCs in varying states - and ever so
slightly misrepresents the work just enough to confuse everyone.
Let's not, for now.

So I will leave it here: Most of these systems aim at 3 goals:

1) create space on the local nodes for new allocations
2) demote cold memory from local nodes to make room for hot memory
3) promote hot memory from remote nodes to reduce average latencies.

No one (largely) agrees on what the best approach is, yet.

If I were to make one request before anyone proposes yet *another*
tiering mechanism, I would ask that you take a crack at implementing
`next_promotion_node()` first.

```
/**
 * next_promotion_node() - Get the next node in the promotion path
 * @node: The starting node to lookup the next node
 *
 * Return: node id for next memory node in the promotion path hierarchy
 * from @node; NUMA_NO_NODE if @node is top tier. This does not keep
 * @node online or guarantee that it *continues* to be the next promotion
 * target.
 */
int next_promotion_node(int node);
```

That's the whole ballgame.

Fin.
-----------------------------------------------------------------------
Yes, that's basically it. The kernel prefers to allocate new pages
from the local node, and it tries to demote memory to make sure this
can happen. Otherwise - incidental direct allocation can occur on
fallback.

But how you configure your CXL memory dictates all this behavior. So
it's extremely important that we get the configuration part right.

This will be the end of the Boot to Bash series for the purpose of
LSFMM 2025 background. We will likely continue in a github repo or
something from here on.

See you all in Montreal.

~Gregory