Introduction
============

If we think application specific memory (including inter-host shared memory)
is a thing, it will also be a thing people want to use with virtual machines,
potentially nested. So how do we present it at the Host to VM boundary?

This RFC is perhaps premature given we haven't yet merged upstream support for
the bare metal case. However I'd like to get the discussion going, given we've
touched briefly on this in a number of CXL sync calls and it is clear no one
is entirely sure what direction makes sense. We may briefly touch on this in
the LPC CXL uconf, but time will be very limited.

The aim here isn't to promote a particular path, but just to describe the
problem and some potential solutions. It may be obvious which one I think is
easiest, but that may be a case of "I have that hammer, so I will hit things
with it". It's also the case that we may not converge on a single solution and
may end up with several supported. That's not a problem as long as there isn't
significant extra maintenance burden etc. There are subtle differences between
likely deployments that may make certain solutions more attractive than
others.

Whilst I'm very familiar with the bare metal CXL part of this, I'm less
familiar with the virtual machine and MM elements. Hence I'm hoping to get
inputs from David Hildenbrand, particularly around virtio-mem as an option,
and from many others to help fill in some of the gaps in information. I'd also
like inputs from those, like John Groves, who are looking at inter-host
sharing. I've also cc'd the QEMU list, given all these solutions are likely to
involve some additional emulation work and QEMU is my preferred choice for a
reference implementation. I've almost certainly forgotten someone, so please
do +CC others.

Background
==========

Skip this if you already know all about CXL or similar memory pooling
technologies. I've skipped over many of the details, because they hopefully
don't matter for the core of the questions posed. I'm happy to provide more
detail though if this isn't detailed enough.

Memory pool devices
-------------------

CXL and similar technologies bring the option of having an 'appliance' that
provides disaggregated memory to a number of servers with moderately low
latency overhead compared to local memory. Typically these are multi-head
devices directly connected to Root Ports of a number of different hosts. This
design avoids the latency cost of a switched fabric. Experimental deployments
suggest ratios of around 1 memory pool to 16 CPU sockets.
[Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, Li et al.,
ASPLOS '23]

In some deployments, each socket has its own connection, allowing
lowish-latency (perhaps 1.5x typical inter-socket), highish-bandwidth memory
expansion. Interleave can further boost the host to appliance bandwidth at the
cost of reducing the number of hosts sharing a single appliance.

 __________________     __________________     __________________
|      Host A      |   |      Host B      |   |      Host C      |
|                  |   |                  |   |                  |
|   ____    ____   |   |   ____    ____   |   |   ____    ____   |
|__|_RP_|__|_RP_|__|   |__|_RP_|__|_RP_|__|   |__|_RP_|__|_RP_|__|
     ||      ||             ||      ||             ||      ||
 ____||______||_____________||______||_____________||______||____
|                                                                 |
|                        Memory Appliance                         |
|_________________________________________________________________|

CXL memory pooling options
--------------------------

CXL 2.0 provided basic memory pool facilities via hot-plug of entire devices.
This is both expensive to do and inflexible, so not well suited to memory
appliance applications.
CXL 3.0 and onwards introduced Dynamic Capacity in what is known as a Dynamic
Capacity Device (DCD). We'll need a few key terms:

Host Physical Address (HPA). Region of the host system address map where reads
and writes will be routed to a particular CXL host bridge. This is considered
a fixed mapping (it may be changeable in BIOS) and is presented to the host OS
via an ACPI table. These windows are called CXL Fixed Memory Windows (CFMWs).
Yes, I'm being lazy here and HPA may not quite be the same as the view a CPU
sees, but that's a detail we don't care about here.

Device Physical Address (DPA). Confusingly this isn't necessarily the
addressing used on a device to access a particular physical address in the
DRAM chips, but rather a presentation of the device memory to a particular
host. There may be another level of translation underneath (this detail will
matter later).

Host Managed Device Memory Decoders (HDM Decoders). Programmable address
routers that control the routing of a CXL transaction.

Extents - Contiguous regions of DPA space (offset + size).

Key elements of DCD usage
-------------------------

Device to host address routing is not often changed. Typically it is set up at
boot, either in host firmware, or once the operating system has started. That
is, we'll probably program all the HDM Decoders once per boot. They may be
left in a state where the host can reprogram them, or locked down.

Regions of the DPA space that these decoders are routing accesses to may not
be backed by anything. A slight simplification is that these unbacked
addresses read zero, and writes are dropped.

Later on, some magic entity - let's call it an orchestrator - will tell the
memory pool to provide memory to a given host. The host gets notified by the
device of an 'offer' of specific memory extents and can accept it, after which
it may start to make use of the provided memory extents. Those address ranges
may be shared across multiple hosts (in which case they are not for general
use), or may be dedicated memory intended for use as normal RAM.

Whilst the granularity of DCD extents is allowed by the specification to be
very fine (64 bytes), in reality my expectation is that no one will build
general purpose memory pool devices with fine granularity.

Memory hot-plug options (bare metal)
------------------------------------

By default, these extents will surface as either:

1) Normal memory hot-plugged into a NUMA node.
2) DAX - requiring applications to map that memory directly, or use a
   filesystem etc.

There are various ways to apply policy to this. One is to base the policy
decision on a 'tag' that is associated with a set of DPA extents. That 'tag'
is metadata that originates at the orchestrator. It's big enough to hold a
UUID, so it can convey whatever meaning is agreed by the orchestrator and the
software running on each host.

Memory pools tend to want a guarantee that, when circumstances change (a
workload finishes etc.), they can have the resources they allocated back. CXL
brings polite ways of asking for the memory back and big hammers for when the
host ignores things (which may well crash a naughty host). Reliable hot-unplug
of 'normal' memory continues to be a challenge because not all of its use /
lifetime is tied to a particular application.

Application specific memory
---------------------------

The DAX path enables association of the memory with a single application by
allowing that application to simply mmap the appropriate /dev/daxX.Y. That
device optionally has an associated tag.
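As a rough illustration of that path (this is not part of any current
proposal; the /dev/dax0.0 node name, the mapping size, and the assumption that
something else has already resolved the tag to a device node are all made up
for the example), the application side might look something like:

/* Minimal sketch: map an application specific memory DAX device. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Hypothetical device node - in reality chosen by looking up the tag. */
        const char *path = "/dev/dax0.0";
        /* Illustrative size - must respect the DAX region's alignment. */
        size_t len = 1UL << 30;
        int fd = open(path, O_RDWR);

        if (fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }

        /* Device DAX mappings must be MAP_SHARED. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return EXIT_FAILURE;
        }

        /* The workload then uses the pooled capacity directly, e.g. to hold
         * its in-memory data structures. */
        memset(buf, 0, len);

        /* Unmapping and closing is the point at which the capacity can be
         * reliably handed back (see below). */
        munmap(buf, len);
        close(fd);
        return EXIT_SUCCESS;
}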
When the application closes or otherwise releases that memory, we can
guarantee to be able to recover the capacity. Memory provided to an
application this way will be referred to here as Application Specific Memory.
This model also works for HBM or other 'better' memory that is reserved for
specific use cases.

So the flow is something like:

1. Cloud orchestrator decides it's going to run in-memory database A on
   host W.
2. Memory appliance Z is told to 'offer' 1TB of memory to host W with
   UUID / tag wwwwxxxxzzzz.
3. Host W accepts that memory (why would it say no?) and creates a DAX device
   for which the tag is discoverable.
4. Orchestrator tells host W to launch the workload and that it should use the
   memory provided with tag wwwwxxxxzzzz.
5. Host launches the DB and tells it to use the DAX device with tag
   wwwwxxxxzzzz, which the DB then mmap()s and loads its database data into.

... sometime later ...

6. Orchestrator tells host W to close that DB and release the memory allocated
   from the pool.
7. Host gives the memory back to the memory appliance, which can then use it
   to provide another host with the necessary memory.

This approach requires applications, or at least memory allocation libraries,
to be modified. The guarantees of getting the memory they asked for, plus
being certain they can safely give the memory back when done, may make such
software modifications worthwhile. There are disadvantages and bloat issues if
the 'wrong' amount of memory is allocated to the application, so these
techniques only work when the orchestrator has the necessary information about
the workload.

Note that one specific example of application specific memory is virtual
machines; in this case the virtual machine is the application. Later on it may
be useful to consider the example of the specific application in a VM being a
nested virtual machine.

Shared Memory - closely related!
--------------------------------

CXL enables a number of different types of memory sharing across multiple
hosts:

- Read-only shared memory (suitable for Apache Arrow, for example).
- Hardware-coherent shared memory.
- Software-managed coherency.

These surface using the same machinery as non-shared DCD extents. Note however
that the presentation, in terms of extents, to different hosts is not the same
(they can see different extents, in an unrelated order), but along with tags,
shared extents carry sufficient data to 'construct' a virtual address to HPA
mapping that makes them look the same to aware applications or filesystems.
The current proposed approach is to surface the extents via DAX and apply a
filesystem approach to managing the data.
https://lpc.events/event/18/contributions/1827/

These two types of memory pooling activity (shared memory, application
specific memory) both require capacity associated with a tag to be presented
to specific users in a fashion that is 'separate' from normal memory hot-plug.

The virtualization question
===========================

Having made the assumption that the models above are going to be used in
practice, and that Linux will support them, the natural next step is to assume
that applications designed against them are going to be used in virtual
machines as well as on bare metal hosts. The open question this RFC is aiming
to start discussion around is how best to present them to the VM. I want to
get that discussion going early because some of the options I can see will
require specification additions and / or significant PoC / development work to
prove them out.
Before we go there, let us briefly consider other uses of pooled memory in VMs
and why they aren't really relevant here.

Other virtualization uses of memory pool capacity
-------------------------------------------------

1. Part of the static capacity of a VM provided from a memory pool. Can be
   presented as a NUMA setup, with HMAT etc. providing performance data
   relative to other memory the VM is using. Recovery of pooled capacity
   requires shutting down or migrating the VM.

2. Coarse-grained memory increases for 'normal' memory. Can use memory
   hot-plug. Recovery of capacity is likely to only be possible on VM
   shutdown.

Both these use cases are well covered by existing solutions, so we can ignore
them for the rest of this document.

Application specific or shared dynamic capacity - VM options.
-------------------------------------------------------------

1. Memory hot-plug - but with the specific purpose memory flag set in the EFI
   memory map. The current default policy is to bring such memory up as normal
   memory. That policy can be adjusted via a kernel option or Kconfig so it
   turns up as DAX instead. We 'could' augment the metadata associated with
   such hot-plugged memory with the UUID / tag from an underlying bare metal
   DAX device.

2. Virtio-mem - it may be possible to fit this use case within an extended
   virtio-mem.

3. Emulate a CXL Type 3 device.

4. Other options?

Memory hotplug
--------------

This is the heavyweight solution, but it should 'work' if we close a
specification gap. Granularity limitations are unlikely to be a big problem
given anticipated CXL devices.

Today, the EFI memory map has an attribute EFI_MEMORY_SP, for "Specific
Purpose Memory", intended to notify the operating system that it can use the
memory as normal, but that the memory is there for a specific use case and so
might be wanted back at any point. This memory attribute can be provided in
the memory map at boot time and, if associated with EfiReservedMemoryType, can
be used to indicate a range of HPA space where memory that is hot-plugged
later should be treated as 'special'.

There isn't an obvious path to associate a particular range of hot-plugged
memory with a UUID / tag. I'd expect we'd need to add something to the ACPI
specification to enable this.

Virtio-mem
----------

The design goals of virtio-mem [1] mean that it is not 'directly' applicable
to this case, but it could perhaps be adapted with the addition of metadata
and DAX + guaranteed removal of explicit extents.

[1] virtio-mem: Paravirtualized Memory Hot(Un)Plug, David Hildenbrand and
    Martin Schulz, VEE '21

Emulating a CXL Type 3 Device
-----------------------------

Concerns raised about just emulating a CXL topology:

* A CXL Type 3 device is pretty complex.
* All we need is a tag + make it DAX, so surely this is too much?

Possible advantages:

* The kernel is exactly the same as that running on the host. No new drivers
  or changes to existing drivers are needed, as what we are presenting is a
  possible device topology - one which may be much simpler than that of the
  host.

Complexity:
***********

We don't emulate everything that can exist in physical topologies:

- One emulated device per host CXL Fixed Memory Window (I think we can't quite
  get away with just one in total due to BW / latency discovery).
- Direct connect each emulated device to an emulated RP + Host Bridge.
- Single CXL Fixed Memory Window. Never present interleave (that's a host-only
  problem).
- Can probably always present a single extent per DAX region (if we don't mind
  burning some GPA space to avoid fragmentation).
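For illustration only, the sort of minimal, non-interleaved topology described
above can already be put together with the existing upstream QEMU CXL
emulation. The invocation below (volatile memory, no DCD; backend sizes and
IDs are arbitrary) is roughly what a single emulated window might look like -
a DCD-capable setup would add further device options on top of this:

qemu-system-x86_64 -M q35,cxl=on -m 4G -smp 4 \
    -object memory-backend-ram,id=vmem0,share=on,size=256M \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port0,volatile-memdev=vmem0,id=cxl-vmem0 \
    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

That is one pxb-cxl host bridge, one root port, one Type 3 endpoint and one
fixed memory window - nothing else.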
In most real deployments, that's 1 CFMW, 1 pass-through expander bridge, 1 RP
and 1 EP. We would probably lock down the decoders before presentation to the
kernel. Locking down routing is already supported by Linux, as a BIOS may do
this. That lock-down simplifies the emulation.

We already have most of what is needed emulated and upstream in QEMU, with the
exception of a necessary optimization to avoid interleave decoding (interleave
is not relevant here; that support is all for testing topology handling).
PoC-level code exists for that bit. The other aspect not yet enabled is
hot-plugging additional memory backends into a single emulated CXL Type 3
device. I don't anticipate that being a problem, but a PoC is needed to be
sure.

One possible corner case is that the Dynamic Capacity Flows on a physical
machine require flushing caches due to changes of the physical address map.
Care may be needed to silently drop such flushes if they are issued from the
guest, as nothing in the physical address map will actually change when
capacity is added or released.

Today, Linux associates a single NUMA node with a CXL Fixed Memory Window.
Whilst this is a limitation of the Linux handling, to avoid major changes to
that infrastructure it may make sense to present multiple CXL Fixed Memory
Windows, so that the guest can have separate NUMA nodes for memory pools with
different characteristics.

So I agree that the complexity of this solution is a valid concern, but it is
mostly emulation complexity. As emulated devices go it's not that complex (and
we have most of it in place already and upstream in QEMU, with Fan's DCD
emulation support going in recently).

Error handling:
***************

What we mostly care about here is memory corruption. Protocol errors may be
relevant if we can contain the resulting resets, but that is mostly a host
problem. Synchronous memory errors should surface the same as normal.
Asynchronous errors can either use FW-first error injection into the VMM, or
inject emulated device errors (some support is already in QEMU, with
additional support under review).

Conclusion for Type 3 emulation
*******************************

Seems doable. The complexity is in the control paths in the VMM. No kernel
changes needed (I think!).

What I'm looking for from this discussion
=========================================

- Blockers! What problems do people anticipate with each approach?
- General agreement on what we 'might' support in the kernel / QEMU / other
  VMMs.
- Are there other use cases with similar requirements that we should
  incorporate?

Appendix: Known corner cases
============================

These are here mostly for completeness and to track things we need to solve,
rather than because they should greatly influence the path taken.

CXL Type 3 Performance discovery
--------------------------------

The discussion above suggests that we would represent interleaved CXL devices
as a single device. The NUMA characteristics of CXL-attached memory are
calculated based partly on the PCIe link register values, which currently
allow at most a x16 64 GT/s link. Presenting several interleaved
higher-performance devices as a single device may therefore require
representing a device faster than the hardware specifications allow: two
interleaved x16 64 GT/s devices, for example, provide roughly twice the
bandwidth that a single emulated x16 64 GT/s link can express. If this turns
out to be a practical problem, solutions such as a PCIe DVSEC capability could
be used to provide accurate information.
If we can ensure the emulated link is not acting as a bottleneck, the rest of
the performance information from the topology can be mapped to a combination
of emulated host HMAT entries and emulated CDAT data provided by the emulated
Type 3 device.

Migration
---------

VM migration will either have to remove all extents, or appropriately
prepopulate them prior to migration. There are possible ways this may be done
with the same memory pool contents via 'temporal' sharing, but in general this
may bring additional complexity.

Kexec etc. will be similar to how we handle it on the host - probably just
give all the capacity back.