Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)

Sorry for the late reply ...

Later on, some magic entity - let's call it an orchestrator - will tell the

In the virtio-mem world, that's usually something (admin/tool/whatever) in the hypervisor. What does it look like with CXL on bare metal?

memory pool to provide memory to a given host. The host gets notified by
the device of an 'offer' of specific memory extents and can accept it, after
which it may start to make use of the provided memory extent.

Those address ranges may be shared across multiple hosts (in which case they
are not for general use), or may be dedicated memory intended for use as
normal RAM.

Whilst the granularity of DCD extents is allowed by the specification to be very
fine (64 bytes), in reality my expectation is that no one will build general
purpose memory pool devices with fine granularity.
Memory hot-plug options (bare metal)
------------------------------------

By default, these extents will surface as either:
1) Normal memory hot-plugged into a NUMA node.
2) DAX - requiring applications to map that memory directly or use
    a filesystem etc.

There are various ways to apply policy to this. One is to base the policy
decision on a 'tag' that is associated with a set of DPA extents. That 'tag'
is metadata that originates at the orchestrator. It's big enough to hold a
UUID, so can convey whatever meaning is agreed by the orchestrator and the
software running on each host.

Memory pools tend to want a guarantee that, when circumstances change
(a workload finishes etc.), they can get the resources they allocated back.

Of course they want that guarantee. *insert usual unicorn example*

We can usually try hard, but "guarantee" is really a strong requirement that I am afraid we won't be able to give in many scenarios.

I'm sure CXL people were aware this is one of the basic issues of memory hotunplug (at least I kept telling them). If not, they didn't do their research properly or tried to ignore it.

CXL brings polite ways of asking for the memory back and big hammers for
when the host ignores things (which may well crash a naughty host).
Reliable hot unplug continues to be a challenge for memory that is 'normal'
because not all of its use / lifetime is tied to a particular
application.

Yes. And crashing is worse than anything else. Rather shutdown/reboot the offending machine in a somewhat nice way instead of crashing it.


Application specific memory
---------------------------

The DAX path enables association of the memory with a single application
by allowing that application to simply mmap the appropriate /dev/daxX.Y.
That device optionally has an associated tag.

When the application closes or otherwise releases that memory we can
guarantee to be able to recover the capacity.  Memory provided to an
application this way will be referred to here as Application Specific Memory.
This model also works for HBM or other 'better' memory that is reserved for
specific use cases.
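
To make the application specific path concrete, the consumer side can be as
simple as the sketch below. This is purely illustrative: it assumes
/dev/dax0.0 is the device created for the offered capacity and that the
mapping size respects the DAX region's alignment.

/*
 * Minimal sketch: an application using a device DAX node directly.
 * Assumes /dev/dax0.0 is backed by the offered extents and that 1 GiB
 * respects the region's mapping alignment.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1UL << 30; /* 1 GiB of the offered capacity */
        int fd = open("/dev/dax0.0", O_RDWR);
        void *p;

        if (fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }

        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return EXIT_FAILURE;
        }

        /* ... application puts its data here ... */

        /*
         * Unmapping (or exiting) is what lets the host reliably reclaim
         * the full capacity and hand it back to the pool.
         */
        munmap(p, len);
        close(fd);
        return EXIT_SUCCESS;
}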

So the flow is something like:
1. Cloud orchestrator decides it's going to run in-memory database A
    on host W.
2. Memory appliance Z is told to 'offer' 1TB of memory to host W with
    UUID / tag wwwwxxxxzzzz
3. Host W accepts that memory (why would it say no?) and creates a
    DAX device for which the tag is discoverable.

Maybe there could be limitations (maximum addressable PFN?) where we would have to reject it? Not sure.

4. Orchestrator tells host W to launch the workload and that it
    should use the memory provided with tag wwwwxxxxzzzz.
5. Host launches DB and tells it to use DAX device with tag wwwwxxxxzzzz
    which the DB then mmap()s and loads its database data into (see the
    tag lookup sketch just after this flow).
... sometime later....
6. Orchestrator tells host W to close that DB and release the memory
    allocated from the pool.
7. Host gives the memory back to the memory appliance which can then use
    it to provide another host with the necessary memory.
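
Step 3 glosses over how the tag becomes discoverable; that interface is still
to be settled. Purely for illustration, assuming a (hypothetical, not yet
existing) per-device 'tag' attribute under /sys/bus/dax/devices/, the lookup
in step 5 could boil down to:

/*
 * Hypothetical sketch only: locate a DAX device by tag. The 'tag' sysfs
 * attribute shown here does not exist today; it stands in for whatever
 * discovery interface we end up with.
 */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

static int find_dax_by_tag(const char *wanted, char *devname, size_t len)
{
        DIR *dir = opendir("/sys/bus/dax/devices");
        struct dirent *d;

        if (!dir)
                return -1;

        while ((d = readdir(dir))) {
                char path[PATH_MAX];
                char tag[64] = "";
                FILE *f;

                if (d->d_name[0] == '.')
                        continue;

                snprintf(path, sizeof(path),
                         "/sys/bus/dax/devices/%s/tag", d->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fgets(tag, sizeof(tag), f))
                        tag[strcspn(tag, "\n")] = '\0';
                fclose(f);

                if (!strcmp(tag, wanted)) {
                        snprintf(devname, len, "/dev/%s", d->d_name);
                        closedir(dir);
                        return 0;
                }
        }
        closedir(dir);
        return -1;
}

Something like find_dax_by_tag("wwwwxxxxzzzz", path, sizeof(path)) would then
hand the resulting /dev/daxX.Y to the DB before it mmap()s it.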

This approach requires applications or at least memory allocation libraries to
be modified.  The guarantees of getting the memory they asked for, and of
definitely being able to safely give the memory back when done, may make such
software modifications worthwhile.

There are disadvantages and bloat issues if the 'wrong' amount of memory is
allocated to the application. So these techniques only work when the
orchestrator has the necessary information about the workload.

Yes.


Note that one specific example of application specific memory is virtual
machines; in that case the virtual machine is the application.
Later on it may be useful to consider the example of the specific
application in a VM being a nested virtual machine.

Shared Memory - closely related!
--------------------------------

CXL enables a number of different types of memory sharing across multiple
hosts:
- Read only shared memory (suitable for Apache Arrow for example)
- Hardware Coherent shared memory.
- Software managed coherency.

Do we have any timeline for when we will see real shared-memory devices? z/VM has supported shared segments between VMs for a couple of decades.


These surface using the same machinery as non-shared DCD extents. Note however
that the presentation, in terms of extents, to different hosts is not the same
(they can be different extents, in an unrelated order) but, along with tags,
shared extents have sufficient data to 'construct' a virtual address to HPA
mapping that makes them look the same to aware applications or file systems.
The currently proposed approach to this is to surface the extents via DAX and
apply a filesystem approach to managing the data.
https://lpc.events/event/18/contributions/1827/
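
For the read only sharing case, the consumer side then looks like an ordinary
shared file mapping. A sketch, with the mount point and file name made up
purely for illustration:

/*
 * Sketch: read-only consumer of shared pooled memory exposed as a file on
 * a famfs-like filesystem. The path is an assumption for illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const void *map_shared_dataset(const char *path, size_t *lenp)
{
        struct stat st;
        void *p;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return NULL;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return NULL;
        }

        /*
         * All hosts see the same file contents; the per-host extent
         * layout underneath is hidden by the filesystem.
         */
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
                return NULL;

        *lenp = st.st_size;
        return p;
}

e.g. map_shared_dataset("/mnt/shared/dataset", &len);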

These two types of memory pooling activity (shared memory, application specific
memory) both require capacity associated with a tag to be presented to specific
users in a fashion that is 'separate' from normal memory hot-plug.

The virtualization question
===========================

Having made the assumption that the models above are going to be used in
practice, and that Linux will support them, the natural next step is to
assume that applications designed against them are going to be used in virtual
machines as well as on bare metal hosts.

The open question this RFC is aiming to start discussion around is how best to
present them to the VM.  I want to get that discussion going early because
some of the options I can see will require specification additions and / or
significant PoC / development work to prove them out.  Before we go there,
let us briefly consider other uses of pooled memory in VMs and how they
aren't really relevant here.

Other virtualization uses of memory pool capacity
-------------------------------------------------

1. Part of static capacity of VM provided from a memory pool.
    Can be presented as a NUMA setup, with HMAT etc providing performance data
    relative to other memory the VM is using. Recovery of pooled capacity
    requires shutting down or migrating the VM.
2. Coarse grained memory increases for 'normal' memory.
    Can use memory hot-plug. Recovery of capacity likely to only be possible on
    VM shutdown.

Is there a reason "movable" (ZONE_MOVABLE) is not an option, at least in some setups? If not, why?


Both these use cases are well covered by existing solutions so we can ignore
them for the rest of this document.

Application specific or shared dynamic capacity - VM options.
-------------------------------------------------------------

1. Memory hot-plug - but with the specific purpose memory flag set in the EFI
    memory map.  Current default policy is to bring those up as normal memory.
    That policy can be adjusted via kernel option or Kconfig so they turn up
    as DAX.  We 'could' augment the metadata associated with such hot-plugged
    memory with the UUID / tag from an underlying bare metal DAX device.

2. Virtio-mem - It may be possible to fit this use case within an extended
    virtio-mem.

3. Emulate a CXL type 3 device.

4. Other options?

Memory hotplug
--------------

This is the heavyweight solution but should 'work' if we close a specification
gap.  Granularity limitations are unlikely to be a big problem given anticipated
CXL devices.

Today, the EFI memory map has an attribute EFI_MEMORY_SP, for "Specific Purpose
Memory" intended to notify the operating system that it can use the memory as
normal, but it is there for a specific use case and so might be wanted back at
any point. This memory attribute can be provided in the memory map at boot
time and, if associated with EfiReservedMemoryType, can be used to indicate a
range of HPA space where memory that is hot-plugged later should be treated as
'special'.
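
For reference, detection is just the SP bit in the attribute field of a UEFI
memory descriptor. A simplified sketch (constants and layout per the UEFI
specification, trimmed to the relevant fields):

/*
 * Simplified sketch of spotting Specific Purpose memory in an EFI memory
 * map. Constants follow the UEFI specification; map walking and full
 * descriptor handling are omitted.
 */
#include <stdbool.h>
#include <stdint.h>

#define EFI_MEMORY_SP           0x0000000000040000ULL
#define EfiReservedMemoryType   0

struct efi_memory_descriptor {
        uint32_t type;
        uint32_t pad;
        uint64_t physical_start;
        uint64_t virtual_start;
        uint64_t number_of_pages;
        uint64_t attribute;
};

/*
 * Capacity hot-plugged into such an HPA range later should be treated as
 * 'special' (e.g. routed to DAX) rather than used as general purpose RAM.
 */
static bool range_is_specific_purpose(const struct efi_memory_descriptor *md)
{
        return (md->attribute & EFI_MEMORY_SP) &&
               md->type == EfiReservedMemoryType;
}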

There isn't an obvious path to associate a particular range of hot-plugged
memory with a UUID / tag.  I'd expect we'd need to add something to the ACPI
specification to enable this.

Virtio-mem
----------

The design goals of virtio-mem [1] mean that it is not 'directly' applicable
to this case, but it could perhaps be adapted with the addition of metadata
and DAX, plus guaranteed removal of explicit extents.
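
Purely as a hypothetical sketch (none of this exists in the virtio
specification or in virtio-mem today), the 'addition of metadata' could be as
small as a tag blob negotiated via a new feature bit:

/*
 * Hypothetical only: a possible shape for tag metadata on an extended
 * virtio-mem device. Neither the feature bit nor the struct are part of
 * any specification; this just illustrates the size of the ask.
 */
#include <stdint.h>

#define VIRTIO_MEM_F_TAG        3       /* hypothetical feature bit */

struct virtio_mem_tag_config {
        /*
         * 16 bytes: enough to carry the UUID agreed with the orchestrator,
         * mirroring the CXL DCD extent tag.
         */
        uint8_t tag[16];
        /*
         * Non-zero if plugged blocks must surface as DAX rather than as
         * normal memory, so the capacity can reliably be handed back.
         */
        uint8_t requires_dax;
        uint8_t reserved[7];
};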

Maybe it could be extended, or one could build something similar that is better tailored to the "shared memory" use case.


[1] [virtio-mem: Paravirtualized Memory Hot(Un)Plug, David Hildenbrand and
Martin Schulz, VEE '21]

Emulating a CXL Type 3 Device
-----------------------------

Concerns raised about just emulating a CXL topology:
* A CXL Type 3 device is pretty complex.
* All we need is a tag + make it DAX, so surely this is too much?

Possible advantages
* Kernel is exactly the same as that running on the host. No new drivers or
   changes to existing drivers needed, as what we are presenting is a possible
   device topology - which may be much simpler than the host's.
Complexity:
***********

We don't emulate everything that can exist in physical topologies.
- One emulated device per host CXL Fixed Memory Window
   (I think we can't quite get away with just one in total due to BW/Latency
    discovery)
- Direct connect each emulated device to an emulated RP + Host Bridge.
- Single CXL Fixed Memory Window.  Never present interleave (that's a host
   only problem).
- Can probably always present a single extent per DAX region (if we don't
   mind burning some GPA space to avoid fragmentation).

For "ordinary" hotplug virtio-mem provides real benefits over DIMMs. One thing to consider might be micro-vms where we want to emulate as little devices+infrastructure as possible.

So maybe looking into something paravirtualized that is more lightweight might make sense. Maybe not.

[...]

Migration
---------

VM migration will either have to remove all extents, or appropriately
prepopulate them prior to migration.  There are possible ways this
may be done with the same memory pool contents via 'temporal' sharing,
but in general this may bring additional complexity.
Kexec etc. will be similar to how we handle it on the host - probably
just give all the capacity back.

kdump?

--
Cheers,

David / dhildenb




