Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)

John Groves <John@xxxxxxxxxx> · Sun, 18 Aug 2024 21:12:34 -0500

On 24/08/15 05:22PM, Jonathan Cameron wrote:
> Introduction
> ============
> 
> If we think application specific memory (including inter-host shared memory) is
> a thing, it will also be a thing people want to use with virtual machines,
> potentially nested. So how do we present it at the Host to VM boundary?
> 
> This RFC is perhaps premature given we haven't yet merged upstream support for
> the bare metal case. However I'd like to get the discussion going given we've
> touched briefly on this in a number of CXL sync calls and it is clear no one is

Excellent write-up, thanks Jonathan.

Hannes' suggestion of an in-person discussion at LPC is a great idea - count me in.

As the proprietor of famfs [1] I have many thoughts.

First, I like the concept of application-specific memory (ASM), but I wonder
if there might be a better term for it. ASM suggests that there is one
application, but I'd suggest that a more concise statement of the concept
is that the Linux kernel never accesses or mutates the memory - even though
multiple apps might share it (e.g. via famfs). It's a subtle point, but
an important one for RAS etc. ASM might better be called non-kernel-managed
memory - though that name does not have as good a ring to it. Will mull this
over further...

Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
some of which will be obvious to many of you:

* A DCD is just a memory device with an allocator and host-level
  access-control built in.
* Usable memory from a DCD is not available until the fabric manager (likely
  on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
  command to the DCD.
* A DCD allocation has a tag (uuid) which is the invariant way of identifying
  the memory from that allocation.
* The tag becomes known to the host from the DCD extents provided via
  a CXL event following successful allocation.
* The memory associated with a tagged allocation will surface as a dax device
  on each host that has access to it. But of course dax device naming &
  numbering won't be consistent across separate hosts - so we need to use
  the uuids to find specific memory.
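
To make the naming point concrete, here is a minimal Python sketch (an
illustration, not kernel code). The DaxMapping record, the host names and the
uuid value are all made up; the only things taken from the points above are
that the tag is invariant and the dax name is host-local.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from uuid import UUID


@dataclass(frozen=True)
class DaxMapping:
    """Hypothetical record: one tagged allocation as seen from one host."""
    host: str       # host that has access to the allocation
    dax_name: str   # host-local device name; NOT stable across hosts
    tag: UUID       # tag of the DCD allocation; the invariant identifier


def find_dax_by_tag(mappings: Iterable[DaxMapping], host: str, tag: UUID) -> Optional[str]:
    """Return the host-local dax name for a tag, or None if the host lacks access."""
    for m in mappings:
        if m.host == host and m.tag == tag:
            return m.dax_name
    return None


# Example data (invented): the same allocation surfaces under different names.
TAG = UUID("12345678-1234-5678-1234-567812345678")
MAPPINGS = [
    DaxMapping("hostA", "dax0.0", TAG),
    DaxMapping("hostB", "dax3.0", TAG),
]
```

Asking for TAG on hostA and on hostB yields two different device names for the
same memory, which is why anything passed between hosts should be the tag and
not the dax name.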

A few less foundational observations:

* It does not make sense to "online" shared or sharable memory as system-ram,
  because system-ram gets zeroed, which blows up use cases for sharable memory.
  So the default for sharable memory must be devdax mode.
* Tags are mandatory for sharable allocations, and allowed but optional for
  non-sharable allocations. The implication is that non-sharable allocations
  may get onlined automatically as system-ram, so we don't need a namespace
  for those. (I argued for mandatory tags on all allocations - hey you don't
  have to use them - but encountered objections and dropped it.)
* CXL access control only goes to host root ports; CXL has no concept of
  giving access to a VM. So some component on a host (perhaps logically
  an orchestrator component) needs to plumb memory to VMs as appropriate.

So tags are a namespace to find specific memory "allocations" (which in the
CXL consortium, we usually refer to as "tagged capacity").

In an orchestrated environment, the orchestrator would allocate resources
(including tagged memory capacity), make that capacity visible on the right
host(s), and then provide the tag when starting the app if needed.

If (e.g.) the memory contains a famfs file system, famfs needs the uuid of the
root memory allocation to find the right memory device. Once mounted, it's a
file system so apps can be directed to the mount path. Apps that consume the
dax devices directly also need the uuid because /dev/dax0.0 is not invariant
across a cluster...

I have been assuming that when the CXL stack discovers a new DCD allocation,
it will configure the devdax device and provide some way to find it by tag.
/sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
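
As a rough userspace sketch of what such a lookup could look like (Python, for
illustration only): it assumes a directory tree shaped like <root>/<tag>/dev,
where the "dev" file is assumed to hold the dax device name. No such interface
exists today; the layout, the file contents and the resolve_tag() helper are
hypothetical. The root is a parameter precisely so that the lookup is not tied
to a CXL-specific path.

```python
import os
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def resolve_tag(tag_root: str, tag: str) -> Optional[str]:
    """Resolve a tag to a /dev path via a hypothetical <tag_root>/<tag>/dev file.

    Returns None if no entry exists for the tag. Raises ValueError if the tag
    is not a valid uuid string.
    """
    canonical = str(uuid.UUID(tag))  # normalize to lowercase hyphenated form
    entry = Path(tag_root) / canonical / "dev"
    try:
        name = entry.read_text().strip()  # assumed content: e.g. "dax0.0"
    except FileNotFoundError:
        return None
    return os.path.join("/dev", name)


if __name__ == "__main__":
    # Build a throwaway tree that mimics the assumed layout, then query it.
    with tempfile.TemporaryDirectory() as root:
        tag = "12345678-1234-5678-1234-567812345678"
        os.makedirs(os.path.join(root, tag))
        Path(root, tag, "dev").write_text("dax0.0\n")
        print(resolve_tag(root, tag))
```

The same function would work unchanged in a VM if the guest exposed the same
tree, whatever ends up populating it.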

General thoughts regarding VMs and qemu

Physical connections to CXL memory are handled by physical servers. I don't
think there is a scenario in which a VM should interact directly with the
pcie function(s) of CXL devices. They will be configured as dax devices
(findable by their tags!) by the host OS, and should be provided to VMs
(when appropriate) as DAX devices. And software in a VM needs to be able to
find the right DAX device the same way it would running on bare metal - by
the tag.

Qemu can already get memory from files (-object memory-backend-file,...), and
I believe this works whether it's an actual file or a devdax device. So far,
so good.

Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
not a virtual devdax device. I think virtual devdax is needed as a first-class
abstraction. If we can add the tag as a property of the memory-backend-file,
we're almost there - we just need a way to look up a daxdev by tag.
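
For reference, a small Python sketch of what is possible today: resolve the
tag on the host first, then hand qemu the devdax path as the backing for a
virtual pmem (nvdimm) device. The qemu_pmem_args() helper and the tag-to-path
mapping are hypothetical stand-ins for the missing lookup mechanism; the tag
is not passed to qemu, since a tag property on memory-backend-file is only a
proposal here. The option names used (id, mem-path, size, share, align,
memdev) are existing qemu options as far as I know, but check them against
your qemu version.

```python
from typing import List, Mapping


def qemu_pmem_args(tag: str, tag_to_dax: Mapping[str, str], size: str,
                   obj_id: str = "mem0", align: str = "2M") -> List[str]:
    """Build qemu arguments backing a virtual nvdimm with a host devdax device.

    tag_to_dax is a hypothetical host-side map from tag to devdax path.
    Raises KeyError if the tag is not known on this host.
    """
    dax_path = tag_to_dax[tag]
    backend = (
        f"memory-backend-file,id={obj_id},mem-path={dax_path},"
        f"size={size},share=on,align={align}"
    )
    device = f"nvdimm,id=nv-{obj_id},memdev={obj_id}"
    # The machine also needs nvdimm support enabled and spare memory slots
    # (e.g. -machine ...,nvdimm=on and -m ...,slots=...,maxmem=...); those
    # settings are site-specific and left out of this sketch.
    return ["-object", backend, "-device", device]
```

Note that the guest sees a pmem device here, not a devdax device, and the tag
is lost at the VM boundary - which is exactly the gap described above.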

Summary thoughts:

* A mechanism for resolving tags to "tagged capacity" devdax devices is
  essential (and I don't think there are specific proposals about this
  mechanism so far).
* Said mechanism should not be explicitly CXL-specific.
* Finding a tagged capacity devdax device in a VM should work the same as it
  does running on bare metal.
* The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
* Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
  points for being easy to implement in both physical and virtual systems.

Thanks for teeing this up!
John

[1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md