On Tue, 17 Sep 2024 20:56:53 +0100 Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> On Tue, 17 Sep 2024 19:37:21 +0000
> Jonathan Cameron <jonathan.cameron@xxxxxxxxxx> wrote:
> 
> > Plan is currently to meet at the LPC registration desk at 2pm tomorrow (Wednesday) and we will find a room.
> > 
> 
> And now the internet maybe knows my phone number (serves me right for using
> my company mobile app that auto-added a signature).
> I might have been lucky and it didn't hit the archives because
> the formatting was too broken..
> 
> Anyhow, see some of you tomorrow. I didn't manage to borrow a Jabra mic,
> so remote will be tricky, but feel free to reach out and we might be
> able to sort something out.
> 
> Intent is this will be an informal BoF, so we'll figure out the scope
> at the start of the meeting.
> 
> Sorry for the noise!

Hack room 1.14 now if anyone is looking for us.

> 
> Jonathan
> 
> > J
> > On Sun, 18 Aug 2024 21:12:34 -0500
> > John Groves <John@xxxxxxxxxx> wrote:
> > 
> > > On 24/08/15 05:22PM, Jonathan Cameron wrote:
> > > > Introduction
> > > > ============
> > > > 
> > > > If we think application specific memory (including inter-host shared memory) is
> > > > a thing, it will also be a thing people want to use with virtual machines,
> > > > potentially nested. So how do we present it at the Host to VM boundary?
> > > > 
> > > > This RFC is perhaps premature given we haven't yet merged upstream support for
> > > > the bare metal case. However I'd like to get the discussion going given we've
> > > > touched briefly on this in a number of CXL sync calls and it is clear no one is
> > > 
> > > Excellent write-up, thanks Jonathan.
> > > 
> > > Hannes' idea of an in-person discussion at LPC is a great one - count me in.
> > 
> > Had a feeling you might say that ;)
> > 
> > > As the proprietor of famfs [1] I have many thoughts.
> > > 
> > > First, I like the concept of application-specific memory (ASM), but I wonder
> > > if there might be a better term for it. ASM suggests that there is one
> > > application, but I'd suggest that a more concise statement of the concept
> > > is that the Linux kernel never accesses or mutates the memory - even though
> > > multiple apps might share it (e.g. via famfs). It's a subtle point, but
> > > an important one for RAS etc. ASM might better be called non-kernel-managed
> > > memory - though that name does not have as good a ring to it. Will mull this
> > > over further...
> > 
> > Naming is always the hard bit :) I agree that one doesn't work for
> > shared capacity. You can tell I didn't start there :)
> > 
> > > Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
> > > some of which will be obvious to many of you:
> > > 
> > > * A DCD is just a memory device with an allocator and host-level
> > >   access control built in.
> > > * Usable memory from a DCD is not available until the fabric manager (likely
> > >   on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
> > >   command to the DCD.
> > > * A DCD allocation has a tag (uuid), which is the invariant way of identifying
> > >   the memory from that allocation.
> > > * The tag becomes known to the host from the DCD extents provided via
> > >   a CXL event following successful allocation.
> > > * The memory associated with a tagged allocation will surface as a dax device
> > >   on each host that has access to it. But of course dax device naming &
> > >   numbering won't be consistent across separate hosts - so we need to use
> > >   the uuids to find specific memory.
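To make that uuid-to-device lookup concrete, here is a rough userspace sketch. It
assumes a hypothetical per-device "tag" attribute under /sys/bus/dax/devices/ -
nothing upstream exports that today, so treat the path and attribute name as
placeholders for whatever interface we end up with, not as a proposal in themselves.

#!/usr/bin/env python3
# Rough sketch only: resolve a DCD tag (uuid) to a /dev/daxN.M node.
# Assumes a hypothetical per-device "tag" attribute under
# /sys/bus/dax/devices/<name>/ - no such attribute exists upstream yet.
import os
import sys

DAX_SYSFS = "/sys/bus/dax/devices"

def daxdev_by_tag(tag: str) -> str:
    """Return the /dev path of the dax device whose (hypothetical) tag matches."""
    for name in os.listdir(DAX_SYSFS):
        attr = os.path.join(DAX_SYSFS, name, "tag")    # hypothetical attribute
        try:
            with open(attr) as f:
                if f.read().strip().lower() == tag.lower():
                    return "/dev/" + name               # e.g. /dev/dax0.0
        except FileNotFoundError:
            continue                                    # device carries no tag
    raise LookupError(f"no dax device with tag {tag}")

if __name__ == "__main__":
    print(daxdev_by_tag(sys.argv[1]))

The attraction of something of this shape is that the same helper would run
unchanged inside a guest, which is exactly the property asked for later in
the thread.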
> > > 
> > > A few less foundational observations:
> > > 
> > > * It does not make sense to "online" shared or sharable memory as system-ram,
> > >   because system-ram gets zeroed, which blows up use cases for sharable memory.
> > >   So the default for sharable memory must be devdax mode.
> > 
> > (CXL specific diversion)
> > 
> > Absolutely agree with this. There is a 'corner' that irritates me in the spec though,
> > which is that there is no distinction between shareable and shared capacity.
> > If we are in a constrained setup with limited HPA or DPA space, we may not want
> > to have separate DCD regions for these. Thus it is plausible that an orchestrator
> > might tell a memory appliance to present memory for general use and yet it
> > surfaces as shareable. So there may need to be an opt-in path, at least for
> > going ahead and using this memory as normal RAM.
> > 
> > > * Tags are mandatory for sharable allocations, and allowed but optional for
> > >   non-sharable allocations. The implication is that non-sharable allocations
> > >   may get onlined automatically as system-ram, so we don't need a namespace
> > >   for those. (I argued for mandatory tags on all allocations - hey, you don't
> > >   have to use them - but encountered objections and dropped it.)
> > > * CXL access control only goes to host root ports; CXL has no concept of
> > >   giving access to a VM. So some component on a host (perhaps logically
> > >   an orchestrator component) needs to plumb memory to VMs as appropriate.
> > 
> > Yes. It's some mashup of an orchestrator and VMM / libvirt / local library
> > of your choice. We can just group that into the ill-defined concept of
> > a distributed orchestrator.
> > 
> > > So tags are a namespace to find specific memory "allocations" (which in the
> > > CXL consortium we usually refer to as "tagged capacity").
> > > 
> > > In an orchestrated environment, the orchestrator would allocate resources
> > > (including tagged memory capacity), make that capacity visible on the right
> > > host(s), and then provide the tag when starting the app if needed.
> > > 
> > > If (e.g.) the memory contains a famfs file system, famfs needs the uuid of the
> > > root memory allocation to find the right memory device. Once mounted, it's a
> > > file system, so apps can be directed to the mount path. Apps that consume the
> > > dax devices directly also need the uuid because /dev/dax0.0 is not invariant
> > > across a cluster...
> > > 
> > > I have been assuming that when the CXL stack discovers a new DCD allocation,
> > > it will configure the devdax device and provide some way to find it by tag -
> > > /sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
> > > around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
> > 
> > Agreed. Whether that's a nice kernel-side thing, or a utility pulling data
> > from various kernel subsystem interfaces, doesn't really matter. I'd prefer
> > the kernel presents this, but maybe that won't work for some reason.
> > 
> > > General thoughts regarding VMs and qemu
> > > 
> > > Physical connections to CXL memory are handled by physical servers. I don't
> > > think there is a scenario in which a VM should interact directly with the
> > > PCIe function(s) of CXL devices. They will be configured as dax devices
> > > (findable by their tags!) by the host OS, and should be provided to VMs
> > > (when appropriate) as DAX devices. And software in a VM needs to be able to
> > > find the right DAX device the same way it would running on bare metal - by
> > > the tag.
> > 
> > Limiting to typical type 3 memory pool devices. Agreed. The other CXL device
> > types are a can of worms for another day.
> > 
> > > Qemu can already get memory from files (-object memory-backend-file,...), and
> > > I believe this works whether it's an actual file or a devdax device. So far,
> > > so good.
> > > 
> > > Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
> > > not a virtual devdax device. I think virtual devdax is needed as a first-class
> > > abstraction. If we can add the tag as a property of the memory-backend-file,
> > > we're almost there - we just need a way to look up a daxdev by tag.
> > 
> > I'm not sure that is simple. We'd need to define a new interface capable of:
> > 1) Hotplug - potentially of many separate regions (think nested VMs).
> >    That more or less rules out using separate devices on a discoverable hotpluggable
> >    bus. We'd run out of bus numbers too quickly if putting them on PCI.
> >    ACPI-style hotplug is worse because we have to provision slots at the outset.
> > 2) Runtime provision of metadata - performance data at the very least (bandwidth /
> >    latency etc). In theory we could wire up ACPI _HMA, but no one has ever bothered.
> > 3) Probably do want async error signaling. We 'could' do that with
> >    FW-first error injection - I'm not sure it's a good idea but it's definitely
> >    an option.
> > 
> > A locked-down CXL device is a bit more than that, but not very much more.
> > It's easy to fake registers for things that are always in one state so
> > that the software stack is happy.
> > 
> > virtio-mem has some of the parts and could perhaps be augmented
> > to support this use case, with the advantage of no implicit tie to CXL.
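For reference, a sketch of what the existing plumbing mentioned above looks like
from the host side: a devdax node (resolved from the tag, however that ends up
working) backing a virtual NVDIMM via memory-backend-file, so the guest sees the
capacity as pmem rather than as a first-class virtual devdax. The path, sizes and
ids below are made up, and there is currently no tag property on the backend -
that is exactly the gap being discussed.

#!/usr/bin/env python3
# Sketch of today's plumbing: back a virtual NVDIMM with a host devdax node
# via memory-backend-file, so the guest sees the tagged capacity as pmem.
# Path, sizes and ids are placeholders.
host_daxdev = "/dev/dax0.0"      # e.g. resolved from the tag as sketched earlier
size = "16G"                     # must match the devdax region size

qemu_cmd = [
    "qemu-system-x86_64",
    "-machine", "pc,nvdimm=on",
    "-m", "4G,slots=4,maxmem=36G",
    "-object", ("memory-backend-file,id=tagmem0,share=on,"
                f"mem-path={host_daxdev},size={size},align=2M"),
    "-device", "nvdimm,id=nv0,memdev=tagmem0",
    # ... disk, network, etc. elided ...
]
print(" ".join(qemu_cmd))        # in practice libvirt / the orchestrator would launch this

Everything after that point - finding the capacity by tag inside the guest - is
still up to guest software, which is why the summary below asks for the lookup to
work the same as on bare metal.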
> > 
> > > Summary thoughts:
> > > 
> > > * A mechanism for resolving tags to "tagged capacity" devdax devices is
> > >   essential (and I don't think there are specific proposals about this
> > >   mechanism so far).
> > 
> > Agreed.
> > 
> > > * Said mechanism should not be explicitly CXL-specific.
> > 
> > Somewhat agreed, but I don't want to invent a new spec just to avoid explicit
> > ties to CXL. I'm not against using CXL to present HBM / ACPI Specific Purpose
> > memory, for example, to a VM. It will trivially work if that is what a user
> > wants to do, and it also illustrates that this stuff doesn't necessarily just
> > apply to capacity on a memory pool - it might just be 'weird' memory on the host.
> > 
> > > * Finding a tagged capacity devdax device in a VM should work the same as it
> > >   does running on bare metal.
> > 
> > Absolutely - that's a requirement.
> > 
> > > * The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
> > 
> > Maybe. I'm not convinced the abstraction is needed at that particular level.
> > 
> > > * Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
> > >   points for being easy to implement in both physical and virtual systems.
> > 
> > For physical systems we aren't going to get agreement :( For the systems
> > I have visibility of there will be some diversity in hardware, but consistency
> > in the presentation to userspace and above should be doable.
> > 
> > Jonathan
> > 
> > > Thanks for teeing this up!
> > > John
> > > 
> > > [1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
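One last illustrative sketch, since it shows why the virtual devdax / virtio-mem
question above matters: with the pmem-style plumbing, the guest has to convert the
namespace back to devdax itself. Assuming ndctl and daxctl are present in the guest
and that namespace0.0 is the namespace backing the virtual NVDIMM (both assumptions),
that conversion looks roughly like:

#!/usr/bin/env python3
# Guest-side sketch: reconfigure the pmem namespace that backs the tagged
# capacity into devdax mode, then report the resulting dax character device.
# namespace0.0 is an assumption - adjust to the actual namespace.
import json
import subprocess

NAMESPACE = "namespace0.0"

# Reconfigure the existing namespace to devdax mode (destructive, hence -f).
subprocess.run(["ndctl", "create-namespace", "-f", "-e", NAMESPACE,
                "--mode=devdax"], check=True)

# List dax devices (JSON) and print their character device names.
out = subprocess.run(["daxctl", "list"], check=True,
                     capture_output=True, text=True).stdout
devs = json.loads(out)
if isinstance(devs, dict):
    devs = [devs]
for dev in devs:
    if dev.get("chardev"):
        print(dev["chardev"])    # e.g. dax0.0

Workable, but it relies on guessing namespace numbering, which is exactly the kind
of thing a tag-based lookup should replace.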