On Tue, 17 Sep 2024 20:56:53 +0100 Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> On Tue, 17 Sep 2024 19:37:21 +0000
> Jonathan Cameron <jonathan.cameron@xxxxxxxxxx> wrote:
> 
> > Plan is currently to meet at the LPC registration desk at 2pm tomorrow (Wednesday) and we will find a room.
> > 
> 
> And now the internet maybe knows my phone number (serves me right for using
> my company mobile app that auto-added a signature).
> I might have been lucky and it didn't hit the archives because
> the formatting was too broken..
> 
> Anyhow, see some of you tomorrow. I didn't manage to borrow a Jabra mic,
> so remote will be tricky, but feel free to reach out and we might be
> able to sort something out.
> 
> Intent is this will be an informal BoF, so we'll figure out the scope
> at the start of the meeting.
> 
> Sorry for the noise!

Hack room 1.14 now if anyone is looking for us.

> 
> Jonathan
> 
> > J
> > On Sun, 18 Aug 2024 21:12:34 -0500
> > John Groves <John@xxxxxxxxxx> wrote:
> > 
> > > On 24/08/15 05:22PM, Jonathan Cameron wrote:
> > > > Introduction
> > > > ============
> > > > 
> > > > If we think application specific memory (including inter-host shared memory) is
> > > > a thing, it will also be a thing people want to use with virtual machines,
> > > > potentially nested. So how do we present it at the Host to VM boundary?
> > > > 
> > > > This RFC is perhaps premature given we haven't yet merged upstream support for
> > > > the bare metal case. However I'd like to get the discussion going given we've
> > > > touched briefly on this in a number of CXL sync calls and it is clear no one is
> > > 
> > > Excellent write-up, thanks Jonathan.
> > > 
> > > Hannes' idea of an in-person discussion at LPC is a great one - count me in.
> > 
> > Had a feeling you might say that ;)
> > 
> > > As the proprietor of famfs [1] I have many thoughts.
> > > 
> > > First, I like the concept of application-specific memory (ASM), but I wonder
> > > if there might be a better term for it. ASM suggests that there is one
> > > application, but I'd suggest that a more concise statement of the concept
> > > is that the Linux kernel never accesses or mutates the memory - even though
> > > multiple apps might share it (e.g. via famfs). It's a subtle point, but
> > > an important one for RAS etc. ASM might better be called non-kernel-managed
> > > memory - though that name does not have as good a ring to it. Will mull this
> > > over further...
> > 
> > Naming is always the hard bit :) I agree that one doesn't work for
> > shared capacity. You can tell I didn't start there :)
> > 
> > > Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
> > > some of which will be obvious to many of you:
> > > 
> > > * A DCD is just a memory device with an allocator and host-level
> > >   access control built in.
> > > * Usable memory from a DCD is not available until the fabric manager (likely
> > >   on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
> > >   command to the DCD.
> > > * A DCD allocation has a tag (uuid), which is the invariant way of identifying
> > >   the memory from that allocation.
> > > * The tag becomes known to the host from the DCD extents provided via
> > >   a CXL event following successful allocation.
> > > * The memory associated with a tagged allocation will surface as a dax device
> > >   on each host that has access to it. But of course dax device naming &
> > >   numbering won't be consistent across separate hosts - so we need to use
> > >   the uuids to find specific memory.
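To make that uuid-to-device lookup concrete, here is a rough userspace sketch. It
assumes a hypothetical per-device "tag" attribute under /sys/bus/dax/devices/ -
nothing upstream exports that today, so treat the path and attribute name as
placeholders for whatever interface we end up with, not as a proposal in themselves.

#!/usr/bin/env python3
# Rough sketch only: resolve a DCD tag (uuid) to a /dev/daxN.M node.
# Assumes a hypothetical per-device "tag" attribute under
# /sys/bus/dax/devices/<name>/ - no such attribute exists upstream yet.
import os
import sys

DAX_SYSFS = "/sys/bus/dax/devices"

def daxdev_by_tag(tag: str) -> str:
    """Return the /dev path of the dax device whose (hypothetical) tag matches."""
    for name in os.listdir(DAX_SYSFS):
        attr = os.path.join(DAX_SYSFS, name, "tag")    # hypothetical attribute
        try:
            with open(attr) as f:
                if f.read().strip().lower() == tag.lower():
                    return "/dev/" + name               # e.g. /dev/dax0.0
        except FileNotFoundError:
            continue                                    # device carries no tag
    raise LookupError(f"no dax device with tag {tag}")

if __name__ == "__main__":
    print(daxdev_by_tag(sys.argv[1]))

The attraction of something of this shape is that the same helper would run
unchanged inside a guest, which is exactly the property asked for later in
the thread.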
> > > 
> > > A few less foundational observations:
> > > 
> > > * It does not make sense to "online" shared or sharable memory as system-ram,
> > >   because system-ram gets zeroed, which blows up use cases for sharable memory.
> > >   So the default for sharable memory must be devdax mode.
> > 
> > (CXL specific diversion)
> > 
> > Absolutely agree with this. There is a 'corner' that irritates me in the spec though,
> > which is that there is no distinction between shareable and shared capacity.
> > If we are in a constrained setup with limited HPA or DPA space, we may not want
> > to have separate DCD regions for these. Thus it is plausible that an orchestrator
> > might tell a memory appliance to present memory for general use and yet it
> > surfaces as shareable. So there may need to be an opt-in path, at least for
> > going ahead and using this memory as normal RAM.
> > 
> > > * Tags are mandatory for sharable allocations, and allowed but optional for
> > >   non-sharable allocations. The implication is that non-sharable allocations
> > >   may get onlined automatically as system-ram, so we don't need a namespace
> > >   for those. (I argued for mandatory tags on all allocations - hey, you don't
> > >   have to use them - but encountered objections and dropped it.)
> > > * CXL access control only goes to host root ports; CXL has no concept of
> > >   giving access to a VM. So some component on a host (perhaps logically
> > >   an orchestrator component) needs to plumb memory to VMs as appropriate.
> > 
> > Yes. It's some mashup of an orchestrator and VMM / libvirt / local library
> > of your choice. We can just group that into the ill-defined concept of
> > a distributed orchestrator.
> > 
> > > So tags are a namespace to find specific memory "allocations" (which in the
> > > CXL consortium we usually refer to as "tagged capacity").
> > > 
> > > In an orchestrated environment, the orchestrator would allocate resources
> > > (including tagged memory capacity), make that capacity visible on the right
> > > host(s), and then provide the tag when starting the app if needed.
> > > 
> > > If (e.g.) the memory contains a famfs file system, famfs needs the uuid of the
> > > root memory allocation to find the right memory device. Once mounted, it's a
> > > file system, so apps can be directed to the mount path. Apps that consume the
> > > dax devices directly also need the uuid because /dev/dax0.0 is not invariant
> > > across a cluster...
> > > 
> > > I have been assuming that when the CXL stack discovers a new DCD allocation,
> > > it will configure the devdax device and provide some way to find it by tag -
> > > /sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
> > > around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
> > 
> > Agreed. Whether that's a nice kernel-side thing, or a utility pulling data
> > from various kernel subsystem interfaces, doesn't really matter. I'd prefer
> > the kernel presents this, but maybe that won't work for some reason.
> > 
> > > General thoughts regarding VMs and qemu
> > > 
> > > Physical connections to CXL memory are handled by physical servers. I don't
> > > think there is a scenario in which a VM should interact directly with the
> > > PCIe function(s) of CXL devices. They will be configured as dax devices
> > > (findable by their tags!) by the host OS, and should be provided to VMs
> > > (when appropriate) as DAX devices. And software in a VM needs to be able to
> > > find the right DAX device the same way it would running on bare metal - by
> > > the tag.
> > 
> > Limiting to typical type 3 memory pool devices. Agreed. The other CXL device
> > types are a can of worms for another day.
> > 
> > > Qemu can already get memory from files (-object memory-backend-file,...), and
> > > I believe this works whether it's an actual file or a devdax device. So far,
> > > so good.
> > > 
> > > Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
> > > not a virtual devdax device. I think virtual devdax is needed as a first-class
> > > abstraction. If we can add the tag as a property of the memory-backend-file,
> > > we're almost there - we just need a way to look up a daxdev by tag.
> > 
> > I'm not sure that is simple. We'd need to define a new interface capable of:
> > 1) Hotplug - potentially of many separate regions (think nested VMs).
> >    That more or less rules out using separate devices on a discoverable hotpluggable
> >    bus. We'd run out of bus numbers too quickly if putting them on PCI.
> >    ACPI-style hotplug is worse because we have to provision slots at the outset.
> > 2) Runtime provision of metadata - performance data at the very least (bandwidth /
> >    latency etc). In theory we could wire up ACPI _HMA, but no one has ever bothered.
> > 3) Probably do want async error signaling. We 'could' do that with
> >    FW-first error injection - I'm not sure it's a good idea but it's definitely
> >    an option.
> > 
> > A locked-down CXL device is a bit more than that, but not very much more.
> > It's easy to fake registers for things that are always in one state so
> > that the software stack is happy.
> > 
> > virtio-mem has some of the parts and could perhaps be augmented
> > to support this use case, with the advantage of no implicit tie to CXL.
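For reference, a sketch of what the existing plumbing mentioned above looks like
from the host side: a devdax node (resolved from the tag, however that ends up
working) backing a virtual NVDIMM via memory-backend-file, so the guest sees the
capacity as pmem rather than as a first-class virtual devdax. The path, sizes and
ids below are made up, and there is currently no tag property on the backend -
that is exactly the gap being discussed.

#!/usr/bin/env python3
# Sketch of today's plumbing: back a virtual NVDIMM with a host devdax node
# via memory-backend-file, so the guest sees the tagged capacity as pmem.
# Path, sizes and ids are placeholders.
host_daxdev = "/dev/dax0.0"      # e.g. resolved from the tag as sketched earlier
size = "16G"                     # must match the devdax region size

qemu_cmd = [
    "qemu-system-x86_64",
    "-machine", "pc,nvdimm=on",
    "-m", "4G,slots=4,maxmem=36G",
    "-object", ("memory-backend-file,id=tagmem0,share=on,"
                f"mem-path={host_daxdev},size={size},align=2M"),
    "-device", "nvdimm,id=nv0,memdev=tagmem0",
    # ... disk, network, etc. elided ...
]
print(" ".join(qemu_cmd))        # in practice libvirt / the orchestrator would launch this

Everything after that point - finding the capacity by tag inside the guest - is
still up to guest software, which is why the summary below asks for the lookup to
work the same as on bare metal.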
> > 
> > > Summary thoughts:
> > > 
> > > * A mechanism for resolving tags to "tagged capacity" devdax devices is
> > >   essential (and I don't think there are specific proposals about this
> > >   mechanism so far).
> > 
> > Agreed.
> > 
> > > * Said mechanism should not be explicitly CXL-specific.
> > 
> > Somewhat agreed, but I don't want to invent a new spec just to avoid explicit
> > ties to CXL. I'm not against using CXL to present HBM / ACPI Specific Purpose
> > memory, for example, to a VM. It will trivially work if that is what a user
> > wants to do, and it also illustrates that this stuff doesn't necessarily just
> > apply to capacity on a memory pool - it might just be 'weird' memory on the host.
> > 
> > > * Finding a tagged capacity devdax device in a VM should work the same as it
> > >   does running on bare metal.
> > 
> > Absolutely - that's a requirement.
> > 
> > > * The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
> > 
> > Maybe. I'm not convinced the abstraction is needed at that particular level.
> > 
> > > * Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
> > >   points for being easy to implement in both physical and virtual systems.
> > 
> > For physical systems we aren't going to get agreement :( For the systems
> > I have visibility of there will be some diversity in hardware, but consistency
> > in the presentation to userspace and above should be doable.
> > 
> > Jonathan
> > 
> > > Thanks for teeing this up!
> > > John
> > > 
> > > [1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
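One last illustrative sketch, since it shows why the virtual devdax / virtio-mem
question above matters: with the pmem-style plumbing, the guest has to convert the
namespace back to devdax itself. Assuming ndctl and daxctl are present in the guest
and that namespace0.0 is the namespace backing the virtual NVDIMM (both assumptions),
that conversion looks roughly like:

#!/usr/bin/env python3
# Guest-side sketch: reconfigure the pmem namespace that backs the tagged
# capacity into devdax mode, then report the resulting dax character device.
# namespace0.0 is an assumption - adjust to the actual namespace.
import json
import subprocess

NAMESPACE = "namespace0.0"

# Reconfigure the existing namespace to devdax mode (destructive, hence -f).
subprocess.run(["ndctl", "create-namespace", "-f", "-e", NAMESPACE,
                "--mode=devdax"], check=True)

# List dax devices (JSON) and print their character device names.
out = subprocess.run(["daxctl", "list"], check=True,
                     capture_output=True, text=True).stdout
devs = json.loads(out)
if isinstance(devs, dict):
    devs = [devs]
for dev in devs:
    if dev.get("chardev"):
        print(dev["chardev"])    # e.g. dax0.0

Workable, but it relies on guessing namespace numbering, which is exactly the kind
of thing a tag-based lookup should replace.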