On 24/08/15 05:22PM, Jonathan Cameron wrote:
> Introduction
> ============
>
> If we think application specific memory (including inter-host shared memory) is
> a thing, it will also be a thing people want to use with virtual machines,
> potentially nested. So how do we present it at the Host to VM boundary?
>
> This RFC is perhaps premature given we haven't yet merged upstream support for
> the bare metal case. However I'd like to get the discussion going given we've
> touched briefly on this in a number of CXL sync calls and it is clear no one is

Excellent write-up, thanks Jonathan.

Hannes' suggestion of an in-person discussion at LPC is a great idea -
count me in.

As the proprietor of famfs [1] I have many thoughts.

First, I like the concept of application-specific memory (ASM), but I
wonder if there might be a better term for it. ASM suggests that there is
one application, but I'd suggest that a more precise statement of the
concept is that the Linux kernel never accesses or mutates the memory -
even though multiple apps might share it (e.g. via famfs). It's a subtle
point, but an important one for RAS etc. ASM might better be called
non-kernel-managed memory - though that name does not have as good a ring
to it. Will mull this over further...

Now a few level-setting comments on CXL and Dynamic Capacity Devices
(DCDs), some of which will be obvious to many of you:

* A DCD is just a memory device with an allocator and host-level
  access-control built in.
* Usable memory from a DCD is not available until the fabric manager
  (likely on behalf of an orchestrator) performs an Initiate Dynamic
  Capacity Add command to the DCD.
* A DCD allocation has a tag (uuid) which is the invariant way of
  identifying the memory from that allocation.
* The tag becomes known to the host from the DCD extents provided via
  a CXL event following successful allocation.
* The memory associated with a tagged allocation will surface as a dax
  device on each host that has access to it.
But of course dax device naming & numbering won't be consistent across
separate hosts - so we need to use the uuids to find specific memory.

A few less foundational observations:

* It does not make sense to "online" shared or sharable memory as
  system-ram, because system-ram gets zeroed, which blows up use cases
  for sharable memory. So the default for sharable memory must be devdax
  mode.
* Tags are mandatory for sharable allocations, and allowed but optional
  for non-sharable allocations. The implication is that non-sharable
  allocations may get onlined automatically as system-ram, so we don't
  need a namespace for those. (I argued for mandatory tags on all
  allocations - hey you don't have to use them - but encountered
  objections and dropped it.)
* CXL access control only goes to host root ports; CXL has no concept of
  giving access to a VM. So some component on a host (perhaps logically
  an orchestrator component) needs to plumb memory to VMs as appropriate.

So tags are a namespace to find specific memory "allocations" (which in
the CXL consortium, we usually refer to as "tagged capacity").

In an orchestrated environment, the orchestrator would allocate resources
(including tagged memory capacity), make that capacity visible on the
right host(s), and then provide the tag when starting the app if needed.

If (e.g.) the memory contains a famfs file system, famfs needs the uuid of
the root memory allocation to find the right memory device. Once mounted,
it's a file system so apps can be directed to the mount path. Apps that
consume the dax devices directly also need the uuid because /dev/dax0.0 is
not invariant across a cluster...

I have been assuming that when the CXL stack discovers a new DCD
allocation, it will configure the devdax device and provide some way to
find it by tag. /sys/cxl/<tag>/dev or whatever. That works as far as it
goes, but I'm coming around to thinking that the uuid-to-dax map should
not be overtly CXL-specific.
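
To make the shape of a non-CXL-specific lookup a bit more concrete, here
is a rough sketch of the user-space side. It is only an illustration:
the directory layout (one directory per dax device, each holding a
"tag" file with the allocation uuid), the function name
find_dax_by_tag() and the "root" argument are all made up for this
example. No such kernel interface exists or has been proposed here.

```python
import os
import uuid


def find_dax_by_tag(tag, root):
    """Return the name of the dax device under `root` whose tag matches, or None.

    ASSUMPTION: each device is a directory under `root` (e.g. "dax0.0")
    containing a text file named "tag" that holds the allocation uuid.
    This layout is hypothetical; no such kernel interface is implied.
    """
    wanted = uuid.UUID(tag)  # normalizes case and formatting
    for name in sorted(os.listdir(root)):
        tag_path = os.path.join(root, name, "tag")
        try:
            with open(tag_path) as f:
                found = uuid.UUID(f.read().strip())
        except (OSError, ValueError):
            continue  # untagged device, or contents are not a valid uuid
        if found == wanted:
            return name
    return None
```

The point is that the caller only ever supplies the tag. Nothing in the
lookup cares whether the device came from CXL, and the same code would
work unchanged on bare metal or inside a VM as long as the tag is
exposed in the same place.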

General thoughts regarding VMs and qemu

Physical connections to CXL memory are handled by physical servers. I
don't think there is a scenario in which a VM should interact directly
with the PCIe function(s) of CXL devices. They will be configured as dax
devices (findable by their tags!) by the host OS, and should be provided
to VMs (when appropriate) as DAX devices. And software in a VM needs to be
able to find the right DAX device the same way it would running on bare
metal - by the tag.

Qemu can already get memory from files (-object memory-backend-file,...),
and I believe this works whether it's an actual file or a devdax device.
So far, so good.

Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
not a virtual devdax device. I think virtual devdax is needed as a
first-class abstraction. If we can add the tag as a property of the
memory-backend-file, we're almost there - we just need a way to look up a
daxdev by tag.

Summary thoughts:

* A mechanism for resolving tags to "tagged capacity" devdax devices is
  essential (and I don't think there are specific proposals about this
  mechanism so far).
* Said mechanism should not be explicitly CXL-specific.
* Finding a tagged capacity devdax device in a VM should work the same
  as it does running on bare metal.
* The file-backed (and devdax-backed) devdax abstraction is needed in
  qemu.
* Beyond that, I'm not yet sure what the lookup mechanism should be.
  Extra points for being easy to implement in both physical and virtual
  systems.

Thanks for teeing this up!
John

[1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md