Miklos et al.:

This is the first in what will likely be a series of messages intended to organize a discussion of how famfs should integrate into fuse.

Background

Famfs [1] stands for fabric-attached memory file system. It is an fs-dax file system that supports scale-out access to disaggregated shared memory. Famfs, even as a standalone file system, manages metadata and performs allocation from user space, but file mapping metadata is fully cached in the kernel so that vma/mapping faults are handled in-kernel. This provides "full memory speed" for mmap/read/write.

Famfs was introduced at LPC '23 [2], released as a standalone file system patch series in 2024 (v1/v2) [3][4], discussed as a possible merge into fuse at LSFMM '24 [5][6], covered in an LPC '24 talk on shared memory [7], and discussed at some length in the fuse BOF at LPC '24.

Terminology

* fs/famfs - the standalone famfs kernel module [4]
* fs/fuse - the fuse kernel module(s)
* famfs_fused - the famfs fuse server/daemon (or any server that uses these new features)
* famfs userspace - the union of famfs_fused plus any other famfs components that live in user space

Adding famfs metadata to fuse

Porting famfs into fuse requires introducing two new classes of metadata into fs/fuse: file maps (aka fmaps - file to devdax extent lists) and devdax devices.

Caching fmaps in the kernel allows servicing vma/mapping faults in-kernel without an up-call. Fmaps resolve to devdax memory, and fs/fuse must get exclusive access to a devdax device before faults to it can be handled via an fmap. Fully caching fmaps for active files is an absolute requirement for famfs, in order to perform at "memory speeds".

Devdax devices

Famfs file systems reside on devdax devices; fs/fuse will need to get exclusive access to these devices, and use the dev_dax_iomap api to resolve mapping faults for famfs files. The first devdax device is the primary (device 0) and fs/fuse needs exclusive access starting at mount time.
This is the device where the primary superblock and metadata log are located. Although the superblock and log are read and written from user space, famfs exposes them as files (.meta/.superblock and .meta/.log). This approach avoids layering problems, as it avoids the need to concurrently access a devdax both raw (for the superblock and log) and through famfs (for regular file data); that can't work because the kernel module needs exclusive access to the devdax devices.

In addition, metadata log entries may add devdax devices that will be referenced by subsequent file entries in the log. We need a way to pass these into fs/fuse after mount.

File maps (fmaps)

Fmaps come in two flavors, and our design should assume that additional flavors might arise in the future. The two current flavors are:

* Fmap header with a variable-length list of simple extents
* Fmap header with a variable-length list of interleaved extents (each interleaved extent has a header and a variable-length list of "strip extents", which are described by the same simple extent structure as above)

Passing fmaps into the kernel requires packing the message reply in a sane way to transport the variable-sized simple extent list, or the compound variable-sized interleaved extent list. If fuse already has a packing pattern for putting variable-sized structures in reply messages, please point me to it.

Details on how fmaps actually work (skip if you don't care)

Note: the fs/famfs links to code on github are to a newer version of fs/famfs than has been posted to the lists.

In fs/famfs, fmaps are passed into the kernel via the FAMFSIOC_MAP_CREATE_V2:famfs_file_init_dax_v2() path [8]. This is a per-file ioctl. Relevant structures are at [9]. The relevant simple and interleaved structs come as a union after the common header in struct famfs_ioc_fmap.
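
To make the packing question concrete, here is a minimal sketch in C of one way the two fmap flavors could be laid out in a flat reply buffer: a fixed header followed by a variable-length payload. Every name in it (sketch_fmap_hdr, sketch_simple_ext, sketch_ileave_hdr, the chunk_size field, and the functions) is hypothetical and is not the famfs or fuse ABI. It also assumes both ends run on the same host, so structs are copied in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts for illustration only; not the famfs or fuse ABI. */
enum { SKETCH_SIMPLE = 1, SKETCH_INTERLEAVED = 2 };

struct sketch_fmap_hdr {
	uint32_t ext_type;   /* SKETCH_SIMPLE or SKETCH_INTERLEAVED */
	uint32_t nextents;   /* simple extents, or interleaved extents */
	uint64_t file_size;  /* logical file size in bytes */
};

struct sketch_simple_ext {
	uint64_t devindex;   /* daxdev index (0 is the primary device) */
	uint64_t offset;     /* byte offset on that daxdev */
	uint64_t len;        /* extent length in bytes */
};

struct sketch_ileave_hdr {
	uint64_t chunk_size; /* assumed field: bytes per strip before moving on */
	uint64_t nstrips;    /* number of strip extents following this header */
};

/* Append n bytes at *pos; 0 on success, -1 if the buffer is too small. */
static int sketch_put(uint8_t *buf, size_t cap, size_t *pos,
		      const void *src, size_t n)
{
	if (n > cap - *pos)
		return -1;
	if (n)
		memcpy(buf + *pos, src, n);
	*pos += n;
	return 0;
}

/* Pack header + simple extent list; returns bytes used, or -1. */
long sketch_pack_simple(uint8_t *buf, size_t cap, uint64_t file_size,
			const struct sketch_simple_ext *exts, uint32_t n)
{
	struct sketch_fmap_hdr hdr = { SKETCH_SIMPLE, n, file_size };
	size_t pos = 0;

	if (sketch_put(buf, cap, &pos, &hdr, sizeof(hdr)) ||
	    sketch_put(buf, cap, &pos, exts, (size_t)n * sizeof(*exts)))
		return -1;
	return (long)pos;
}

/* Pack header + one interleaved extent (its header, then its strips). */
long sketch_pack_interleaved(uint8_t *buf, size_t cap, uint64_t file_size,
			     uint64_t chunk_size,
			     const struct sketch_simple_ext *strips, uint32_t n)
{
	struct sketch_fmap_hdr hdr = { SKETCH_INTERLEAVED, 1, file_size };
	struct sketch_ileave_hdr ih = { chunk_size, n };
	size_t pos = 0;

	if (sketch_put(buf, cap, &pos, &hdr, sizeof(hdr)) ||
	    sketch_put(buf, cap, &pos, &ih, sizeof(ih)) ||
	    sketch_put(buf, cap, &pos, strips, (size_t)n * sizeof(*strips)))
		return -1;
	return (long)pos;
}

/* Receiver side: validate lengths, return total simple/strip extents or -1. */
long sketch_count_extents(const uint8_t *buf, size_t len)
{
	const size_t esz = sizeof(struct sketch_simple_ext);
	struct sketch_fmap_hdr hdr;
	size_t pos = sizeof(hdr);
	long total = 0;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));

	if (hdr.ext_type == SKETCH_SIMPLE) {
		if ((len - pos) % esz || (len - pos) / esz != hdr.nextents)
			return -1;
		return (long)hdr.nextents;
	}
	if (hdr.ext_type != SKETCH_INTERLEAVED)
		return -1;

	for (uint32_t i = 0; i < hdr.nextents; i++) {
		struct sketch_ileave_hdr ih;

		if (len - pos < sizeof(ih))
			return -1;
		memcpy(&ih, buf + pos, sizeof(ih));
		pos += sizeof(ih);
		if (ih.nstrips > (len - pos) / esz)
			return -1;
		pos += (size_t)ih.nstrips * esz;
		total += (long)ih.nstrips;
	}
	return pos == len ? total : -1;
}
```

In this layout the only bound on extent or strip counts is the total buffer size; the receiver checks every count against the bytes actually present before trusting it.
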
With simple extents, struct famfs_ioc_fmap is retrieved via copy_from_user(), and then the famfs_ioc_simple_extent array is retrieved based on famfs_ioc_fmap->fioc_nextents from the header struct. See the SIMPLE_DAX_EXTENT case in famfs_meta_alloc_v2() at [10].

Interleaved extents are a bit more complicated, but they are reasonably well-documented in a big comment if you scroll up from [9]. The number of interleaved extents is in famfs_ioc_fmap->fioc_niext, but each interleaved extent has its own strip count for its array of strip extents (the strip count is famfs_ioc_interleaved_ext->ie_nstrips).

In fuse, we need to put this amalgamation into a variable-sized message payload. I think it's important for the method of serialization into messages not to apply arbitrary limits to extent or strip counts - although a total size limit of no less than 4K might be okay. I eagerly await suggestions from Miklos or others as to how best to do this, but I won't go further down this rathole now ;)

Using fmaps to service faults

Of course the ABI-constrained interchange format (famfs_ioc_fmap) isn't optimal for in-memory metadata (which should be able to evolve without breaking the ABI), so it's transmogrified into struct famfs_file_meta in fs/famfs, which contains a union of extent types (see [11]).

The basic dax/iomap fault handling code uses these structs to translate a file offset to an offset on a daxdev. A simple extent fault is handled in fs/famfs by famfs_meta_to_dax_offset() ([12]). If the extent type is INTERLEAVED_EXTENT, that function calls famfs_meta_to_dax_offset_v2() (scroll up from [12]), which resolves a mapping fault via an interleaved extent.

This fault handling code will need to migrate into fuse_famfs.c (or fuse_dax_iomap.c, whatever...) in order to handle file mapping faults efficiently. Because we're enabling memory here - it must run at memory speeds.

So how does the new metadata integrate into fs/fuse?
One answer would be to attach a famfs flag to files at lookup time, and have fs/fuse send a new message to famfs_fused to retrieve the fmap (with the recipe for serialization of the fmap into the reply message being TBD as of now). Stefan suggested this new message when we spoke in early October. That strikes me as more practical than putting the fmap thingy into the lookup reply "if needed" - but that is also an option if it's not somehow impractical or fubar.

OK, what about multiple devdax devices?

The first devdax device is special, in that it's needed at mount time. The current fs/famfs gets exclusive access to the root daxdev in famfs_get_tree(), by calling fs_dax_get(...&holder_ops). Note that fs_dax_get() is patched in by the dev_dax_iomap portion of the fs/famfs patch set [4], and will be needed in the famfs-fuse patch set as well, as the iomap api has not previously been available for devdax devices. I haven't started tackling how this will be hooked into fs/fuse, but it is a new thing for fs/fuse.

If we pretend that's covered, the next issue arises when fs/fuse looks up a famfs file that references a devdax device that hasn't been referenced previously. The first such occurrence will require providing daxdev(0) to fs/fuse, and additional instances will require providing daxdevs 1 onward. In user space, daxdevs are known by their uuids, but they are known to fmaps by indices (0 being the primary device; uuid/index mappings will be managed by the famfs userspace).

It seems straightforward for fs/fuse - when it encounters an fmap that references an as-yet-unknown daxdev index - to send a new message to famfs_fused requesting that daxdev info by index, and receive the info in the reply. This approach would avoid the need for famfs_fused to be stateful wrt which daxdevs are known to fs/fuse.
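
The on-demand lookup described above can be sketched as a small table indexed by daxdev index. This is a minimal user-space illustration, not kernel code: sketch_daxdev_get() stands in for the fs/fuse side, the fetch callback stands in for the proposed request/reply message, and sketch_server_lookup() stands in for famfs_fused. All names, the table size, and the uuid strings are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_DAXDEVS 8 /* arbitrary cap, for the sketch only */

struct sketch_daxdev {
	int  valid;    /* nonzero once the slot has been filled */
	char uuid[40]; /* identity reported by the server */
};

/* Stand-in for the new message: "give me daxdev info for this index". */
typedef int (*sketch_fetch_fn)(uint32_t index, struct sketch_daxdev *out);

struct sketch_daxdev_table {
	struct sketch_daxdev devs[SKETCH_MAX_DAXDEVS];
	sketch_fetch_fn fetch;
	unsigned int nfetches; /* how many up-calls were needed */
};

/*
 * Resolve a daxdev index referenced by an fmap. A known index is served
 * from the table; an unknown one triggers exactly one fetch, and the
 * result is cached for later lookups.
 */
const struct sketch_daxdev *
sketch_daxdev_get(struct sketch_daxdev_table *t, uint32_t index)
{
	if (index >= SKETCH_MAX_DAXDEVS)
		return NULL;

	if (!t->devs[index].valid) {
		struct sketch_daxdev d = { 0 };

		if (t->fetch(index, &d) != 0)
			return NULL;
		d.valid = 1;
		t->devs[index] = d;
		t->nfetches++;
	}
	return &t->devs[index];
}

/* Hypothetical server side: stateless, answers purely from its own list. */
static const char *const sketch_server_uuids[] = {
	"uuid-of-primary-daxdev",
	"uuid-of-second-daxdev",
};

int sketch_server_lookup(uint32_t index, struct sketch_daxdev *out)
{
	size_t n = sizeof(sketch_server_uuids) / sizeof(sketch_server_uuids[0]);

	if (index >= n)
		return -1;
	snprintf(out->uuid, sizeof(out->uuid), "%s", sketch_server_uuids[index]);
	return 0;
}
```

The server answers each request from its own uuid/index mapping and keeps no record of what the kernel has already asked for. A real implementation would also need locking around the table and would have to take exclusive access to the device before marking a slot valid.
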
There are other ways this could be handled, and I'm open to input here - but I claim it won't work to require that the full daxdev list be populated at mount time - because adding them later is a very sane use case.

I hope I've done a decent job of framing the initial design problems. I'm happy to answer questions if anything isn't clear - and I'm looking forward to your feedback - the more specific the better.

Thanks,
John

[1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
[2] https://lpc.events/event/17/contributions/1455/
[3] https://lore.kernel.org/all/8bd01ff0-235f-4aa8-883d-4b71b505b74d@xxxxxxxxxxxxx/T/#m27639915e97443186b3ade9d1e94423bc58e6e22
[4] https://lore.kernel.org/linux-cxl/20240430-badeverbot-paletten-05442cfbbdf0@brauner/T/#mb75fb6522045dca2000d854cfa30de4006a96817
[5] https://www.youtube.com/watch?v=nMaZhXJJgmU
[6] https://lwn.net/Articles/983105/
[7] https://lpc.events/event/18/contributions/1827/
[8] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_file.c#L482
[9] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/include/uapi/linux/famfs_ioctl.h#L102
[10] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_file.c#L296
[11] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_internal.h#L18
[12] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_file.c#L772
[13] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_inode.c#L252