On Thu, Feb 29, 2024 at 2:20 AM John Groves <John@xxxxxxxxxx> wrote:
>
> John Groves, Micron
>
> Micron recently released the first RFC for famfs [1]. Although famfs is not
> CXL-specific in any way, it aims to enable hosts to share data sets in shared
> memory (such as CXL) by providing a memory-mappable fs-dax file system
> interface to the memory.
>
> Sharable disaggregated memory already exists in the lab, and will be possible
> in the wild soon. Famfs aims to do the following:
>
> * Provide an access method that provides isolation between files, and does not
>   tempt developers to mmap all the memory writable on every host.
> * Provide an access method that can be used by unmodified apps.
>
> Without something like famfs, enabling the use of sharable memory will involve
> the temptation to do things that may destabilize systems, such as
> mapping large shared, writable global memory ranges and hooking allocators to
> use it (potentially sacrificing isolation), and forcing the same virtual
> address ranges in every host/process (compromising security).
>
> The most obvious candidate app categories are data analytics and data lakes.
> Both make heavy use of "zero-copy" data frames - column oriented data that
> is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
> categories are generally driven by python code that wrangles data into
> appropriate data frames - making it straightforward to put the data frames
> into famfs. Furthermore, these use cases usually involve the shared data being
> read-only during computation or query jobs - meaning they are often free of
> cache coherency concerns.
>
> Workloads such as these often deal with data sets that are too large to fit
> in a single server's memory, so the data gets sharded - requiring movement via
> a network. Sharded apps also sometimes have to do expensive reshuffling -
> moving data to nodes with available compute resources.
> Avoiding the sharding overheads by accessing such data sets in disaggregated
> shared memory looks promising for making better use of memory and compute
> resources, and for effectively de-duplicating data sets in memory.
>
> About sharable memory
>
> * Shared memory is pmem-like, in that hosts will connect in order to access
>   pre-existing contents
> * Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
> * CXL 3 provides for optionally-supported hardware-managed cache coherency
> * But "multiple-readers, no writers" use cases don't need hardware support
>   for coherency
> * CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
>   an allocator built in.
> * When sharable capacity is allocated, each host that has access will see a
>   /dev/dax device that can be found by the "tag" of the allocation. The tag is
>   just a uuid.
> * CXL 3.1 also allows the capacity associated with any allocated tag to be
>   provided to each host (or host group) as either writable or read-only.
>
> About famfs
>
> Famfs is an append-only log-structured file system that places many limits
> on what can be done. This allows famfs to tolerate clients with a stale copy
> of metadata. All memory allocation and log maintenance is performed from user
> space, but file extent lists are cached in the kernel for fast fault
> resolution. The current limitations are fairly extreme, but many can be relaxed
> by writing more code, managing Byzantine generals, etc. ;)
>
> A famfs-enabled kernel can be cloned at [3], and the user space repo can be
> cloned at [4]. Even with major functional limitations in its current form
> (e.g. famfs does not currently support deleting files), it is sufficient to
> use in data analytics workloads - in which you 1) create a famfs file system,
> 2) dump data sets into it, 3) run clustered jobs that consume the shared data
> sets, and 4) dismount and deallocate the memory containing the file system.
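As an illustration of the append-only log and cached extent lists described
in the quoted text above, here is a minimal Python sketch. It is not famfs
code and does not reflect the real famfs log format or kernel interfaces;
the names Extent, ExtentCache, replay and resolve are hypothetical. It
assumes that each log entry creates one file with a fixed extent list, and
it shows why a client holding a stale copy of the log still sees a
consistent (if incomplete) view: entries are only ever appended, so
catching up means replaying the tail.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    file_offset: int  # byte offset within the file
    dev_offset: int   # byte offset within the backing dax device
    length: int       # extent length in bytes


class ExtentCache:
    """Hypothetical stand-in for cached per-file extent lists."""

    def __init__(self):
        self.files = {}   # file name -> extents sorted by file_offset
        self.applied = 0  # number of log entries replayed so far

    def replay(self, log):
        # The log is append-only (an assumption of this sketch): entries
        # already applied never change, so only the tail needs replaying.
        for name, extents in log[self.applied:]:
            self.files[name] = sorted(extents, key=lambda e: e.file_offset)
        self.applied = len(log)

    def resolve(self, name, offset):
        # Map a file offset to a device offset, as a fault handler might.
        extents = self.files[name]
        starts = [e.file_offset for e in extents]
        i = bisect_right(starts, offset) - 1
        if i < 0 or offset >= extents[i].file_offset + extents[i].length:
            raise ValueError("offset is not backed by any extent")
        e = extents[i]
        return e.dev_offset + (offset - e.file_offset)


# One file made of two extents that are not contiguous on the device.
log = [("frames.bin", [Extent(0x0000, 0x200000, 0x1000),
                       Extent(0x1000, 0x400000, 0x1000)])]
cache = ExtentCache()
cache.replay(log)
first_lookup = cache.resolve("frames.bin", 0x1800)

# Another host appends a file; this client is stale until it replays.
log.append(("index.bin", [Extent(0x0000, 0x600000, 0x2000)]))
stale_sees_new_file = "index.bin" in cache.files
cache.replay(log)
second_lookup = cache.resolve("index.bin", 0x10)
```

In this sketch a stale client can still resolve every file it already knows
about, because nothing in the applied prefix of the log is ever rewritten.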
>
> Famfs Open Issues
>
> * Volatile CXL memory is exposed as character dax devices; the famfs patch
>   set adds the iomap API, which is required for fs-dax but until now missing
>   from character dax.
> * (/dev/pmem devices are block, and support the iomap api for fs-dax file
>   systems)
> * /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
>   devices cannot be converted to pmem mode.
> * /dev/dax devices lack the iomap api that fs-dax uses with pmem, so the famfs
>   patch set adds that.
> * VFS layer hooks for a file system on a character device may be needed.
> * Famfs has uncovered some previously latent bugs in the /dev/dax mmap
>   machinery that probably require attention.
> * Famfs currently works with either pmem or devdax devices, but our
>   inclination is to drop pmem support to reduce the complexity of supporting
>   two different underlying device types - particularly since famfs is not
>   intended for actual pmem.
>
>
> Required:
> Dan Williams
> Christian Brauner
> Jonathan Cameron
> Dave Hansen
>
> [LSF/MM + BPF ATTEND]
>
> I am the author of the famfs file system. Famfs was first introduced at LPC
> 2023 [2]. I'm also Micron's voting member on the Software and Systems Working
> Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
> specification.
>
>
> References
>
> [1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@xxxxxxxxxx/#t
> [2] https://lpc.events/event/17/contributions/1455/
> [3] https://www.computeexpresslink.org/download-the-specification
> [4] https://github.com/cxl-micron-reskit/famfs-linux
>

Hi John,

Following our correspondence on your patch set [1], I am not sure that
the details of the famfs file system itself are an interesting topic for
the LSFMM crowd.

What I would like to do is schedule a session on:
"Famfs: new userspace filesystem driver vs. improving FUSE/DAX"

I am hoping that Miklos and Bernd will be able to participate in this
session remotely.
You see, the last time someone tried to introduce a specialized, faster
FUSE replacement [2], the comments from the community were that the FUSE
protocol can and should be improved instead of introducing another
"filesystem in userspace" protocol.

Since 2019, FUSE has gained virtiofs/dax support and, more recently,
FUSE passthrough support, and Bernd is working on FUSE uring [3].

My hope is that you will be able to list the needed improvements
to /dev/dax iomap and FUSE so that you could use the existing
kernel infrastructure and FUSE libraries to implement famfs.

How does that sound for a discussion?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
[2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@xxxxxxxxxx/
[3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@xxxxxxx/