On 24/04/23 04:30PM, Amir Goldstein wrote:
> On Thu, Feb 29, 2024 at 2:20 AM John Groves <John@xxxxxxxxxx> wrote:
> >
> > John Groves, Micron
> >
> > Micron recently released the first RFC for famfs [1]. Although famfs is not
> > CXL-specific in any way, it aims to enable hosts to share data sets in shared
> > memory (such as CXL) by providing a memory-mappable fs-dax file system
> > interface to the memory.
> >
> > Sharable disaggregated memory already exists in the lab, and will be possible
> > in the wild soon. Famfs aims to do the following:
> >
> > * Provide an access method that provides isolation between files, and does not
> >   tempt developers to mmap all the memory writable on every host.
> > * Provide an access method that can be used by unmodified apps.
> >
> > Without something like famfs, enabling the use of sharable memory will involve
> > the temptation to do things that may destabilize systems, such as
> > mapping large shared, writable global memory ranges and hooking allocators to
> > use it (potentially sacrificing isolation), and forcing the same virtual
> > address ranges in every host/process (compromising security).
> >
> > The most obvious candidate app categories are data analytics and data lakes.
> > Both make heavy use of "zero-copy" data frames - column oriented data that
> > is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
> > categories are generally driven by python code that wrangles data into
> > appropriate data frames - making it straightforward to put the data frames
> > into famfs. Furthermore, these use cases usually involve the shared data being
> > read-only during computation or query jobs - meaning they are often free of
> > cache coherency concerns.
> >
> > Workloads such as these often deal with data sets that are too large to fit
> > in a single server's memory, so the data gets sharded - requiring movement via
> > a network.
> > Sharded apps also sometimes have to do expensive reshuffling -
> > moving data to nodes with available compute resources. Avoiding the sharding
> > overheads by accessing such data sets in disaggregated shared memory looks
> > promising to make better use of memory and compute resources, and by
> > effectively de-duplicating data sets in memory.
> >
> > About sharable memory
> >
> > * Shared memory is pmem-like, in that hosts will connect in order to access
> >   pre-existing contents
> > * Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
> > * CXL 3 provides for optionally-supported hardware-managed cache coherency
> > * But "multiple-readers, no writers" use cases don't need hardware support
> >   for coherency
> > * CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
> >   an allocator built in.
> > * When sharable capacity is allocated, each host that has access will see a
> >   /dev/dax device that can be found by the "tag" of the allocation. The tag is
> >   just a uuid.
> > * CXL 3.1 also allows the capacity associated with any allocated tag to be
> >   provided to each host (or host group) as either writable or read-only.
> >
> > About famfs
> >
> > Famfs is an append-only log-structured file system that places many limits
> > on what can be done. This allows famfs to tolerate clients with a stale copy
> > of metadata. All memory allocation and log maintenance is performed from user
> > space, but file extent lists are cached in the kernel for fast fault
> > resolution. The current limitations are fairly extreme, but many can be relaxed
> > by writing more code, managing Byzantine generals, etc. ;)
> >
> > A famfs-enabled kernel can be cloned at [3], and the user space repo can be
> > cloned at [4]. Even with major functional limitations in its current form
> > (e.g.
> > famfs does not currently support deleting files), it is sufficient to
> > use in data analytics workloads - in which you 1) create a famfs file system,
> > 2) dump data sets into it, 3) run clustered jobs that consume the shared data
> > sets, and 4) dismount and deallocate the memory containing the file system.
> >
> > Famfs Open Issues
> >
> > * Volatile CXL memory is exposed as character dax devices; the famfs patch
> >   set adds the iomap API, which is required for fs-dax but until now missing
> >   from character dax.
> > * (/dev/pmem devices are block, and support the iomap api for fs-dax file
> >   systems)
> > * /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
> >   devices cannot be converted to pmem mode.
> > * /dev/dax devices lack the iomap api that fs-dax uses with pmem, so the famfs
> >   patch set adds that.
> > * VFS layer hooks for a file system on a character device may be needed.
> > * Famfs has uncovered some previously latent bugs in the /dev/dax mmap
> >   machinery that probably require attention.
> > * Famfs currently works with either pmem or devdax devices, but our
> >   inclination is to drop pmem support to reduce the complexity of supporting
> >   two different underlying device types - particularly since famfs is not
> >   intended for actual pmem.
> >
> >
> > Required :-
> > Dan Williams
> > Christian Brauner
> > Jonathan Cameron
> > Dave Hansen
> >
> > [LSF/MM + BPF ATTEND]
> >
> > I am the author of the famfs file system. Famfs was first introduced at LPC
> > 2023 [2]. I'm also Micron's voting member on the Software and Systems Working
> > Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
> > specification.
> >
> >
> > References
> >
> > [1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@xxxxxxxxxx/#t
> > [2] https://lpc.events/event/17/contributions/1455/
> > [3] https://www.computeexpresslink.org/download-the-specification
> > [4] https://github.com/cxl-micron-reskit/famfs-linux
> >
>
> Hi John,
>
> Following our correspondence on your patch set [1], I am not sure that the
> details of famfs file system itself are an interesting topic for the
> LSFMM crowd??
> What I would like to do is schedule a session on:
> "Famfs: new userspace filesystem driver vs. improving FUSE/DAX"
>
> I am hoping that Miklos and Bernd will be able to participate in this
> session remotely.
>
> You see the last time that someone tried to introduce a specialized
> faster FUSE replacement [2], the comments from the community were
> that FUSE protocol can and should be improved instead of introducing
> another "filesystem in userspace" protocol.
>
> Since 2019, FUSE has gained virtiofs/dax support, it recently gained
> FUSE passthrough support and Bernd is working on FUSE uring [3].
>
> My hope is that you will be able to list the needed improvements
> to /dev/dax iomap and FUSE so that you could use the existing
> kernel infrastructure and FUSE libraries to implement famfs.
>
> How does that sound for a discussion?
>
> Thanks,
> Amir.
>
> [1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
> [2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@xxxxxxxxxx/
> [3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@xxxxxxx/

Amir,

That sounds good, thanks! I'll start preparing for it!

Re: [2]: I do think there are important ways that famfs is not "another
filesystem in user space protocol" - but I'll save it for the LSFMM session!

FYI famfs v2 patches will be going out before LSFMM (and possibly before
next week).

Thanks Amir,
John
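
For readers unfamiliar with the access pattern the proposal refers to (python
code consuming column-oriented data through a read-only MAP_SHARED mmap), here
is a minimal sketch. It is not famfs code and uses no famfs API: the temporary
file only stands in for a file on a shared-memory mount, and the helper names
write_column and sum_column_readonly are hypothetical. It assumes a Unix host,
since it relies on the MAP_SHARED and PROT_READ flags.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Store int64 values back to back, like one column of a data frame."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def sum_column_readonly(path):
    """Map the file MAP_SHARED and read-only, then sum the column in place.

    No copy of the data is made: the memoryview reads straight from the
    mapping, which is the "zero-copy" property the proposal mentions.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0,
                       flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) as m:
            # Views must be released before the mapping is closed, so both
            # are managed by the same with statement.
            with memoryview(m) as raw, raw.cast("q") as column:
                return sum(column)


if __name__ == "__main__":
    # Assumption: a temp file stands in for a file on a shared-memory mount.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "column.bin")
        write_column(path, range(1000))
        print(sum_column_readonly(path))
```

Because the mapping is created with PROT_READ, a consumer cannot modify the
data through it, which matches the "multiple-readers, no writers" case
described in the proposal.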