Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]

On 24/04/23 04:30PM, Amir Goldstein wrote:
> On Thu, Feb 29, 2024 at 2:20 AM John Groves <John@xxxxxxxxxx> wrote:
> >
> > John Groves, Micron
> >
> > Micron recently released the first RFC for famfs [1]. Although famfs is not
> > CXL-specific in any way, it aims to enable hosts to share data sets in shared
> > memory (such as CXL) by providing a memory-mappable fs-dax file system
> > interface to the memory.
> >
> > Sharable disaggregated memory already exists in the lab, and will be possible
> > in the wild soon. Famfs aims to do the following:
> >
> > * Provide an access method that provides isolation between files, and does not
> >   tempt developers to mmap all the memory writable on every host.
> > * Provide an access method that can be used by unmodified apps.
> >
> > Without something like famfs, enabling the use of sharable memory will involve
> > the temptation to do things that may destabilize systems, such as
> > mapping large shared, writable global memory ranges and hooking allocators to
> > use it (potentially sacrificing isolation), and forcing the same virtual
> > address ranges in every host/process (compromising security).
> >
> > The most obvious candidate app categories are data analytics and data lakes.
> > Both make heavy use of "zero-copy" data frames - column oriented data that
> > is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
> > categories are generally driven by python code that wrangles data into
> > appropriate data frames - making it straightforward to put the data frames
> > into famfs. Furthermore, these use cases usually involve the shared data being
> > read-only during computation or query jobs - meaning they are often free of
> > cache coherency concerns.
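
As a rough illustration of the read-only, zero-copy access pattern described above (this is not code from the famfs repos): a minimal Python sketch that maps a file of int64 values with MAP_SHARED and PROT_READ and reads it in place. The flat file layout and the names write_column and sum_column are assumptions made for this example only; the same code should apply to any memory-mappable file, which is presumably what "unmodified apps" would see on a famfs mount.

```python
import mmap
import os
import struct
import tempfile


def write_column(path, values):
    """Write a column of native-endian int64 values to a flat file (assumed layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}q", *values))


def sum_column(path):
    """Map the file shared and read-only, then sum the column in place."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; PROT_READ means this process cannot
        # modify the shared data through the mapping.
        with mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ) as m:
            # The memoryview reinterprets the mapped bytes as int64 values
            # without copying the underlying buffer.
            with memoryview(m) as raw, raw.cast("q") as column:
                return sum(column)


if __name__ == "__main__":
    # Hypothetical stand-in for a file on a shared-memory file system.
    demo_path = os.path.join(tempfile.mkdtemp(), "column.bin")
    write_column(demo_path, [1, 2, 3, 40])
    print(sum_column(demo_path))
```

Every reader process (on any host that can see the file) would run only sum_column; nothing in the reader needs to know what kind of memory backs the mapping.
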
> >
> > Workloads such as these often deal with data sets that are too large to fit
> > in a single server's memory, so the data gets sharded - requiring movement via
> > a network. Sharded apps also sometimes have to do expensive reshuffling -
> > moving data to nodes with available compute resources. Avoiding the sharding
> > overheads by accessing such data sets in disaggregated shared memory looks
> > promising: it makes better use of memory and compute resources, and it
> > effectively de-duplicates data sets in memory.
> >
> > About sharable memory
> >
> > * Shared memory is pmem-like, in that hosts will connect in order to access
> >   pre-existing contents
> > * Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
> > * CXL 3 provides for optionally-supported hardware-managed cache coherency
> > * But "multiple-readers, no writers" use cases don't need hardware support
> >   for coherency
> > * CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
> >   an allocator built in.
> > * When sharable capacity is allocated, each host that has access will see a
> >   /dev/dax device that can be found by the "tag" of the allocation. The tag is
> >   just a uuid.
> > * CXL 3.1 also allows the capacity associated with any allocated tag to be
> >   provided to each host (or host group) as either writable or read-only.
> >
> > About famfs
> >
> > Famfs is an append-only log-structured file system that places many limits
> > on what can be done. This allows famfs to tolerate clients with a stale copy
> > of metadata. All memory allocation and log maintenance is performed from user
> > space, but file extent lists are cached in the kernel for fast fault
> > resolution. The current limitations are fairly extreme, but many can be relaxed
> > by writing more code, managing Byzantine generals, etc. ;)
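
To make the "stale copy of metadata" point concrete, here is a toy model in Python. It is a sketch under stated assumptions, not the famfs log format: LogEntry, SharedLog and Client are hypothetical names, and the "log" is just an in-process list. The idea it shows is that when entries are only ever appended and never changed or removed, a client that has replayed only a prefix of the log holds an incomplete view, but nothing in that view is wrong.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogEntry:
    """One hypothetical 'file created' record: a path and a single extent."""
    path: str
    offset: int
    length: int


@dataclass
class SharedLog:
    """Append-only log; entries are never modified or removed."""
    entries: list = field(default_factory=list)

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries)


@dataclass
class Client:
    """A host that caches file extents by replaying the log."""
    files: dict = field(default_factory=dict)
    replayed: int = 0  # number of log entries already applied

    def catch_up(self, log):
        """Apply only the entries appended since the last replay."""
        new_entries = log.entries[self.replayed:]
        for entry in new_entries:
            self.files[entry.path] = (entry.offset, entry.length)
        self.replayed += len(new_entries)
        return len(new_entries)


if __name__ == "__main__":
    log = SharedLog()
    client = Client()
    log.append(LogEntry("frames/a", 0, 4096))
    client.catch_up(log)
    log.append(LogEntry("frames/b", 4096, 8192))
    # The client is now stale: it lacks frames/b, but its extent for
    # frames/a is still valid because the log only grows.
    print(client.files)
    client.catch_up(log)
    print(client.files)
```

Relaxing the limits (deletes, truncation, multiple writers) is where a model like this stops being sufficient, since a stale prefix could then describe extents that are no longer valid.
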
> >
> > A famfs-enabled kernel can be cloned at [3], and the user space repo can be
> > cloned at [4]. Even with major functional limitations in its current form
> > (e.g. famfs does not currently support deleting files), it is sufficient to
> > use in data analytics workloads - in which you 1) create a famfs file system,
> > 2) dump data sets into it, 3) run clustered jobs that consume the shared data
> > sets, and 4) dismount and deallocate the memory containing the file system.
> >
> > Famfs Open Issues
> >
> > * Volatile CXL memory is exposed as character dax devices; the famfs patch
> >   set adds the iomap API, which is required for fs-dax but until now missing
> >   from character dax.
> > * (/dev/pmem devices are block, and support the iomap api for fs-dax file
> >   systems)
> > * /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
> >   devices cannot be converted to pmem mode.
> > * /dev/dax devices lack the iomap api that fs-dax uses with pmem, so the famfs
> >   patch set adds that.
> > * VFS layer hooks for a file system on a character device may be needed.
> > * Famfs has uncovered some previously latent bugs in the /dev/dax mmap
> >   machinery that probably require attention.
> > * Famfs currently works with either pmem or devdax devices, but our
> >   inclination is to drop pmem support to reduce the complexity of supporting
> >   two different underlying device types - particularly since famfs is not
> >   intended for actual pmem.
> >
> >
> > Required :-
> > Dan Williams
> > Christian Brauner
> > Jonathan Cameron
> > Dave Hansen
> >
> > [LSF/MM + BPF ATTEND]
> >
> > I am the author of the famfs file system. Famfs was first introduced at LPC
> > 2023 [2]. I'm also Micron's voting member on the Software and Systems Working
> > Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
> > specification.
> >
> >
> > References
> >
> > [1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@xxxxxxxxxx/#t
> > [2] https://lpc.events/event/17/contributions/1455/
> > [3] https://www.computeexpresslink.org/download-the-specification
> > [4] https://github.com/cxl-micron-reskit/famfs-linux
> >
> 
> Hi John,
> 
> Following our correspondence on your patch set [1], I am not sure that the
> details of the famfs file system itself are an interesting topic for the
> LSFMM crowd??
> What I would like to do is schedule a session on:
> "Famfs: new userspace filesystem driver vs. improving FUSE/DAX"
> 
> I am hoping that Miklos and Bernd will be able to participate in this
> session remotely.
> 
> You see, the last time that someone tried to introduce a specialized
> faster FUSE replacement [2], the comments from the community were
> that the FUSE protocol can and should be improved instead of introducing
> another "filesystem in userspace" protocol.
> 
> Since 2019, FUSE has gained virtiofs/dax support; it recently gained
> FUSE passthrough support, and Bernd is working on FUSE uring [3].
> 
> My hope is that you will be able to list the needed improvements
> to /dev/dax iomap and FUSE so that you could use the existing
> kernel infrastructure and FUSE libraries to implement famfs.
> 
> How does that sound for a discussion?
> 
> Thanks,
> Amir.
> 
> [1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
> [2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@xxxxxxxxxx/
> [3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@xxxxxxx/

Amir,

That sounds good, thanks! I'll start preparing for it!

Re: [2]: I do think there are important ways that famfs is not "another 
filesystem in user space protocol" - but I'll save it for the LSFMM session!

FYI famfs v2 patches will be going out before LSFMM (and possibly before
next week).

Thanks Amir,
John




