Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]

On Thu, Feb 29, 2024 at 2:20 AM John Groves <John@xxxxxxxxxx> wrote:
>
> John Groves, Micron
>
> Micron recently released the first RFC for famfs [1]. Although famfs is not
> CXL-specific in any way, it aims to enable hosts to share data sets in shared
> memory (such as CXL) by providing a memory-mappable fs-dax file system
> interface to the memory.
>
> Sharable disaggregated memory already exists in the lab, and will be possible
> in the wild soon. Famfs aims to do the following:
>
> * Provide an access method that provides isolation between files, and does not
>   tempt developers to mmap all the memory writable on every host.
> * Provide an access method that can be used by unmodified apps.
>
> Without something like famfs, enabling the use of sharable memory will involve
> the temptation to do things that may destabilize systems, such as
> mapping large shared, writable global memory ranges and hooking allocators to
> use it (potentially sacrificing isolation), and forcing the same virtual
> address ranges in every host/process (compromising security).
>
> The most obvious candidate app categories are data analytics and data lakes.
> Both make heavy use of "zero-copy" data frames - column oriented data that
> is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
> categories are generally driven by python code that wrangles data into
> appropriate data frames - making it straightforward to put the data frames
> into famfs. Furthermore, these use cases usually involve the shared data being
> read-only during computation or query jobs - meaning they are often free of
> cache coherency concerns.
>
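
To make the zero-copy point above concrete, here is a minimal sketch of a
consumer reading a column through a read-only MAP_SHARED mapping. The file
layout (a flat array of native int64 values) and the helper names
write_column and sum_column are assumptions made for illustration; they are
not part of famfs or of any data frame library. On a famfs mount the path
would simply point at a file backed by the shared memory.

```python
import array
import mmap


def write_column(path, values):
    """Producer side: lay the column out as a flat native int64 array (assumed layout)."""
    with open(path, "wb") as f:
        array.array("q", values).tofile(f)


def sum_column(path):
    """Consumer side: map the file read-only and shared, then read it in place."""
    with open(path, "rb") as f:
        # PROT_READ means this process cannot write through the mapping at all.
        mm = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    try:
        # The memoryview casts reinterpret the mapped bytes without copying them.
        with memoryview(mm) as raw, raw.cast("q") as column:
            return sum(column)
    finally:
        mm.close()
```

Because every reader maps the same bytes read-only, nothing in this path
writes to the shared data during the computation.
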
> Workloads such as these often deal with data sets that are too large to fit
> in a single server's memory, so the data gets sharded - requiring movement via
> a network. Sharded apps also sometimes have to do expensive reshuffling -
> moving data to nodes with available compute resources. Avoiding the sharding
> overheads by accessing such data sets in disaggregated shared memory looks
> promising: it makes better use of memory and compute resources and
> effectively de-duplicates data sets in memory.
>
> About sharable memory
>
> * Shared memory is pmem-like, in that hosts will connect in order to access
>   pre-existing contents
> * Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
> * CXL 3 provides for optionally-supported hardware-managed cache coherency
> * But "multiple-readers, no writers" use cases don't need hardware support
>   for coherency
> * CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
>   an allocator built in.
> * When sharable capacity is allocated, each host that has access will see a
>   /dev/dax device that can be found by the "tag" of the allocation. The tag is
>   just a uuid.
> * CXL 3.1 also allows the capacity associated with any allocated tag to be
>   provided to each host (or host group) as either writable or read-only.
>
> About famfs
>
> Famfs is an append-only log-structured file system that places many limits
> on what can be done. This allows famfs to tolerate clients with a stale copy
> of metadata. All memory allocation and log maintenance is performed from user
> space, but file extent lists are cached in the kernel for fast fault
> resolution. The current limitations are fairly extreme, but many can be relaxed
> by writing more code, managing Byzantine generals, etc. ;)
>
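
One way to read the stale-metadata point above: if the log is append-only,
a client that has replayed only part of it holds a prefix of the current
state, so what it knows stays valid and it can catch up by replaying the
tail. The sketch below illustrates that idea only. The LogEntry fields, the
SharedLog and Client classes, and the in-process list standing in for shared
memory are all hypothetical; this is not the famfs log format or its user
space API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical log record: a file created with one fixed extent."""
    path: str
    offset: int  # byte offset into the shared memory device
    length: int  # extent length in bytes


class SharedLog:
    """Stand-in for a log in shared memory; entries are only ever appended."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)

    def read_from(self, index):
        return list(self._entries[index:])


class Client:
    """Keeps a local view of file metadata built by replaying the log."""

    def __init__(self, log):
        self.log = log
        self.cursor = 0  # number of log entries already replayed
        self.files = {}  # path -> (offset, length)

    def replay(self):
        """Apply entries appended since the last replay; return how many."""
        new_entries = self.log.read_from(self.cursor)
        for entry in new_entries:
            self.files[entry.path] = (entry.offset, entry.length)
        self.cursor += len(new_entries)
        return len(new_entries)
```

A client that skips a replay simply does not see newer files yet; nothing it
already cached is invalidated, since existing entries are never rewritten.
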
> A famfs-enabled kernel can be cloned at [3], and the user space repo can be
> cloned at [4]. Even with major functional limitations in its current form
> (e.g. famfs does not currently support deleting files), it is sufficient to
> use in data analytics workloads - in which you 1) create a famfs file system,
> 2) dump data sets into it, 3) run clustered jobs that consume the shared data
> sets, and 4) dismount and deallocate the memory containing the file system.
>
> Famfs Open Issues
>
> * Volatile CXL memory is exposed as character dax devices; the famfs patch
>   set adds the iomap API, which is required for fs-dax but until now missing
>   from character dax.
> * (/dev/pmem devices are block, and support the iomap api for fs-dax file
>   systems)
> * /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
>   devices cannot be converted to pmem mode.
> * /dev/dax devices lack the iomap api that fs-dax uses with pmem, so the famfs
>   patch set adds that.
> * VFS layer hooks for a file system on a character device may be needed.
> * Famfs has uncovered some previously latent bugs in the /dev/dax mmap
>   machinery that probably require attention.
> * Famfs currently works with either pmem or devdax devices, but our
>   inclination is to drop pmem support to reduce the complexity of supporting
>   two different underlying device types - particularly since famfs is not
>   intended for actual pmem.
>
>
> Required attendees:
> Dan Williams
> Christian Brauner
> Jonathan Cameron
> Dave Hansen
>
> [LSF/MM + BPF ATTEND]
>
> I am the author of the famfs file system. Famfs was first introduced at LPC
> 2023 [2]. I'm also Micron's voting member on the Software and Systems Working
> Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
> specification.
>
>
> References
>
> [1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@xxxxxxxxxx/#t
> [2] https://lpc.events/event/17/contributions/1455/
> [3] https://www.computeexpresslink.org/download-the-specification
> [4] https://github.com/cxl-micron-reskit/famfs-linux
>

Hi John,

Following our correspondence on your patch set [1], I am not sure that the
details of the famfs file system itself are an interesting topic for the
LSFMM crowd.
What I would like to do is schedule a session on:
"Famfs: new userspace filesystem driver vs. improving FUSE/DAX"

I am hoping that Miklos and Bernd will be able to participate in this
session remotely.

You see, the last time that someone tried to introduce a specialized
faster FUSE replacement [2], the comments from the community were
that the FUSE protocol can and should be improved instead of introducing
another "filesystem in userspace" protocol.

Since 2019, FUSE has gained virtiofs/dax support and, more recently,
FUSE passthrough support, and Bernd is working on FUSE uring [3].

My hope is that you will be able to list the needed improvements
to /dev/dax iomap and FUSE so that you could use the existing
kernel infrastructure and FUSE libraries to implement famfs.

How does that sound for a discussion?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
[2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@xxxxxxxxxx/
[3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@xxxxxxx/




