[RFC] [Design] Planning the famfs port into fs/fuse

Miklos et al.:

This is the first in what will likely be a series of messages intended
to organize a discussion of how famfs should integrate into fuse.

Background

Famfs [1] stands for fabric-attached memory file system, and it is an fs-dax
file system that supports scale-out access to disaggregated shared memory.
Famfs, even as a standalone file system, manages metadata and performs
allocation from user space. But file mapping metadata is fully cached in
the kernel so that vma/mapping faults are handled in-kernel. This provides
"full memory speed" for mmap/read/write.

Famfs [1] was introduced at LPC '23 [2], released as a standalone file
system patch series in 2024 (v1/v2) [3][4], discussed as a possible merge
into fuse at LSFMM '24 [5][6], covered in an LPC '24 talk on
shared memory [7], and discussed at some length in the fuse BOF at LPC '24.

Terminology

* fs/famfs - the standalone famfs kernel module [4]
* fs/fuse - the fuse kernel module(s)
* famfs_fused - the famfs fuse server/daemon (or any fuse server that uses
  these new features)
* famfs userspace - the union of famfs_fused plus any other famfs components
  that live in user space

Adding famfs metadata to fuse

Porting famfs into fuse requires introducing two new classes of metadata
into fs/fuse: file maps (aka fmaps - file-to-devdax extent lists) and devdax
devices. Caching fmaps in the kernel allows servicing vma/mapping faults
in-kernel without an up-call. Fmaps resolve to devdax memory, and fs/fuse
must get exclusive access to a devdax device before faults to it can be
handled via an fmap.

Fully caching fmaps for active files is an absolute requirement for famfs,
in order to perform at "memory speeds".

Devdax devices

Famfs file systems reside on devdax devices; fs/fuse will need to
get exclusive access to these devices, and use the dev_dax_iomap api
to resolve mapping faults for famfs files.

The first devdax device is the primary (device 0) and fs/fuse needs
exclusive access starting at mount time. This is the device where the
primary superblock and metadata log are located. Although the superblock
and log are read and written from user space, famfs exposes them as files
(.meta/.superblock and .meta/.log). This approach avoids layering problems,
since it removes the need to concurrently access a devdax device both raw
(for the superblock and log) and through famfs (for regular file data);
that can't work because the kernel module needs exclusive access to the
devdax devices.
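
To make that concrete, famfs userspace can read the log through the normal
file API rather than touching the raw devdax device. A minimal sketch (the
mount point path, header size, and use of plain read() are illustrative
only):

/* Sketch: famfs userspace reading the start of the metadata log via the
 * exported .meta/.log file instead of the raw devdax device. The mount
 * point and 64-byte header size are placeholders for this example.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int read_log_header(const char *mntpt)
{
	char path[256];
	uint8_t hdr[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "%s/.meta/.log", mntpt);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	n = read(fd, hdr, sizeof(hdr));	/* or mmap() the whole log */
	close(fd);
	return (n == (ssize_t)sizeof(hdr)) ? 0 : -1;
}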

In addition, metadata log entries may introduce additional devdax devices
that will be referenced by subsequent file entries in the log. We need a way
to pass these into fs/fuse post-mount.

File maps (fmaps)

Fmaps come in two flavors, and our design should assume that additional
flavors might arise in the future. The two current flavors are:

* Fmap header with a variable-length list of simple extents
* Fmap header with a variable-length list of interleaved extents
  (each interleaved extent has a header and a variable-length list of
  "strip extents", which are described by the same simple extent structure
  as above)

Passing fmaps into the kernel requires packing the reply message in a sane
way to transport the variable-sized simple extent list, or the compound
variable-sized interleaved extent list. If fuse already has a packing
pattern for putting variable-sized structures in reply messages, please
point me to it. 
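
For the sake of discussion, one possible packing is a fixed header followed
by a flat, variable-length extent payload, with each interleaved extent
expanding into its own header plus its strip extents. To be clear, none of
these structs exist in fuse today; the names and fields below are invented:

/* Strawman wire format for carrying an fmap in a fuse reply payload.
 * Hypothetical, for discussion only - not existing fuse ABI.
 */
#include <stdint.h>

struct fmap_msg_simple_ext {
	uint64_t devdax_index;	/* index into the daxdev table */
	uint64_t dev_offset;	/* offset within that daxdev */
	uint64_t length;
};

struct fmap_msg_interleaved_hdr {
	uint64_t nstrips;	/* strip extents that follow this header */
	uint64_t chunk_size;	/* interleave chunk size (invented field) */
	/* struct fmap_msg_simple_ext strips[nstrips] follows */
};

struct fmap_msg_header {
	uint32_t version;
	uint32_t extent_type;	/* simple vs. interleaved */
	uint64_t nextents;	/* simple extents, or interleaved extents */
	/* variable-length extent payload follows */
};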

Details on how fmaps actually work (skip if you don't care)

Note: the fs/famfs links to code on github are to a newer version of
fs/famfs than has been posted to the lists.

In fs/famfs, fmaps are passed into the kernel via the
FAMFSIOC_MAP_CREATE_V2:famfs_file_init_dax_v2() path [8]. This is a
per-file ioctl. Relevant structures are at [9]. The relevant simple and
interleaved structs come as a union after the common header in struct
famfs_ioc_fmap.

With simple extents, struct famfs_ioc_fmap is retrieved via
copy_from_user(), and then the famfs_ioc_simple_extent array is retrieved,
with its count taken from famfs_ioc_fmap->fioc_nextents in the header
struct. See the
SIMPLE_DAX_EXTENT case in famfs_meta_alloc_v2() at [10].
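
Paraphrasing that path (structure and field names per the uapi header at
[9]; validation and the exact location of the user-space extent array are
abbreviated), the pattern is a fixed-size header copy followed by a second
copy sized from the header:

/* Approximate sketch of the simple-extent ioctl path: copy the fixed
 * famfs_ioc_fmap header, then copy the extent array whose count comes
 * from fioc_nextents. "uext" stands in for wherever the user-space
 * extent array actually lives per the uapi header at [9].
 */
#include <linux/famfs_ioctl.h>	/* uapi structs, per [9] */
#include <linux/slab.h>
#include <linux/uaccess.h>

static int famfs_copy_simple_extents(const void __user *uarg,
				     const void __user *uext,
				     struct famfs_ioc_fmap *hdr,
				     struct famfs_ioc_simple_extent **out)
{
	struct famfs_ioc_simple_extent *kext;

	if (copy_from_user(hdr, uarg, sizeof(*hdr)))
		return -EFAULT;

	kext = kcalloc(hdr->fioc_nextents, sizeof(*kext), GFP_KERNEL);
	if (!kext)
		return -ENOMEM;

	if (copy_from_user(kext, uext,
			   hdr->fioc_nextents * sizeof(*kext))) {
		kfree(kext);
		return -EFAULT;
	}
	*out = kext;
	return 0;
}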

Interleaved extents are a bit more complicated, but they are reasonably
well-documented in a big comment if you scroll up from [9].

The number of interleaved extents is in famfs_ioc_fmap->fioc_niext, and
each interleaved extent carries its own strip count for its array of strip
extents (the strip count is famfs_ioc_interleaved_ext->ie_nstrips).
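
So the total payload size can't be computed from the top-level header alone;
it requires walking the interleaved extents. Roughly (struct names as in the
ioctl header, exact layout per [9]):

/* Rough size accounting for an interleaved fmap payload: each interleaved
 * extent contributes its own header plus ie_nstrips strip extents.
 */
#include <linux/famfs_ioctl.h>	/* uapi structs, per [9] */

static size_t famfs_interleaved_payload_size(
		const struct famfs_ioc_interleaved_ext *iext,
		unsigned int niext)
{
	size_t size = 0;
	unsigned int i;

	for (i = 0; i < niext; i++)
		size += sizeof(iext[i]) +
			iext[i].ie_nstrips *
			sizeof(struct famfs_ioc_simple_extent);
	return size;
}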

In fuse, we need to put this amalgamation into a variable-sized message
payload. I think it's important for the method of serialization into
messages not to apply arbitrary limits to extent or strip counts - although
a total size limit of no less than 4K might be okay.
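
For scale: if a serialized simple extent is on the order of 24 bytes (three
64-bit fields - an assumption for illustration, not the actual uapi layout),
a 4K payload minus a small header holds roughly 170 simple extents, and
correspondingly fewer interleaved extents once the per-extent headers and
strip lists are accounted for.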

I eagerly await suggestions from Miklos or others as to how best to do this,
but I won't go further down this rathole now ;)

Using fmaps to service faults 

Of course the ABI-constrained interchange format (famfs_ioc_fmap) isn't
optimal for in-memory metadata (which should be able to evolve without
breaking the ABI), so it's transmogrified into struct famfs_file_meta
in fs/famfs, which contains a union of extent types (see [11]).

The basic dax/iomap fault handling extent type uses these structs to
translate a file offset to an offset on a daxdev.

A simple extent fault is handled in fs/famfs by famfs_meta_to_dax_offset()
([12]). If the extent type is INTERLEAVED_EXTENT, that function
calls famfs_meta_to_dax_offset_v2() (scroll up from [12]), which resolves a
mapping fault via an interleaved extent.
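
The simple-extent case amounts to walking the extent list and subtracting
extent lengths until the fault offset lands inside one. A minimal sketch,
using stand-in structures rather than the actual famfs_file_meta layout:

/* Sketch of simple-extent offset resolution: walk the extent list until
 * the file offset falls inside an extent, then return the corresponding
 * daxdev offset. The structures are stand-ins, not famfs_file_meta.
 */
#include <linux/errno.h>
#include <linux/types.h>

struct sketch_extent {
	u64 dev_offset;		/* offset within the daxdev */
	u64 length;
};

static int sketch_meta_to_dax_offset(const struct sketch_extent *ext,
				     int next, u64 file_offset,
				     u64 *dax_offset)
{
	int i;

	for (i = 0; i < next; i++) {
		if (file_offset < ext[i].length) {
			*dax_offset = ext[i].dev_offset + file_offset;
			return 0;
		}
		file_offset -= ext[i].length;
	}
	return -EINVAL;		/* offset beyond the mapped extents */
}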

This fault handling code will need to migrate into fuse_famfs.c (or
fuse_dax_iomap.c, whatever...) in order to handle file mapping faults
efficiently. Because we're enabling memory here, it must run at memory
speeds.

So how does the new metadata integrate into fs/fuse?

One answer would be to attach a famfs flag to files at lookup time, and have
fs/fuse send a new message to famfs_fused to retrieve the fmap (with
the recipe for serialization of the fmap into the reply message being TBD
as of now). Stefan suggested this new message when we spoke in early
October.

That strikes me as more practical than putting the fmap thingy into the
lookup reply "if needed" - but that is also an option if it's not somehow
impractical or fubar.
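
To make the first option concrete, the shape might be something like the
following. The opcode and struct names are invented for discussion; nothing
like this exists in fuse today, and the nodeid would presumably come from
the standard fuse request header anyway:

/* Strawman for a new fuse message that fetches the fmap for an inode
 * flagged as a famfs/dax file at lookup time. Hypothetical names/fields.
 */
#include <stdint.h>

#define FUSE_GET_FMAP_HYPOTHETICAL	1000	/* placeholder, not a real opcode */

struct fuse_get_fmap_in {
	uint64_t flags;		/* room for growth; the inode comes from
				 * the standard fuse request header */
};

struct fuse_get_fmap_out {
	uint32_t fmap_version;
	uint32_t extent_type;	/* simple vs. interleaved */
	uint64_t payload_len;	/* bytes of serialized extent data that
				 * follow, format TBD as discussed above */
};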

OK, what about multiple devdax devices?

The first devdax device is special, in that it's needed at mount time.
The current fs/famfs gets exclusive access to the root daxdev in
famfs_get_tree(), by calling fs_dax_get(...&holder_ops). Note that
fs_dax_get() is patched in by the dev_dax_iomap portion of the fs/famfs
patch set [4], and will be needed in the famfs-fuse patch set as well,
as the iomap api has not previously been available for devdax devices.

I haven't started tackling how this will be hooked into fs/fuse, but it is
a new thing for fs/fuse. 
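
For reference, a paraphrase of the fs/famfs mount-time pattern (the mount
path is in famfs_inode.c [13]); fs_dax_get() comes from the dev_dax_iomap
prerequisite patches, so its exact signature here is approximate, as are
the famfs_* names:

/* Paraphrased sketch: claim exclusive access to the root devdax device at
 * mount time via fs_dax_get() and a dax holder. fs_dax_get() is not an
 * upstream API yet; argument list and callback names are illustrative.
 */
#include <linux/dax.h>
#include <linux/errno.h>

static const struct dax_holder_operations famfs_dax_holder_ops = {
	.notify_failure	= famfs_dax_notify_failure,	/* memory-failure hook */
};

static int famfs_claim_rootdev(struct famfs_fs_info *fsi, dev_t devt)
{
	struct dax_device *dax_dev;

	dax_dev = fs_dax_get(devt, fsi, &famfs_dax_holder_ops);
	if (IS_ERR_OR_NULL(dax_dev))
		return -ENODEV;

	fsi->dax_dev = dax_dev;		/* released again at unmount */
	return 0;
}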

If we pretend that's covered, the next issue arises when fs/fuse
looks up a famfs file that references a devdax device that hasn't been
referenced previously. The first such occurrence will require providing
daxdev(0) to fs/fuse, and additional instances will require providing
daxdevs 1 onward. 

In user space, daxdevs are known by their uuids, but they are known to
fmaps by indices (0 being the primary device, and uuid/index mappings will
be managed by the famfs userspace). It seems straightforward for fs/fuse -
when it encounters an fmap that references an as-yet-unknown daxdev index -
to send a new message to famfs_fused requesting that daxdev info by index,
and receive the info in the reply. This approach would avoid the need for
famfs_fused to be stateful wrt which daxdevs are known to fs/fuse.
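
A strawman for that request/reply pair (again, invented names and fields,
not existing fuse ABI):

/* Hypothetical message for resolving a daxdev index to a device the kernel
 * can claim: fs/fuse sends the index, famfs_fused replies with enough
 * information to identify the devdax device.
 */
#include <stdint.h>

struct fuse_get_daxdev_in {
	uint32_t daxdev_index;	/* index as referenced by fmap extents */
	uint32_t padding;
};

struct fuse_get_daxdev_out {
	uint8_t  uuid[16];	/* uuid as known to famfs userspace */
	uint64_t size;		/* device size in bytes */
	char	 name[32];	/* e.g. "dax1.0"; how fs/fuse actually opens
				 * and claims the device is an open question */
};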

There are other ways this could be handled, and I'm open to input here - but
I claim it won't work to require that the full daxdev list be populated at
mount time - because adding them later is a very sane use case.

I hope I've done a decent job of framing the initial design problems.
I'm happy to answer questions if anything isn't clear - and I'm
looking forward to your feedback - the more specific the better.

Thanks,
John


[1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
[2] https://lpc.events/event/17/contributions/1455/
[3] https://lore.kernel.org/all/8bd01ff0-235f-4aa8-883d-4b71b505b74d@xxxxxxxxxxxxx/T/#m27639915e97443186b3ade9d1e94423bc58e6e22
[4] https://lore.kernel.org/linux-cxl/20240430-badeverbot-paletten-05442cfbbdf0@brauner/T/#mb75fb6522045dca2000d854cfa30de4006a96817
[5] https://www.youtube.com/watch?v=nMaZhXJJgmU
[6] https://lwn.net/Articles/983105/
[7] https://lpc.events/event/18/contributions/1827/
[8] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_file.c#L482
[9] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/include/uapi/linux/famfs_ioctl.h#L102
[10] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_file.c#L296
[11] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_internal.h#L18
[12] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_file.c#L772
[13] https://github.com/cxl-micron-reskit/famfs-linux/blob/v202410/fs/famfs/famfs_inode.c#L252



