[RFC PATCH v2 00/12] Introduce the famfs shared-memory file system


This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in any way.

* Famfs creates a simple access method for storing and sharing data in
  sharable memory. The memory is exposed and accessed as memory-mappable
  dax files.
* Famfs supports multiple hosts mounting the same file system from the
  same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on a /dev/dax device in devdax mode,
  which rests on dax functionality added in patches 2-7 of this series.

The famfs kernel file system is part of the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, shared memory managed by the famfs
framework does not create a RAS "blast radius" problem that could crash
or destabilize the kernel. Poison or timeouts in famfs memory
can be expected to kill apps via SIGBUS and cause mounts to be disabled
due to memory failure notifications.

Famfs does not attempt to solve concurrency or coherency problems for apps,
although it does solve these problems in regard to its own data structures.
Apps may encounter hard concurrency problems, but there are use cases that
are eminently useful and uncomplicated from a concurrency perspective:
serial sharing is one (only one host at a time has access), and read-only
concurrent sharing is another (all hosts can read-cache without worry).
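
For the read-only concurrent case, a consumer on any host only needs to
map an existing file and read it. The following is a minimal sketch using
plain POSIX calls; map_readonly() is a hypothetical helper name, and
nothing in it is specific to famfs beyond the assumption that the path
points at a file in a mounted famfs file system.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map an existing file read-only and shared.
 * On success returns the mapping and stores its length in *len;
 * on failure returns NULL.
 */
const void *map_readonly(const char *path, size_t *len)
{
	struct stat st;
	void *addr;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}
	addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping remains valid after close */
	if (addr == MAP_FAILED)
		return NULL;
	*len = (size_t)st.st_size;
	return addr;
}
```

Because the mapping is PROT_READ, a stray store faults instead of
modifying data that other hosts may have cached.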


Contents:

* famfs kernel documentation [patch 1]. Note that the evolving famfs user
  documentation is at [2]
* dev_dax_iomap patchset [patches 2-7] - This enables fs-dax to use the
  iomap interface via a character /dev/dax device (e.g. /dev/dax0.0). For
  historical reasons the iomap infrastructure was enabled only for
  /dev/pmem devices (which are dax block devices). As famfs is the first
  fs-dax file system that works on /dev/dax, this patch series fills in
  the bare minimum infrastructure to enable iomap api usage with /dev/dax.
* famfs patchset [patches 8-12] - this introduces the kernel component of
  famfs.

Note that there is a developing consensus that /dev/dax requires
some fundamental re-factoring (e.g. [3]) that is related but outside the
scope of this series.

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
  System-ram gets zeroed when it is onlined, so sharing is basically
  impossible.
* It does not make sense to put struct page's in sharable memory, because
  those can't be shared. However, separately providing non-sharable
  capacity to be used for struct page's might be a sensible approach if the
  size of struct page array for sharable memory is too large to put in
  conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
  order to gain access to data that is already in the memory. Moreover
  the power domain for shared memory is separate from that of the server.
  Having observed that, famfs is not intended for persistent storage. It is
  intended for sharing data sets in memory during a time frame where the
  memory and the compute nodes are expected to remain operational - such
  as during a clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do.
It would probably be possible to add this capability to FUSE, but we think
that keeping famfs separate from FUSE is the simpler approach.
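
To make the fault-path argument concrete, here is a minimal sketch of
resolving a file offset against a cached extent list. It is an
illustration only, not the famfs kernel code: struct ext and ext_lookup()
are hypothetical names, and the sketch assumes extents are stored in file
order and laid end to end.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one contiguous run of a file's data on the dax device. */
struct ext {
	uint64_t dev_off;	/* byte offset on the dax device */
	uint64_t len;		/* length of the run in bytes */
};

/*
 * Resolve a file offset to a device offset by walking the cached extent
 * list in file order. Returns 0 and fills *dev_off and *avail (bytes
 * remaining in that extent from the resolved point), or -1 if the offset
 * lies beyond the last extent.
 */
int ext_lookup(const struct ext *exts, size_t n, uint64_t file_off,
	       uint64_t *dev_off, uint64_t *avail)
{
	uint64_t start = 0;	/* file offset at which exts[i] begins */
	size_t i;

	for (i = 0; i < n; i++) {
		if (file_off < start + exts[i].len) {
			uint64_t delta = file_off - start;

			*dev_off = exts[i].dev_off + delta;
			*avail = exts[i].len - delta;
			return 0;
		}
		start += exts[i].len;
	}
	return -1;
}
```

The lookup touches only memory that is already resident, which is the
property that lets a fault be resolved without a round trip to user space.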

We will be discussing this topic at LSFMM 2024 [5] in a topic called "Famfs:
new userspace filesystem driver vs. improving FUSE/DAX" - but other famfs
related discussion will also be welcome!

This patch set is available as a branch at [6]


[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.stgit@xxxxxxxxxxxxxxxxxxxxxxxxx/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://events.linuxfoundation.org/lsfmmbpf/program/schedule-at-a-glance/
[6] https://github.com/cxl-micron-reskit/famfs-linux/tree/famfs-v2

Changes since RFC v1:

* This patch series is a from-scratch refactor of the original. The code
  that maps a file to a dax device is almost identical, but a lot of
  cleanup has been done.
* The get_tree and backing device handling code has been ripped up and
  re-done (in the get-tree case, based on suggestions from Christian
  Brauner - thanks Christian; I hope I haven't done any new dumb stuff!)
  (Note this code has been extensively tested; after all known error cases
  famfs can be umounted and the module can be unloaded)
* Famfs now 'shuts down' if the dax device reports any memory errors. I/O
  and faults start reporting SIGBUS. Famfs detects memory errors via an
  iomap_ops->notify failure call from the devdax layer. This has been tested
  and appears to disable the famfs file system while leaving it able to
  dismount cleanly.
* Dropped fault counters
* Dropped support for symlinks within a famfs file system; we don't think
  supporting symlinks makes sense with famfs, and it has some undesirable
  side effects, so it's out.
* Dropped support for mknod within a famfs file system (other than regular
  files and directories)
* Famfs magic number moved to magic.h
* Famfs ioctl opcodes now documented in
* Dodgy kerneldoc comments cleaned up or removed; hopefully none added...
* Kconfig formatting cleaned up
* Dropped /dev/pmem support. Prior patch series would mount on either
  /dev/pmem or /dev/dax devices. This is unnecessary complexity since
  /dev/pmem devices can be converted to /dev/dax. Famfs is, however, the
  first file system we know of that mounts from a character device.
* Famfs no longer does a filp_open() of the dax device. It finds the
  device by its dev_t and uses fs_dax_get() to effect exclusivity.
* Added a read-only module param famfs_kabi_version for checking
  that user space was compiled for the same ABI version
* The famfs kernel module (the code in fs/famfs plus the uapi file
  famfs_ioctl.h) dropped from 1030 lines of code in v1 to 760 in v2,
  according to "cloc".
* Fixed issues reported by the kernel test robot
* Many minor improvements in response to v1 code reviews
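
As a sketch of how user space might use the famfs_kabi_version parameter
mentioned above: the code below reads an integer from a text file and
compares it with the version a tool was built for. The helper names are
hypothetical, and both the integer format and the sysfs location (for
example /sys/module/famfs/parameters/famfs_kabi_version) are assumptions
rather than something this series specifies.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read an integer version from a sysfs-style text
 * file. Returns the version, or -1 if the file cannot be read or parsed.
 */
int read_kabi_version(const char *path)
{
	FILE *f = fopen(path, "r");
	int v;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

/* Returns 1 if the file reports the version we were built for, else 0. */
int kabi_matches(const char *path, int built_for)
{
	int v = read_kabi_version(path);

	return v >= 0 && v == built_for;
}
```

A tool could call kabi_matches() before issuing any ioctl and refuse to
proceed on a mismatch.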

John Groves (12):
  famfs: Introduce famfs documentation
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  famfs prep: Add fs/super.c:kill_char_super()
  famfs: module operations & fs_context
  famfs: Introduce inode_operations and super_operations
  famfs: Introduce file_operations read/write
  famfs: Introduce mmap and VM fault handling
  famfs: famfs_ioctl and core file-to-memory mapping logic & iomap_ops

 Documentation/filesystems/famfs.rst           | 135 ++++
 Documentation/filesystems/index.rst           |   1 +
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  11 +
 drivers/dax/Kconfig                           |   6 +
 drivers/dax/bus.c                             | 144 ++++-
 drivers/dax/dax-private.h                     |   1 +
 drivers/dax/device.c                          |  38 +-
 drivers/dax/super.c                           |  33 +-
 fs/Kconfig                                    |   2 +
 fs/Makefile                                   |   1 +
 fs/famfs/Kconfig                              |  10 +
 fs/famfs/Makefile                             |   5 +
 fs/famfs/famfs_file.c                         | 605 ++++++++++++++++++
 fs/famfs/famfs_inode.c                        | 452 +++++++++++++
 fs/famfs/famfs_internal.h                     |  52 ++
 fs/namei.c                                    |   1 +
 fs/super.c                                    |   9 +
 include/linux/dax.h                           |   6 +
 include/linux/fs.h                            |   1 +
 include/uapi/linux/famfs_ioctl.h              |  61 ++
 include/uapi/linux/magic.h                    |   1 +
 22 files changed, 1547 insertions(+), 29 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/famfs/Kconfig
 create mode 100644 fs/famfs/Makefile
 create mode 100644 fs/famfs/famfs_file.c
 create mode 100644 fs/famfs/famfs_inode.c
 create mode 100644 fs/famfs/famfs_internal.h
 create mode 100644 include/uapi/linux/famfs_ioctl.h

base-commit: ed30a4a51bb196781c8058073ea720133a65596f
