This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in any way.

* Famfs creates a simple access method for storing and sharing data in
  sharable memory. The memory is exposed and accessed as memory-mappable
  dax files.
* Famfs supports multiple hosts mounting the same file system from the
  same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on a /dev/dax device in devdax mode,
  which rests on dax functionality added in patches 2-7 of this series.

The famfs kernel file system is part of the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, shared memory managed by the famfs
framework does not create a RAS "blast radius" problem that could crash or
destabilize the kernel. Poison or timeouts in famfs memory can be expected
to kill apps via SIGBUS and cause mounts to be disabled due to memory
failure notifications.

Famfs does not attempt to solve concurrency or coherency problems for apps,
although it does solve these problems in regard to its own data structures.
Apps may encounter hard concurrency problems, but there are use cases that
are eminently useful and uncomplicated from a concurrency perspective:
serial sharing is one (only one host at a time has access), and read-only
concurrent sharing is another (all hosts can read-cache without worry).

Contents:

* famfs kernel documentation [patch 1]. Note that evolving famfs user
  documentation is at [2]
* dev_dax_iomap patchset [patches 2-7] - This enables fs-dax to use the
  iomap interface via a character /dev/dax device (e.g. /dev/dax0.0).
  For historical reasons the iomap infrastructure was enabled only for
  /dev/pmem devices (which are dax block devices). As famfs is the first
  fs-dax file system that works on /dev/dax, this patch series fills in
  the bare minimum infrastructure to enable iomap API usage with /dev/dax.
* famfs patchset [patches 8-12] - this introduces the kernel component of
  famfs.

Note that there is a developing consensus that /dev/dax requires some
fundamental refactoring (e.g. [3]) that is related but outside the scope
of this series.

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
  System-ram gets zeroed when it is onlined, so sharing is basically
  nonsense.
* It does not make sense to put struct pages in sharable memory, because
  those can't be shared. However, separately providing non-sharable
  capacity to be used for struct pages might be a sensible approach if the
  size of the struct page array for sharable memory is too large to put in
  conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
  order to gain access to data that is already in the memory. Moreover,
  the power domain for shared memory is separate from that of the server.
  That said, famfs is not intended for persistent storage. It is intended
  for sharing data sets in memory during a time frame where the memory and
  the compute nodes are expected to remain operational - such as during a
  clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do. It
would probably be possible to put this capability into FUSE, but we think
that keeping famfs separate from FUSE is the simpler approach.
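To make the fault-handling requirement concrete, here is a minimal
user-space sketch of resolving a file offset against a cached extent list.
It is not famfs code: the struct and function names (sketch_extent,
sketch_file_map, sketch_resolve) are hypothetical, and the assumed layout
(extents stored in file order, each mapping a contiguous byte range of the
dax device) is an illustration only. The point is that once the full
extent list is in memory, a fault can be resolved with a local walk and no
upcall to user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: one contiguous run of file data in dax memory. */
struct sketch_extent {
	uint64_t dax_offset;	/* byte offset into the dax device */
	uint64_t len;		/* length of the run in bytes */
};

/* Hypothetical per-file cache of the complete extent list. */
struct sketch_file_map {
	size_t nextents;
	const struct sketch_extent *ext;	/* stored in file order */
	uint64_t file_size;
};

/*
 * Translate a file offset into a dax device offset using only the cached
 * extent list. On success returns 0 and reports the device offset plus
 * the number of contiguous bytes remaining in that extent. Returns -1 if
 * the offset is beyond the file or not covered by any extent.
 */
static int sketch_resolve(const struct sketch_file_map *map,
			  uint64_t file_off,
			  uint64_t *dev_off, uint64_t *contig_len)
{
	uint64_t start = 0;	/* file offset where the current extent begins */
	size_t i;

	assert(map && dev_off && contig_len);

	if (file_off >= map->file_size)
		return -1;

	for (i = 0; i < map->nextents; i++) {
		const struct sketch_extent *e = &map->ext[i];

		if (file_off < start + e->len) {
			uint64_t delta = file_off - start;

			*dev_off = e->dax_offset + delta;
			*contig_len = e->len - delta;
			return 0;
		}
		start += e->len;
	}
	return -1;
}
```

A caller would build one sketch_file_map per open file when the file is
instantiated and then call sketch_resolve() from its fault path. A linear
walk is used here for brevity; it is an assumption that extent counts per
file are small enough for that to be acceptable.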
We will be discussing the FUSE question at LSFMM 2024 [5] in a session
called "Famfs: new userspace filesystem driver vs. improving FUSE/DAX" -
but other famfs-related discussion will also be welcome!

This patch set is available as a branch at [6]

References

[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.stgit@xxxxxxxxxxxxxxxxxxxxxxxxx/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://events.linuxfoundation.org/lsfmmbpf/program/schedule-at-a-glance/
[6] https://github.com/cxl-micron-reskit/famfs-linux/tree/famfs-v2

Changes since RFC v1:

* This patch series is a from-scratch refactor of the original. The code
  that maps a file to a dax device is almost identical, but a lot of
  cleanup has been done.
* The get_tree and backing device handling code has been ripped up and
  re-done (in the get_tree case, based on suggestions from Christian
  Brauner - thanks Christian; I hope I haven't done any new dumb stuff!)
  (Note this code has been extensively tested; after all known error cases
  famfs can be unmounted and the module can be unloaded)
* Famfs now 'shuts down' if the dax device reports any memory errors. I/O
  and faults start reporting SIGBUS. Famfs detects memory errors via an
  iomap_ops->notify failure call from the devdax layer. This has been
  tested and appears to disable the famfs file system while leaving it
  able to unmount cleanly.
* Dropped fault counters
* Dropped support for symlinks within a famfs file system; we don't think
  supporting symlinks makes sense with famfs, and it has some undesirable
  side effects, so it's out.
* Dropped support for mknod within a famfs file system (other than regular
  files and directories)
* Famfs magic number moved to magic.h
* Famfs ioctl opcodes now documented in
  Documentation/userspace-api/ioctl/ioctl-number.rst
* Dodgy kerneldoc comments cleaned up or removed; hopefully none added...
* Kconfig formatting cleaned up
* Dropped /dev/pmem support. The prior patch series would mount on either
  /dev/pmem or /dev/dax devices. This is unnecessary complexity since
  /dev/pmem devices can be converted to /dev/dax. Famfs is, however, the
  first file system we know of that mounts from a character device.
* Famfs no longer does a filp_open() of the dax device. It finds the
  device by its dev_t and uses fs_dax_get() to effect exclusivity.
* Added a read-only module param famfs_kabi_version for checking that user
  space was compiled for the same ABI version
* The famfs kernel module (the code in fs/famfs plus the uapi file
  famfs_ioctl.h) dropped from 1030 lines of code in v1 to 760 in v2,
  according to "cloc".
* Fixed issues reported by the kernel test robot
* Many minor improvements in response to v1 code reviews

John Groves (12):
  famfs: Introduce famfs documentation
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  famfs prep: Add fs/super.c:kill_char_super()
  famfs: module operations & fs_context
  famfs: Introduce inode_operations and super_operations
  famfs: Introduce file_operations read/write
  famfs: Introduce mmap and VM fault handling
  famfs: famfs_ioctl and core file-to-memory mapping logic & iomap_ops

 Documentation/filesystems/famfs.rst      | 135 ++++
 Documentation/filesystems/index.rst      |   1 +
 .../userspace-api/ioctl/ioctl-number.rst |   1 +
 MAINTAINERS                              |  11 +
 drivers/dax/Kconfig                      |   6 +
 drivers/dax/bus.c                        | 144 ++++-
 drivers/dax/dax-private.h                |   1 +
 drivers/dax/device.c                     |  38 +-
 drivers/dax/super.c                      |  33 +-
 fs/Kconfig                               |   2 +
 fs/Makefile                              |   1 +
 fs/famfs/Kconfig                         |  10 +
 fs/famfs/Makefile                        |   5 +
 fs/famfs/famfs_file.c                    | 605 ++++++++++++++++++
 fs/famfs/famfs_inode.c                   | 452 +++++++++++++
 fs/famfs/famfs_internal.h                |  52 ++
 fs/namei.c                               |   1 +
 fs/super.c                               |   9 +
 include/linux/dax.h                      |   6 +
 include/linux/fs.h                       |   1 +
 include/uapi/linux/famfs_ioctl.h         |  61 ++
 include/uapi/linux/magic.h               |   1 +
 22 files changed, 1547 insertions(+), 29 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/famfs/Kconfig
 create mode 100644 fs/famfs/Makefile
 create mode 100644 fs/famfs/famfs_file.c
 create mode 100644 fs/famfs/famfs_inode.c
 create mode 100644 fs/famfs/famfs_internal.h
 create mode 100644 include/uapi/linux/famfs_ioctl.h

base-commit: ed30a4a51bb196781c8058073ea720133a65596f
--
2.43.0