Initial reply to both Amir and Miklos. Sorry for the delay - I took a few
days off after LSFMM and I'm just re-engaging now.

First an observation: these messages are on the famfs v1 patch set thread.
The v2 patch set is at [1]. That is also the default branch now if you
clone the famfs kernel from [2].

Among the biggest changes at v2 is dropping /dev/pmem support and only
supporting /dev/dax (character) devices as backing devs for famfs.

On 24/05/19 08:59AM, Amir Goldstein wrote:
> On Fri, May 17, 2024 at 12:55 PM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >
> > On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > > I'm not virtiofs expert, but I don't think that you are wrong about this.
> > > IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> > > by virtiofs client.
> > >
> > > So what are the gaps between virtiofs and famfs that justify a new filesystem
> > > driver and new userspace API?
> >
> > Let me try to fill in some gaps. I've looked at the famfs driver
> > (even tried to set it up in a VM, but got stuck with the EFI stuff).

I'm happy to help with that if you care - ping me if so; getting a VM running
in EFI mode is not necessary if you reserve the dax memory via memmap=, or
via libvirt xml.

> >
> > - famfs has an extent list per file that indicates how each page
> > within the file should be mapped onto the dax device, IOW it has the
> > following mapping:
> >
> >   [famfs file, offset] -> [offset, length]

More generally, a famfs file extent is [daxdev, offset, len]; there may
be multiple extents per file, and in the future this definitely needs to
generalize to multiple daxdev's.

Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
I think)...

A single backing device (daxdev) will contain extents of many famfs
files (plus metadata - currently a superblock and a log). I'm not sure
it's realistic to have a backing daxdev "open" per famfs file.
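
To make the [daxdev, offset, len] extent description above concrete, here
is a minimal userspace sketch of resolving a file offset through a per-file
extent list. The names (sketch_extent, sketch_resolve) are hypothetical and
are not taken from the famfs patches, and the sketch assumes that extents
are laid out back to back in file order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration; not the famfs kernel structures. */
struct sketch_extent {
	unsigned int daxdev;	/* index of the backing dax device */
	uint64_t offset;	/* byte offset within that device */
	uint64_t len;		/* extent length in bytes */
};

/*
 * Resolve a file offset to the device range that backs it.
 *
 * Assumes extents are stored in file order with no holes. On success,
 * *out describes the contiguous remainder of the matching extent,
 * starting at file_off. Returns 0 on success, or -1 if file_off lies
 * beyond the last extent.
 */
static int sketch_resolve(const struct sketch_extent *ext, size_t n,
			  uint64_t file_off, struct sketch_extent *out)
{
	uint64_t start = 0;	/* file offset at which ext[i] begins */

	assert(out != NULL);

	for (size_t i = 0; i < n; i++) {
		/* file_off >= start always holds here, so no underflow */
		uint64_t delta = file_off - start;

		if (delta < ext[i].len) {
			out->daxdev = ext[i].daxdev;
			out->offset = ext[i].offset + delta;
			out->len = ext[i].len - delta;
			return 0;
		}
		start += ext[i].len;
	}
	return -1;
}
```

The returned range stops at the end of the matching extent, so a caller
that wants to cover a longer span would call this again at the next file
offset. A file spanning more than one daxdev would simply carry different
daxdev values in its extents.
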

In addition there is:

- struct dax_holder_operations - to allow a notify_failure() upcall
  from dax. This provides the critical capability to shut down famfs
  if there are memory errors. This is filesystem- (or technically
  daxdev-) wide.

- The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
  and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
  I strongly suspect that famfs_fuse can't be correct unless it uses
  this path rather than just the idea of a single backing file.
  This interface explicitly supports files that map to disjoint ranges
  of one or more dax devices.

- The dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
  character devdax.

- Note that dax devices, unlike files, don't support read/write - only
  mmap(). I suspect (though I'm still pretty ignorant) that this means
  we can't just treat the dax device as an extent-based backing file.

> >
> > - fuse can currently map a fuse file onto a backing file:
> >
> >   [fuse file] -> [backing file]
> >
> > The interface for the latter is
> >
> >    backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
> > ...
> >    fuse_open_out.flags |= FOPEN_PASSTHROUGH;
> >    fuse_open_out.backing_id = backing_id;
>
> FYI, library and example code was recently merged to libfuse:
> https://github.com/libfuse/libfuse/pull/919
>
> >
> > This looks suitable for doing the famfs file -> dax device mapping as
> > well. I wouldn't extend the ioctl with extent information, since
> > famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
> > device. The flags field could be used to tell the kernel to treat
> > this fd as a dax device instead of a regular file.

A dax device to famfs is a lot more like a backing device for a "filesystem"
than a backing file for another file.
And, as previously mentioned, there are the iomap_ops and holder_ops
interfaces, which deal with multiple file tenants on a dax device and
with error notification, respectively.

Probably doable, but important distinctions...

> >
> > Later, when the file is opened the extent list could be sent in the
> > open reply together with the backing id. The fuse_ext_header
> > mechanism seems suitable for this.
> >
> > And I think that's it as far as API's are concerned.
> >
> > Note: this is already more generic than the current famfs prototype,
> > since multiple dax devices could be used as backing for famfs files,
> > with the constraint that a single file can only map data from a single
> > dax device.
> >
> > As for implementing dax passthrough, I think that needs a separate
> > source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
> > to have many commonalities with this one. That could be renamed to
> > virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.
> >
> > Comments?
>
> Would probably also need to decouple CONFIG_FUSE_DAX
> from CONFIG_FUSE_VIRTIO_DAX.
>
> What about fc->dax_mode (i.e. dax= mount option)?
>
> What about FUSE_IS_DAX()? does it apply to both dax implementations?
>
> Sounds like a decent plan.
> John, let us know if you need help understanding the details.

I'm certain I will need some help, but I'll try to do my part.

First question: can you suggest an example fuse file pass-through
file system that I might use as a jumping-off point? Something that
gets the basic pass-through capability from which to start hacking
in famfs/dax capabilities?

When I started on famfs, I used ramfs because it got me all the basic
file system functionality minus a backing store. Then I built the dax
functionality by referring to xfs.

>
> > Am I missing something significant?
>
> Would we need to set IS_DAX() on inode init time or can we set it
> later on first file open?
>
> Currently, iomodes enforces that all opens are either
> mapped to same backing file or none mapped to backing file:
>
> fuse_inode_uncached_io_start()
> {
> ...
>         /* deny conflicting backing files on same fuse inode */
>
> The iomodes rules will need to be amended to verify that:
> - IS_DAX() inode open is always mapped to backing dax device
> - All files of the same fuse inode are mapped to the same range
>   of backing file/dax device.

I'm confused by the last item. I would think there would be a fuse
inode per famfs file, and that multiple of those would map to separate
extent lists of one or more backing dax devices.

Or maybe I misunderstand the meaning of "fuse inode". Feel free to
assign reading...

>
> Thanks,
> Amir.

Thanks Miklos and Amir,
John

[1] https://lore.kernel.org/linux-fsdevel/cover.1714409084.git.john@xxxxxxxxxx/T/#m3b11e8d311eca80763c7d6f27d43efd1cdba628b
[2] https://github.com/cxl-micron-reskit/famfs-linux