On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> > > > > So I'd prefer to avoid the major API
> > > > > extension unless there are serious users out there - perhaps we will even
> > > > > need to develop the kernel API in cooperation with the userspace part to
> > > > > verify the result is actually usable and useful.
> > >
> > > Yap. It should be trivial to implement a "mirror" HSM backend.
> > > For example, the libprojfs [5] projects implements a MirrorProvider
> > > backend for the Microsoft ProjFS [6] HSM API.
> >
> > Well, validating that things work using some simple backend is one thing
> > but we are probably also interested in whether the result is practical to
> > use - i.e., whether the performance meets the needs, whether the API is not
> > cumbersome for what HSM solutions need to do, whether the more advanced
> > features like range-support are useful the way they are implemented etc.
> > We can verify some of these things with simple mirror HSM backend but I'm
> > afraid some of the problems may become apparent only once someone actually
> > uses the result in practice and for that we need a userspace counterpart
> > that does actually something useful so that people have motivation to use
> > it :).
>

Hi Jan,

I wanted to give an update on the POC that I am working on.
I decided to find a FUSE HSM and show how it may be converted
to use fanotify HSM hooks.

HTTPDirFS is a read-only FUSE filesystem that lazily populates a local
cache from a remote HTTP server on first access to every directory and
file range.

Normally, it would be run like this:

  ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/

Content is accessed via the FUSE mount at /mnt/pub/, and FUSE implements
passthrough calls to the local cache dir if the cache is already
populated.

After my conversion patches [1], this download-only HSM can be run like
this without mounting FUSE:

  sudo ./httpdirfs --fanotify --cache-location /vdf/cache https://cdn.kernel.org/pub/ -

[1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content

Browsing the cache directory at /vdf/cache lazily populates the local
cache using FAN_ACCESS_PERM readdir hooks and lazily downloads file
content using FAN_ACCESS_PERM read hooks.

Up to this point, the implementation did not require any kernel changes.
However, this type of command does not populate the path components,
because lookup does not generate a FAN_ACCESS_PERM event:

  stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz

To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM
event [2] and used it to lazily populate directories in the path
ancestry. For now, I stuck with the XXX_PERM convention and did not
require FAN_CLASS_PRE_CONTENT, although we probably should.

[2] https://github.com/amir73il/linux/commits/fanotify_pre_content

Streaming read of large files works as well, but only for sequential
read patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events
do not (yet) carry range info, so my naive implementation downloads
one extra data chunk on each FAN_ACCESS_PERM event until the cache file
is full.

This makes it possible to run commands like:

  tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz | less

without having to wait for the entire 400MB file to download before
seeing the first page.

This streaming feature is extremely important for modern HSMs
that are often used to archive large media files in the cloud.

For the next steps of the POC, I could do:
- Report FAN_ACCESS_PERM range info to implement random read
  patterns (e.g. unzip -l)
- Introduce FAN_MODIFY_PERM, so file content could be downloaded
  before modifying a read-write HSM cache
- Demo conversion of a read-write FUSE HSM implementation
  (e.g. https://github.com/volga629/davfs2)
- Demo HSM with filesystem mark [*] and a hardcoded test filter

[*] Note that unlike the case with recursive inotify, this POC HSM
implementation is not racy, because of the lookup permission events.
A filesystem mark is still needed to avoid pinning all the unpopulated
cache tree leaf entries to the inode cache, so that this HSM could work
on a very large scale tree, the same as my original use case for
implementing filesystem mark.

If what you are looking for is an explanation of why a fanotify HSM
would be better than a FUSE HSM implementation, then there are several
reasons. Performance is at the top of the list. There is this famous
USENIX paper [3] about FUSE passthrough performance.
It is a bit outdated, but many parts are still relevant - you can ask
the Android developers why they decided to work on FUSE-BPF...

[3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

For me, performance is one of the main concerns, but not the only one,
so I am not entirely convinced that a full FUSE-BPF implementation would
solve all my problems.

When scaling to many millions of passthrough inodes, resource usage
starts becoming a limitation of a FUSE passthrough implementation, and
memory reclaim of a native fs works a lot better than memory reclaim
over FUSE over another native fs.

When the workload works on the native filesystem, it is also possible
to use native fs features (e.g. XFS ioctls).

Questions:
- What do you think about the direction this POC has taken so far?
- Is there anything specific that you would like to see in the POC
  to be convinced that this API will be useful?

Thanks,
Amir.