Re: thoughts about fanotify and HSM

Amir Goldstein <amir73il@xxxxxxxxx> · Tue, 20 Sep 2022 21:19:25 +0300

On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> > > > > So I'd prefer to avoid the major API
> > > > > extension unless there are serious users out there - perhaps we will even
> > > > > need to develop the kernel API in cooperation with the userspace part to
> > > > > verify the result is actually usable and useful.
> > >
> > > Yap. It should be trivial to implement a "mirror" HSM backend.
> > > For example, the libprojfs [5] projects implements a MirrorProvider
> > > backend for the Microsoft ProjFS [6] HSM API.
> >
> > Well, validating that things work using some simple backend is one thing
> > but we are probably also interested in whether the result is practical to
> > use - i.e., whether the performance meets the needs, whether the API is not
> > cumbersome for what HSM solutions need to do, whether the more advanced
> > features like range-support are useful the way they are implemented etc.
> > We can verify some of these things with simple mirror HSM backend but I'm
> > afraid some of the problems may become apparent only once someone actually
> > uses the result in practice and for that we need a userspace counterpart
> > that does actually something useful so that people have motivation to use
> > it :).
>

Hi Jan,

I wanted to give an update on the POC that I am working on.
I decided to find a FUSE HSM and show how it may be converted
to use fanotify HSM hooks.

HTTPDirFS is a read-only FUSE filesystem that lazily populates a local
cache from a remote HTTP server on first access to each directory and
file range.

Normally, it would be run like this:
./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/

Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements
passthrough calls to the local cache dir if cache is already populated.

After my conversion patches [1], this download-only HSM can be run like
this without mounting FUSE:

sudo ./httpdirfs --fanotify --cache-location /vdf/cache https://cdn.kernel.org/pub/ -

[1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content

Browsing the cache directory at /vdf/cache lazily populates the local cache
using FAN_ACCESS_PERM readdir hooks and lazily downloads file content
using FAN_ACCESS_PERM read hooks.

Up to this point, the implementation did not require any kernel changes.
However, this type of command does not populate the path components,
because lookup does not generate a FAN_ACCESS_PERM event:

stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz

To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM
event [2] and used it to lazily populate directories in the path ancestry.
For now, I stuck with the XXX_PERM convention and did not require
FAN_CLASS_PRE_CONTENT, although we probably should.

[2] https://github.com/amir73il/linux/commits/fanotify_pre_content

Streaming read of large files works as well, but only for sequential read
patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events
do not (yet) carry range info, so my naive implementation downloads
one extra data chunk on each FAN_ACCESS_PERM until the cache file is full.
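
To make the naive scheme concrete, here is a minimal sketch of the
"one more chunk per event" decision. It is an illustration only, not the
code from [1]: struct fetch_range, next_chunk() and the chunk size
parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte range [start, start + len) to fetch from the remote. */
struct fetch_range {
    uint64_t start;
    uint64_t len;       /* 0 means the cache file is already complete */
};

/* Called once per FAN_ACCESS_PERM event on a file. Without range info in
 * the event, the only safe guess is "the reader wants the next bytes", so
 * extend the cached prefix by one chunk, clamped to the file size. */
struct fetch_range next_chunk(uint64_t cached_bytes, uint64_t file_size,
                              uint64_t chunk_size)
{
    assert(chunk_size > 0);

    struct fetch_range r = { .start = cached_bytes, .len = 0 };
    if (cached_bytes >= file_size)
        return r;                       /* nothing left to download */

    uint64_t remaining = file_size - cached_bytes;
    r.len = remaining < chunk_size ? remaining : chunk_size;
    return r;
}
```

A sequential reader never gets ahead of the cached prefix this way, but a
reader that seeks past it would need the event to say which range it wants.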

This makes it possible to run commands like:

tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz | less

without having to wait for the entire 400MB file to download before
seeing the first page.

This streaming feature is extremely important for modern HSMs
that are often used to archive large media files in the cloud.

For the next steps of the POC, I could do:
- Report FAN_ACCESS_PERM range info to implement random read
  patterns (e.g. unzip -l)
- Introduce FAN_MODIFY_PERM, so file content could be downloaded
  before modifying a read-write HSM cache
- Demo conversion of a read-write FUSE HSM implementation
  (e.g. https://github.com/volga629/davfs2)
- Demo HSM with filesystem mark [*] and a hardcoded test filter

[*] Note that unlike the case with recursive inotify, this POC HSM
implementation is not racy, because of the lookup permission events.
A filesystem mark is still needed to avoid pinning all the unpopulated
cache tree leaf entries in the inode cache, so that this HSM could work on
a very large scale tree, the same as my original use case for implementing
filesystem mark.

If what you are looking for is an explanation of why a fanotify HSM would be
better than a FUSE HSM implementation, then there are several reasons.
Performance is at the top of the list. There is this famous USENIX paper [3]
about FUSE passthrough performance.
It is a bit outdated, but many parts are still relevant - you can ask the
Android developers why they decided to work on FUSE-BPF...

[3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

For me, performance is one of the main concerns, but not the only one,
so I am not entirely convinced that a full FUSE-BPF implementation would
solve all my problems.

When scaling to many millions of passthrough inodes, resource usage starts
to become a limitation of a FUSE passthrough implementation, and memory
reclaim of a native fs works a lot better than memory reclaim over FUSE over
another native fs.

When the workload works on the native filesystem, it is also possible to
use native fs features (e.g. XFS ioctls).

Questions:
- What do you think about the direction this POC has taken so far?
- Is there anything specific that you would like to see in the POC
  to be convinced that this API will be useful?

Thanks,
Amir.