Re: thoughts about fanotify and HSM

On Thu 22-09-22 16:03:41, Amir Goldstein wrote:
> On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@xxxxxxx> wrote:
> > On Tue 20-09-22 21:19:25, Amir Goldstein wrote:
> > > For the next steps of the POC, I could do:
> > > - Report FAN_ACCESS_PERM range info to implement random read
> > >   patterns (e.g. unzip -l)
> > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded
> > >   before modifying a read-write HSM cache
> > > - Demo conversion of a read-write FUSE HSM implementation
> > >   (e.g. https://github.com/volga629/davfs2)
> > > - Demo HSM with filesystem mark [*] and a hardcoded test filter
> > >
> > > [*] Note that unlike the case with recursive inotify, this POC HSM
> > > implementation is not racy, because of the lookup permission events.
> > > A filesystem mark is still needed to avoid pinning all the unpopulated
> > > cache tree leaf entries to the inode cache, so that this HSM can work on
> > > a very large scale tree, the same as my original use case for implementing
> > > the filesystem mark.
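To make the steps above concrete, the core building block is the fanotify
pre-content permission event loop. Below is a minimal sketch using only the
API that exists today (FAN_CLASS_PRE_CONTENT with FAN_OPEN_PERM/FAN_ACCESS_PERM);
the range info and FAN_MODIFY_PERM mentioned above are still proposals, and
fill_from_backing_store() as well as the /mnt/cache path are made-up
placeholders, not part of any existing POC:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Hypothetical placeholder: fetch the file's content from the backing
 * store into the local cache before the access is allowed to proceed. */
static void fill_from_backing_store(int fd)
{
    (void)fd;
}

int main(void)
{
    char buf[4096];
    ssize_t len;
    struct fanotify_event_metadata *ev;
    struct fanotify_response resp;
    int group;

    group = fanotify_init(FAN_CLASS_PRE_CONTENT | FAN_CLOEXEC,
                          O_RDONLY | O_LARGEFILE);
    if (group < 0)
        return 1;

    /* Watch the whole filesystem backing the cache tree (path assumed). */
    if (fanotify_mark(group, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                      FAN_OPEN_PERM | FAN_ACCESS_PERM,
                      AT_FDCWD, "/mnt/cache") < 0)
        return 1;

    for (;;) {
        len = read(group, buf, sizeof(buf));
        if (len <= 0)
            break;
        for (ev = (struct fanotify_event_metadata *)buf;
             FAN_EVENT_OK(ev, len); ev = FAN_EVENT_NEXT(ev, len)) {
            if (ev->vers != FANOTIFY_METADATA_VERSION)
                return 1;
            /* Populate content, then let the blocked access continue. */
            fill_from_backing_store(ev->fd);
            resp.fd = ev->fd;
            resp.response = FAN_ALLOW;
            write(group, &resp, sizeof(resp));
            close(ev->fd);
        }
    }
    return 0;
}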
> >
> > Sounds good! Just regarding your concern about pinning - can't you use
> > evictable marks, added on lookup, for the files / dirs you want to track?
> > Maybe it isn't a great design for other reasons but it would save you some
> > event filtering...
> >
> 
> With the current POC, there is no trigger to re-establish the evicted mark,
> because the parent is already populated and has no mark.

So my original thinking was that you'd place a FAN_LOOKUP_PERM mark on top
of the directory tree and then add evictable marks, from the FAN_LOOKUP_PERM
event handler, to all the subdirs that are looked up. That way I'd imagine
you could place evictable marks on all the directories that are used, in a
race-free manner.
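Roughly, the mark placement from such a handler could look like the sketch
below - where FAN_LOOKUP_PERM is only the event proposed in this thread (not
an existing flag), FAN_MARK_EVICTABLE is the evictable mark flag that exists
since 5.19, and the helper name is made up:

#include <sys/fanotify.h>

/*
 * Rough sketch: called from the (proposed) FAN_LOOKUP_PERM event handler
 * to drop an evictable mark on the directory that was just looked up.
 * The mark reports further events on the dir and its children, but does
 * not pin the inode, so the inode can be evicted under memory pressure
 * and the mark is lost together with it.
 */
static int mark_subdir_evictable(int group, int parent_dirfd, const char *name)
{
    return fanotify_mark(group,
                         FAN_MARK_ADD | FAN_MARK_EVICTABLE,
                         FAN_OPEN_PERM | FAN_ACCESS_PERM | FAN_EVENT_ON_CHILD,
                         parent_dirfd, name);
}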

> A hook on instantiation of an inode in the inode cache could fill that gap.
> It could still be useful to filter FAN_INSTANTIATE_PERM events in the
> kernel, but it is not a must, because instantiate is rarer than (say)
> lookup, and the fast lookup path (RCU walk) on populated trees suffers
> almost no overhead when the filesystem is watched.
> 
> Please think about this and let me know if you think that this is a
> direction worth pursuing now, or as a later optimization.

I think an event on instantiate depends too much on kernel internals
instead of obvious filesystem operations. Also, it might be a bit
challenging during startup, when you don't know what is cached and what is
not, so you cannot rely on instantiate events for placing marks. So I'd
leave this for a future optimization.

> > > If what you are looking for is an explanation of why a fanotify HSM
> > > would be better than a FUSE HSM implementation, then there are several
> > > reasons. Performance is at the top of the list. There is this famous
> > > USENIX paper [3] about FUSE passthrough performance.
> > > It is a bit outdated, but many parts are still relevant - you can ask
> > > the Android developers why they decided to work on FUSE-BPF...
> > >
> > > [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
> > >
> > > For me, performance is one of the main concerns, but not the only one,
> > > so I am not entirely convinced that a full FUSE-BPF implementation would
> > > solve all my problems.
> > >
> > > When scaling to many millions of passthrough inodes, resource usage
> > > starts to become a limitation of a FUSE passthrough implementation, and
> > > memory reclaim of a native fs works a lot better than memory reclaim
> > > over FUSE over another native fs.
> > >
> > > When the workload works on the native filesystem, it is also possible to
> > > use native fs features (e.g. XFS ioctls).
> >
> > OK, understood. Out of curiosity: you've mentioned you'd looked into
> > implementing HSM in overlayfs. What are the issues there? I assume
> > performance is very close to native, so that is likely not an issue, and
> > the resource usage you mention above is likely not that bad either. So I
> > guess it is that you don't want to invent hooks for userspace for moving
> > (parts of) files between offline storage and the local cache?
> 
> In a nutshell, when you realize that overlayfs needs userspace hooks
> to cater to HSM, using a stacked fs design becomes quite pointless.
> 
> Performance is not a problem with overlayfs, but as with FUSE,
> all the inodes/dentries in the system are doubled, memory reclaim
> of a layered fs becomes an awkward dance that messes with the
> special logic of the XFS shrinkers, and on top of all this, overlayfs does
> not proxy all the XFS ioctls either.
> 
> The fsnotify hooks are a much better design once you realize that
> the likely() case is to do nothing and incur the least overhead, and
> the unlikely() case of a user hook is rare.

OK, understood. Thanks!
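As a side note, with the API that exists today the populated (i.e. common)
case can also be filtered in the kernel with an evictable ignore mark, so
such files never generate events at all. A rough sketch, not necessarily
what the POC will do:

#include <fcntl.h>
#include <sys/fanotify.h>

/*
 * Rough sketch: once a file is known to be fully populated in the local
 * cache, add an evictable ignore mark so its open/access permission events
 * are filtered in the kernel without pinning its inode. If the inode gets
 * evicted, the ignore mark is lost and can simply be re-added the next
 * time an event for that file is handled.
 */
static int ignore_populated_file(int group, const char *path)
{
    return fanotify_mark(group,
                         FAN_MARK_ADD | FAN_MARK_IGNORED_MASK |
                         FAN_MARK_IGNORED_SURV_MODIFY | FAN_MARK_EVICTABLE,
                         FAN_OPEN_PERM | FAN_ACCESS_PERM,
                         AT_FDCWD, path);
}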

> > The remaining concern I have is that we should demonstrate that the
> > solution is able to scale to millions of inodes (and likely more) because
> > AFAIU those are the sizes current HSM solutions are interested in. I guess
> > this is kind of covered in your last POC step though.
> >
> 
> Well, in $WORK we have performance test setups for those workloads,
> so part of my plan is to convert the in-house FUSE HSM
> to fanotify and make sure that all those tests do not regress.
> But that is neither code nor tests that I can release; I can only report
> back that the POC works and show the building blocks that I used on
> some open source code base.

Even this is useful, I think.

> I plan to do the open source small scale POC first to show the
> building blocks so you could imagine the end results and
> then take the building blocks for a test drive in the real world.
> 
> I have my eye on davfs2 [1] as the code base for the read-write HSM
> POC, but maybe I will find an S3 FUSE fs that could work too.
> I am open to other suggestions.
> 
> [1] https://github.com/volga629/davfs2
> 
> When DeepSpace Storage releases their product on GitHub,
> I will be happy to work with them on a POC with their code
> base, and I bet they could arrange a large scale test setup.
> (hint hint).

:-)

							Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR


