Re: [RFC][PATCH 0/2] Monitoring unmounted fs with fanotify

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 19 Apr 2023 19:19:01 +0200

On Tue, Apr 18, 2023 at 06:20:22PM +0300, Amir Goldstein wrote:
> On Tue, Apr 18, 2023 at 5:12 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > On Tue, Apr 18, 2023 at 04:56:40PM +0300, Amir Goldstein wrote:
> > > On Tue, Apr 18, 2023 at 4:33 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Apr 14, 2023 at 09:29:01PM +0300, Amir Goldstein wrote:
> > > > > Jan,
> > > > >
> > > > > Followup on my quest to close the gap with inotify functionality,
> > > > > here is a proposal for FAN_UNMOUNT event.
> > > > >
> > > > > I have had many design questions about this:
> > > >
> > > > I'm going to humbly express what I feel makes sense to me when looking
> > > > at this from a user perspective:
> > > >
> > > > > 1) Should we also report FAN_UNMOUNT for marked inodes and sb
> > > > >    on sb shutdown (same as IN_UNMOUNT)?
> > > >
> > > > My preference would be if this would be a separate event type.
> > > > FAN_SB_SHUTDOWN or something.
> > >
> > > If we implement an event for this at all, I would suggest FAN_IGNORED
> > > or FAN_EVICTED, which has the same meaning as IN_IGNORED.
> > > When you get an event that the watch went away, it could be because of:
> > > 1. watch removed by user
> > > 2. watch removed because inode was evicted (with FAN_MARK_EVICTABLE)
> > > 3. inode deleted
> > > 4. sb shutdown
> > >
> > > IN_IGNORED is generated in all of the above except for inode evict
> > > that is not possible with inotify.
> > >
> > > User can figure out on his own if the inode was deleted or if fs was unmounted,
> > > so there is not really a need for FAN_SB_SHUTDOWN IMO.
> >
> > Ok, sounds good.
> >
> > >
> > > Actually, I think that FAN_IGNORED would be quite useful for the
> > > FAN_MARK_EVICTABLE case, but it is a bit less trivial to implement
> > > than FAN_UNMOUNT was.
> > >
> > > >
> > > > > 2) Should we also report FAN_UNMOUNT on sb mark for any unmounts
> > > > >    of that sb?
> > > >
> > > > I don't think so. It feels to me that if you watch an sb you don't
> > > > necessarily want to watch bind mounts of that sb.
> > > >
> > > > > 3) Should we report also the fid of the mount root? and if we do...
> > > > > 4) Should we report/consider FAN_ONDIR filter?
> > > > >
> > > > > All of the questions above I answered "not unless somebody requests"
> > > > > in this first RFC.
> > > >
> > > > Fwiw, I agree.
> > > >
> > > > >
> > > > > Specifically, I did get a request for an unmount event for containers
> > > > > use case.
> > > > >
> > > > > I have also had doubts regarding the info records.
> > > > > I decided that reporting fsid and mntid is minimum, but couldn't
> > > > > decide if they were better of in a single MNTID record or seprate
> > > > > records.
> > > > >
> > > > > I went with separate records, because:
> > > > > a) FAN_FS_ERROR has set a precendent of separate fid record with
> > > > >    fsid and empty fid, so I followed this precendent
> > > > > b) MNTID record we may want to add later with FAN_REPORT_MNTID
> > > > >    to all the path events, so better that it is independent
> > > >
> > >
> > > Just thought of another reason:
> > >  c) FAN_UNMOUNT does not need to require FAN_REPORT_FID
> > >      so it does not depend on filesystem having a valid f_fsid nor
> > >      exports_ops. In case of "pseudo" fs, FAN_UNMOUNT can report
> > >      only MNTID record (I will amend the patch with this minor change).
> >
> > I see some pseudo fses generate f_fsid, e.g., tmpfs in mm/shmem.c
> 
> tmpfs is not "pseudo" in my eyes, because it implements a great deal of the
> vfs interfaces, including export_ops.

The term "pseudo" is somewhat well-defined though, no? It really just
means that there's no backing device associated with it. So for example,
anything that uses get_tree_nodev() including tmpfs. If erofs is
compiled with fscache support it's even a pseudo fs (TIL).

> 
> and also I fixed its f_fsid recently:
> 59cda49ecf6c shmem: allow reporting fanotify events with file handles on tmpfs

Well thank you for that this has been very useful in userspace already
I've been told.

> 
> > At the risk of putting my foot in my mouth, what's stopping us from
> > making them all support f_fsid?
> 
> Nothing much. Jan had the same opinion [1].

I think that's what we should try to do without having thought too much
about potential edge-cases.

> 
> We could do either:
> 1. use uuid_to_fsid() in vfs_statfs() if fs has set s_uuid and not set f_fsid
> 2. use s_dev as f_fsid in vfs_statfs() if fs did not set f_fsid nor s_uuid
> 3. randomize s_uuid for simple fs (like tmpfs)
> 4. any combination of the above and more
> 
> Note that we will also need to decide what to do with
> name_to_handle_at() for those pseudo fs.

Doing it on the fly during vfs_statfs() feels a bit messy and could
cause bugs. One should never underestimate the possibility that there's
some fs that somehow would get into trouble because of odd behavior.

So switching each fs over to generate a s_uuid seems the prudent thing
to do. Doing it the hard way also forces us to make sure that each
filesystem can deal with this.

It seems that for pseudo fses we can just allocate a new s_uuid for each
instance. So each tmpfs instance - like your patch did - would just get
a new s_uuid.

For kernel internal filesystems - mostly those that use init_pseudo -
the s_uuid would remain stable until the next reboot when it is
regenerated.

Looking around just a little there's some block-backed fses like fat
that have an f_fsid but no s_uuid. So if we give those s_uuid then it'll
mean that the f_fsid isn't generated based on the s_uuid. That should be
ok though and shouldn't matter to userspace.

Afterwards we could probably lift the ext4 and xfs specific ioctls to
retrieve the s_uuid into a generic ioctl to allow userspace to get the
s_uuid.

That's my thinking without having crawled to all possible corner
cases... Also needs documenting that s_uuid is not optional anymore and
explain the difference between pseudo and device-backed fses. I hope
that's not completely naive...

> 
> Quoting Jan from [1]:
> "But otherwise the proposal to make name_to_handle_at() work even for
> filesystems not exportable through NFS makes sense to me. But I guess we
> need some buy-in from VFS maintainers for this." (hint hint).
> 
> Thanks,
> Amir.
> 
> [1] https://lore.kernel.org/linux-fsdevel/20230417162721.ouzs33oh6mb7vtft@quack3/