Re: Btrfs and fanotify filesystem watches

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 23 Nov 2018 15:34:25 +0200

On Fri, Nov 23, 2018 at 2:52 PM Jan Kara <jack@xxxxxxx> wrote:
>
> Changed subject to better match what we discuss and added btrfs list to CC.
>
> On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is associated with
> > > > > > the single super block struct, so all dentries in all subvolumes will
> > > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > > >
> > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > > >
> > > > >         buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > > > >         buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > > > >         /* Mask in the root object ID too, to disambiguate subvols */
> > > > >         buf->f_fsid.val[0] ^=
> > > > >                 BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > > > >         buf->f_fsid.val[1] ^=
> > > > >                 BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > > >
> > > > > So subvolume root is xored into the FSID. Thus dentries from different
> > > > > subvolumes are going to return different fsids...
> > > > >
> > > >
> > > > Right... how could I have missed that :-/
> > > >
> > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > > Instead, I will use:
> > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,...
> > >
> > > So what about my proposal to store fsid in the notification mark when it is
> > > created and then use it when that mark results in event being generated?
> > > When mark is created, we have full path available, so getting proper fsid
> > > is trivial. Furthermore if the behavior is documented like that, it is
> > > fairly easy for userspace to track fsids it should care about and translate
> > > them to proper file descriptors for open_by_handle().
> > >
> > > This would get rid of statfs() on every event creation (which I don't like
> > > much) and also avoids these problems "how to get fsid for inode". What do
> > > you think?
> > >
> >
> > That's interesting. I like the simplicity.
> > What happens when there are 2 btrfs subvols /subvol1 /subvol2
> > and the application obviously doesn't know about this and does:
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> > statfs("/subvol1",...);
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> > statfs("/subvol2",...);
> >
> > Now the second fanotify_mark() call just updates the existing super block
> > mark mask, but does not change the fsid on the mark, so all events will have
> > fsid of subvol1 that was stored when first creating the mark.
>
> Yes.
>
> > fanotify-watch application (for example) hashes the watches (paths) under
> > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> > "watch" (i.e. path).
>
> I agree this can be confusing... but with btrfs fanotify-watch will be
> confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
> on /subvol1 (with fsid A) is going to return also events on inodes from
> /subvol2 (with fsid B). So your current code will return event with fsid B
> which fanotify-watch has no way to match back and can get confused.
>
> So currently application can get events with fsid it has never seen, with
> the code as I suggest it can get "wrong" fsid. That is confusing but still
> somewhat better?

It's not better IMO because the purpose of the fid in event is a unique key
that application can use to match a path it already indexed.
wrong fsid => no match.
Using open_by_handle_at() to resolve unknown fid is a limited solution
and I don't think it is going to work in this case (see below).

>
> The core of the problem is that btrfs does not have "the superblock
> identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
> that we could use.
>
> > And when trying to open_by_handle fid with fhandle from /subvol2
> > using mount_fd of /subvol1, I suppose we can either get ESTALE
> > or a disconnected dentry, because object from /subvol2 cannot
> > have a path inside /subvol1....
>
> So open_by_handle() should work fine even if we get mount_fd of /subvol1
> and handle for inode on /subvol2. mount_fd in open_by_handle() is really
> only used to get the superblock and that is the same.

I don't think it will work fine.
do_handle_to_path() will compose a path from /subvol1 mnt and dentry from
/subvol2. This may resolve to a full path that does not really exist,
so application
cannot match it to anything that is can use name_to_handle_at() to identify.

The whole thing just sounds too messy. At the very least we need more time
to communicate with btrfs developers and figure this out, so I am going to
go with -EXDEV for any attempt to set *any* mark on a group with
FAN_REPORT_FID if fsid of fanotify_mark() path argument is different
from fsid of path->dentry->d_sb->s_root.

We can relax that later if we figure out a better way.

BTW, I am also going to go with -ENODEV for zero fsid (e.g. tmpfs).
tmpfs can be easily fixed to have non zero fsid if filesystem watch on tmpfs is
important.

Thanks,
Amir.