Re: [RFC][PATCH 0/7] fanotify: add support for more events

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 12 Oct 2016 14:49:10 +0300

On Oct 11, 2016 9:42 PM, "Jan Kara" <jack@xxxxxxx> wrote:
>
> Hi,
>
> On Mon 10-10-16 22:12:57, Amir Goldstein wrote:
> > This series is a prep work for using fanotify to monitor all
> > events in a file system with a single watch.
> >
> > The end result is indented to be an alternative to the recursive
> > inotify watches scheme, which has its problems.
> >
> > This first part adds support for most inotify events to fanotify
> > when watching a directory.
>
> Well, and how is this better than what inotify supports? The proclaimed
> advantage of fanotify() was that you get open file handle in the event
> which prevents races when directory tree changes before you manage to
> process inotify event. However with directory changes, you are again
> starting to report names and so things become racy again. So I don't quite
> see a value of reimplementing this in fanotify.

For this part, I did not intend to add any benefit compared to inotify.
This is prep work to align fanotify with inotify feature set before I begin to
enhance fanotify to surpass inotify feature set. Besides, inotify was deemed
a bad API (rightfully IMO), and since my work requires adding more info to
the event data I expected that any attempt to modify the inotify API
would be rejected and re-directed towards using the newer and more flexible
fanotify API instead.

Also, please note that there is still a subtle and important difference between
inotify dir watch and fanotify dir watch presented by this work.
The child file changes are still reported with fd of the accessed
files themselves.
The dir entry name changes are reported with fd of the dir so there is
not really
any race issues as far as I can tell, only a recorded history of entry
name changes.
In fact, it is easy to report fd of the file in case of create and
move-to events,
but I chose the semantics of identifying the directory for all entry
name changes.

> The only benefit I see is
> that you can watch the whole mountpoint with one watch instead of having to
> add watch to every directory. Is that such a huge win? What is a use case
> for that?
>

Our use case is watching over 10M directories for continuous backup.
Maybe much more and for a workload where most of those directories are
seldom accessed. Inotify pins the directory inode for every watched directory.
This is not very scalable in terms of memory usage.

> > The next part will add support for watching a super block,
> > which is not the same as watching a mount point.
>
> Careful here. In the world of user namespaces and containers you have to be
> really careful so that events from one container don't leak into another
> container despite they live in the same physical filesystem, just a
> different bind mount. I believe chroot / bind mounts was one of the reasons
> why fanotify ended with mountpoints and not with superblocks. But I guess
> Eric or Al remember this better.
>

[cc ebiederman]

There are 2 sides to this argument. On the one hand, leaking file access
information (pid info and read/perm access events for that matter) across
namespaces should be done carefully or not done at all.
On the other hand, no namespace has exclusive ownership of the file system.
If a directory is accessible from several mount points be it different
namespaces or not, file system modifications affect all mounts from which
this directory is visible, so it makes sense to notify listeners from any mount
at least about MODIFY|ATTRIB|CREATE|MOVE|DELETE events in that directory.

Further more, fanotify currently requires CAP_ADMIN and I intend to enforce the
same capabilties required to mount a super block for watching a super block.
So containers with user namespace or with no CAP_ADMIN won't be able to use this
feature, but the container daemon will be able to use it to monitor
all files accessible
to the host. In my eyes this is a usability improvement, not an
information leak.

> > I am posting this WIP to get feedback on the idea and to find
> > out if there are any users out there interested in the improved
> > fanotify capabilities and/or in the super block monitoring
> > use case.
>
> Well, I hope you have some usecases in mind when you propose this ;) I've
> been asked about fanotify for superblocks

For file access events only? or for directory changes as well?
Because that feature is not avaialbe with fanotify mount watch
(neither with my work).

> but so far I think that doing it
> in kernel would be a headache (mostly security-wise) and doing it in
> userspace - watch every mountpoint in /proc/mounts - may be less
> error-prone.

Hmm, so I am all for implementing in userspace and monitoring mounts
would have been much better than monitoring super block.
In fact, our use case is actually recursive directory subtree watch and
super block watch is a compromise, because it requires more filtering
in userspace.
The reason I am aiming for super block watch is actually technical -
all of the directory fsnotify hooks do not have the 'file' or 'path' reference,
only the 'dentry' (and some only have the 'inode'), because they are buried deep
inside vfs helpers. I considered the option of propagating the dir
file reference
into the vfs helpers, but that seems unnecessarily complicated. Especially,
in light of the fact that I believe in the right of all mount namespaces to get
notified of changes in fs visible from their mounts.

Let me state my roadmap, because I realize it is not clear from this WIP:
1. fanotify supports directory events
2. fanotify supports super block watch
3. fanotify optionally reports file handles instead of file descriptors
4. fanotify filters out events on dentries that are not accessible
from the mount
    from which the watch was added

The implication of step #4 above it that the fsnotify infrastructure will report
any event on the super block to the fanotify backend and fanotify will
filter every reported dentry by walking up to root (at least for slow path),
looking for the root dentry of the mount from which the inode mark was added.

For some workloads it will make more sense to do recursive directory watch
the old inotify way and for some workloads and it would make sense to do
a filtered root watch. In either case, vfs performance does not suffer any
degradation outside the scope of the watch (either single dir or sb root).

Does this plan make sense?

Cheers,
Amir.

>
>                                                                 Honza
>
> >
> > Amir Goldstein (7):
> >   fsnotify: pass dentry instead of inode when available
> >   fsnotify: annotate filename events
> >   fanotify: new init flag FAN_EVENT_INFO_PARENT
> >   fanotify: store mount point from which an inode watch was added
> >   fanotify: support events with data type FSNOTIFY_EVENT_DENTRY
> >   fanotify: add support for create/attrib/rename/delete events
> >   fanotify: pass filename info for filename events
> >
> >  fs/notify/fanotify/fanotify.c      | 85 +++++++++++++++++++++++++++++++----
> >  fs/notify/fanotify/fanotify.h      | 24 +++++++++-
> >  fs/notify/fanotify/fanotify_user.c | 92 ++++++++++++++++++++++++++++++++++----
> >  fs/notify/fdinfo.c                 |  4 +-
> >  fs/notify/fsnotify.c               |  2 +-
> >  fs/notify/inode_mark.c             |  1 +
> >  fs/notify/mark.c                   | 15 +++++--
> >  include/linux/fsnotify.h           | 46 ++++++++++++++-----
> >  include/linux/fsnotify_backend.h   | 24 +++++++---
> >  include/uapi/linux/fanotify.h      | 41 ++++++++++++++---
> >  10 files changed, 287 insertions(+), 47 deletions(-)
> >
> > --
> > 2.7.4
> >
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html