On Wed, Nov 25, 2020 at 1:01 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Tue 24-11-20 16:47:41, Amir Goldstein wrote:
> > On Tue, Nov 24, 2020 at 3:49 PM Jan Kara <jack@xxxxxxx> wrote:
> > > On Mon 09-11-20 20:00:16, Amir Goldstein wrote:
> > > > A filesystem view is a subtree of a filesystem accessible from a specific
> > > > mount point. When marking an FS view, the user expects to get events on
> > > > all inodes that are accessible from the marked mount, even if the events
> > > > were generated from another mount.
> > > >
> > > > In particular, events such as FAN_CREATE, FAN_MOVE, FAN_DELETE that
> > > > are not delivered to a mount mark can be delivered to an FS view mark.
> > > >
> > > > One example of a filesystem view is a btrfs subvolume, which cannot be
> > > > marked with a regular filesystem mark.
> > > >
> > > > Another example of a filesystem view is a bind mount, not on the root of
> > > > the filesystem, such as the bind mounts used for containers.
> > > >
> > > > A filesystem view mark is composed of a head sb mark and an sb_view mark.
> > > > The filesystem view mark is connected to the head sb mark and the head
> > > > sb mark is connected to the sb object. The mask of the head sb mark is
> > > > a cumulative mask of all the associated sb_view mark masks.
> > > >
> > > > Filesystem view marks cannot co-exist with a regular filesystem mark on
> > > > the same filesystem.
> > > >
> > > > When an event is generated on the head sb mark, fsnotify iterates the
> > > > list of associated sb_view marks and filters out events that happen
> > > > outside of the sb_view mount's root.
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > >
> > > I gave this just a high-level look (no detailed review) and here are my
> > > thoughts:
> > >
> > > 1) I like the functionality. IMO this is what a lot of people really want
> > > when looking for "filesystem wide fs monitoring".
> > >
> > > 2) I don't quite like the API you propose though. IMO it exposes details
> > > of the implementation in the API. I'd rather have the API be the same as
> > > for mount marks but with a dedicated mark type flag in the API - like
> > > FAN_MARK_FILESYSTEM_SUBTREE (or we can keep VIEW if you like it but I
> > > think the fewer terms the better ;).
> >
> > Sure, FAN_MARK_FS_VIEW is a dedicated mark type.
> > The fact that it is a bitwise OR of MOUNT and FILESYSTEM is just a fun fact.
> > Sorry if that wasn't clear.
> > FAN_MARK_FILESYSTEM_SUBTREE sounds better for uapi.
> >
> > But I suppose you also meant that we should not limit the subtree root
> > to bind mount points?
> >
> > The reason I used a reference to mnt for an sb_view and not a dentry
> > is because we have the fsnotify_clear_marks_by_mount() callback to
> > handle cleanup of the sb_view marks (which I left as TODO).
> >
> > Alternatively, we can play cache pinning games with the subtree root
> > dentry like the case with inode marks, but I didn't want to get into
> > that, nor did I know if we should - if a subtree mark requires
> > CAP_SYS_ADMIN anyway, why not require a bind mount as its target,
> > which is something much more visible to admins.
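
To make that concrete, marking a container's bind mounted rootfs with the
proposed flag would look roughly like the sketch below. This is only a
usage sketch, not code from the patch: FAN_MARK_FS_VIEW (a.k.a.
FAN_MARK_FILESYSTEM_SUBTREE) is proposed by this series and is not
upstream uapi, and the fallback define just spells out the
MOUNT|FILESYSTEM composition mentioned above.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/fanotify.h>

/* Proposed flag, not upstream uapi - mirrors the MOUNT|FILESYSTEM OR */
#ifndef FAN_MARK_FS_VIEW
#define FAN_MARK_FS_VIEW	(FAN_MARK_MOUNT | FAN_MARK_FILESYSTEM)
#endif

int main(int argc, char *argv[])
{
	/* Root of the view, e.g. a container's bind mounted rootfs */
	const char *path = argc > 1 ? argv[1] : "/mnt/container-root";
	int fd;

	/* fid/name reporting is needed for FAN_CREATE/FAN_DELETE/FAN_MOVE */
	fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_DFID_NAME, O_RDONLY);
	if (fd < 0) {
		perror("fanotify_init");
		exit(1);
	}

	/*
	 * Mark the filesystem view rooted at the bind mount @path.
	 * Events below the mount's root dentry should be delivered even
	 * if they were generated via another mount of the same sb.
	 */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FS_VIEW,
			  FAN_CREATE | FAN_DELETE | FAN_MOVE | FAN_ONDIR,
			  AT_FDCWD, path)) {
		perror("fanotify_mark");
		exit(1);
	}

	/* ... read and handle events from fd ... */
	return 0;
}

With the bind mount restriction, the path passed to fanotify_mark() is just
the bind mount target that the admin already set up for the container, so
the listener does not need to know anything about the underlying sb layout.
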
>
> Yeah, I don't have problems with bind mounts in particular. Just I was
> thinking that conceivably we could make these marks less privileged (just
> with CAP_DAC_SEARCH or so) and then mountpoints may be unnecessarily
> restricting. I don't think pinning of the subtree root dentry would be
> problematic as such - inode marks pin the inode anyway, this is not
> substantially different - if we can make it work reliably...
>

I'm going to need your help for that ;-)

> In fact I was considering for a while that we could even make subtree
> watches completely unprivileged - when we walk the dir tree anyway, we
> could also check permissions along the way. Due to locking, this would be
> difficult to do when generating the event, but it might actually be doable
> if we perform the permission check when reporting the event to userspace.
> Just food for thought...

Maybe... but there are some other advantages to restricting to a mount.

One is that with btrfs subvolumes, conn->fsid can actually cache the
subvolume's fsid and we could remove the restriction of the -EXDEV error
of FAN_MARK_FILESYSTEM on a subvolume.

Another has to do with performance optimization (see below).

> > > Also I think this is going to get expensive
> > > (e.g. imagine each write to page cache having to traverse a potentially
> > > deep tree hierarchy, potentially multiple times - once for each subtree).
> > > My suspicion should be verified with actual performance measurement, but
> > > if I'm right and the overhead is indeed noticeable, I think we'll need to
> > > employ some caching or so to avoid repeated lookups...
> >
> > It's true, but here is a different angle to analyse the overhead - claim:
> > "If users don't have a kernel subtree mark, they will use a filesystem mark
> > and filter in userspace". If that claim is true, then one can argue that
> > this is fair - let the listener process pay the CPU overhead, which can be
> > contained inside a cgroup, and not everyone else. But what is the cost that
> > everyone else will be paying in that case?
> > Everyone will still pay the cost of the fanotify backend callback,
> > including allocating, packing and queueing the event.
> > The question then becomes, what is cheaper? The d_ancestor() traversal
> > or all the fanotify backend handler code?
> > Note that the former can be lockless and the latter does take locks.
>
> I understand and it's a fair point that queueing of the event is not cheap
> either, so I'd be interested in the numbers. Something like: how deep a
> subtree walk has a cost similar to queueing an event? Or similarly: checking
> how many subtree watches is as costly as queueing one event?
>

The cost shouldn't actually be sensitive to the number of subtree watches.
We should never do more than a single ancestor traversal per event, up to
its sb root, and for each ancestor we should be able to check in O(1)
whether a subtree watch exists on that inode/dentry.

If we stay with the bind mount restriction, then we can use the
d_mountpoint() check on every ancestor, which practically reduces the cost
per ancestor to zero in most cases.
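
Something along these lines (a sketch, not code from the patch;
event_in_marked_view() and sb_view_mark_on() are hypothetical names, the
latter standing in for an O(1) lookup of an sb_view mark rooted at the
dentry, and the locking/RCU details of the walk are omitted):

#include <linux/dcache.h>
#include <linux/types.h>

/* Hypothetical O(1) lookup of an sb_view mark rooted at @dentry */
static bool sb_view_mark_on(struct dentry *dentry);

static bool event_in_marked_view(struct dentry *dentry)
{
	struct dentry *d = dget(dentry);
	struct dentry *parent;
	bool found = false;

	/* A single walk from the object's dentry up to the sb root */
	while (1) {
		/*
		 * With the bind mount restriction, a subtree root is a
		 * mountpoint, so d_mountpoint() (a d_flags test) gates
		 * the mark lookup.
		 */
		if (d_mountpoint(d) && sb_view_mark_on(d)) {
			found = true;
			break;
		}
		if (IS_ROOT(d))
			break;
		parent = dget_parent(d);
		dput(d);
		d = parent;
	}
	dput(d);
	return found;
}

The mark lookup only runs on ancestors that are actually mountpoints, so
for the vast majority of ancestors the per-ancestor work is one flags test
and one parent hop.
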
> > I have a pretty good bet on the answer, but as you say only actual
> > performance benchmarks can tell.
> >
> > From my experience, real life fsevent listeners do not listen on FAN_MODIFY
> > but they listen on FAN_CLOSE_WRITE, because the former is too noisy.
>
> Agreed.
>
> > The best case scenario is that users step forward to say that they want to
> > use fanotify but need the subtree filtering, and can provide us with real
> > life performance measurements of the userspace vs. kernel alternatives (in
> > terms of the penalty on the rest of the system).
>
> With the cost of having to go to userspace and there do essentially the same
> subtree walk as you do in the kernel, I have no doubt what's going to be
> faster (by orders of magnitude). What I'm somewhat uncertain about is whether
> the subtree check is OK at the time of event generation or whether it should
> better be moved to the moment when we are about to report the event to
> userspace (when the cost of the subtree check goes to the process interested
> in the event, which is fine - but as you properly note, we already had to pay
> the cost of queueing the event, so it isn't clear this is a win even for the
> processes generating events).
>

I think the only semantics that make sense are:

* If all of the object's present and past ancestors were always under the
  subtree - guarantee to get an event
* If all of the object's present and past ancestors were never under the
  subtree - guarantee to not get an event
* Otherwise - undefined

So I think it is ok to check for the subtree at event generation time.

Thanks,
Amir.