All right, I thought a bit more about that and returned to your original BPF idea you mentioned on 2020-08-28:
> I was thinking that we could add a BPF hook to fanotify_handle_event() (similar to what's happening in packet filtering code) and you could attach BPF programs to this hook to do filtering of events. That way we don't have to introduce new group flags for various filtering options. The question is whether eBPF is strong enough so that filters useful for fanotify users could be implemented with it but this particular check seems implementable.
>
> Honza

Instead of changing fanotify's filesystem notification functionality, I suggest that we rather **add a tracing mode (fantrace)**.

Synchronous handling of syscalls via ptrace is of course required for debugging purposes, but it introduces a major slowdown (even with seccomp-bpf filters). There are a number of cases, including [1-3], where asynchronous processing of the file events of specific tasks would be fine but is not readily available in Linux. Fanotify already ships important infrastructure in this regard: it provides very fast event buffering and, by using file descriptors instead of resolved paths, a clean and race-free API to process the events later. However, as already stated, fanotify does not provide a clean way to monitor only a subset of tasks.

Therefore please consider the following proposed architecture of fantrace:

Each task gets its own struct fsnotify_group. Within fsnotify.c:fsnotify() it is checked whether the given task has an fsnotify_group attached, into which events of interest are buffered as usual. Note that this is an additional hook: sysadmins subscribed to filesystem events rather than task-filesystem events are notified as usual, so in that case two hooks may run. The fsnotify_group is extended by a field optionally pointing to a BPF program, which allows custom filters to be run.

Some implementation details:
- To let the tracee return quickly, run the BPF filter program in tracer context during read(fan_fd), but before events are copied to userspace.
- Only one fantracer per task, which overrides an existing one, if any.
- The task->fsnotify_group refcount is incremented on fork and decremented on exit (run after exit_files(tsk) so as not to miss final close events). When the last task has exited, send EOF to the listener.
- On exec of setuid programs the fsnotify_group is cleared (like in ptrace).
- Lazy check, when an event occurs, whether the listener is still alive (refcount > 1).
- For the beginning, to keep things simple and to "solve" the cleanup of filesystem marks, I suggest disabling i_fsnotify_marks for fantrace (only allow FAN_MARK_FILESYSTEM), as that functionality can be implemented within the user-provided BPF program.

A working implementation of this concept, which effectively does the same using hardcoded filter rules, can be found in my kernel module shournalk [4]. For instance, in kernel/event_handler.c:event_handler_fput() it is checked, using a hashtable, whether the task is observed, and if so, the event is stored in a buffer corresponding to that process tree.

Thanks
Tycho

[1] Chirigati F, Rampin R, Shasha D, Freire J. (2016). ReproZip: Computational Reproducibility with Ease. Paper presented at the Proceedings of the 2016 International Conference on Management of Data. San Francisco, CA: New York: Association for Computing Machinery. https://github.com/VIDA-NYU/reprozip
[2] Guo, P. (2012). CDE: A Tool For Creating Portable Experimental Software Packages. Computing in Science & Engineering 14, 332–35
[3] Tycho Kirchner, Konstantin Riege, Steve Hoffmann (2020). Bashing irreproducibility with shournal. bioRxiv 2020.08.03.232843; doi: https://doi.org/10.1101/2020.08.03.232843
[4] https://github.com/tycho-kirchner/shournal
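
Addendum: to make the lifecycle from the implementation details above a bit more concrete, here is a minimal userspace model in C. It is only a sketch and not kernel code: every name in it (fantrace_group, model_task, task_attach, EV_OPEN, ...) is hypothetical, a plain function pointer stands in for the BPF program, and a single task counter stands in for the refcount, so the lazy listener-liveness check is not modelled. It shows events being queued unfiltered on the tracee side, the filter running on the tracer side during the read, and EOF being signalled once the last traced task is gone.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 16
enum { EV_OPEN = 1, EV_CLOSE = 2 };      /* made-up event bits */

struct fs_event { int pid; int mask; };

/* Stand-in for the optional BPF program: return true to keep the event. */
typedef bool (*event_filter)(const struct fs_event *ev);

struct fantrace_group {
    int tasks;                           /* traced tasks holding a reference */
    event_filter filter;                 /* NULL means "keep everything" */
    struct fs_event queue[MAX_EVENTS];   /* ring buffer of pending events */
    size_t head, count;
    bool eof;                            /* set when the last task is gone */
};

struct model_task { int pid; struct fantrace_group *group; };

static void group_init(struct fantrace_group *g, event_filter filter)
{
    g->tasks = 0; g->filter = filter;
    g->head = 0; g->count = 0; g->eof = false;
}

/* Models task exit and the clearing on exec of a setuid program. */
static void task_detach(struct model_task *t)
{
    struct fantrace_group *g = t->group;
    if (!g)
        return;
    assert(g->tasks > 0);
    t->group = NULL;
    if (--g->tasks == 0)
        g->eof = true;                   /* last task exited: EOF for listener */
}

/* Only one tracer per task: attaching overrides an existing group. */
static void task_attach(struct model_task *t, struct fantrace_group *g)
{
    task_detach(t);
    t->group = g;
    g->tasks++;
    g->eof = false;
}

/* The child inherits the parent's group and takes its own reference. */
static void task_fork(const struct model_task *parent,
                      struct model_task *child, int pid)
{
    child->pid = pid;
    child->group = parent->group;
    if (child->group)
        child->group->tasks++;
}

/* Tracee side: enqueue without filtering so the tracee returns quickly. */
static void task_notify(const struct model_task *t, int mask)
{
    struct fantrace_group *g = t->group;
    if (!g || g->count == MAX_EVENTS)
        return;                          /* untraced task or overflow: drop */
    struct fs_event *slot = &g->queue[(g->head + g->count) % MAX_EVENTS];
    slot->pid = t->pid;
    slot->mask = mask;
    g->count++;
}

/* Tracer side: run the filter during the read, before copying out. */
static size_t group_read(struct fantrace_group *g,
                         struct fs_event *out, size_t max)
{
    size_t n = 0;
    while (g->count > 0 && n < max) {
        struct fs_event ev = g->queue[g->head];
        g->head = (g->head + 1) % MAX_EVENTS;
        g->count--;
        if (!g->filter || g->filter(&ev))
            out[n++] = ev;
    }
    return n;
}

static bool group_at_eof(const struct fantrace_group *g)
{
    return g->eof && g->count == 0;      /* EOF only after the queue drains */
}

/* Example filter: keep close events only. */
static bool filter_close_only(const struct fs_event *ev)
{
    return (ev->mask & EV_CLOSE) != 0;
}
```

In this model a tracer would call task_attach() once on the root task, let task_fork() propagate the group to children, and then loop on group_read() until group_at_eof() returns true. Deferring the filter to the read keeps the per-event cost on the tracee side to a single enqueue.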