I plan to start sending patches for fanotify in the next week or two. I'd like to see more comments on the design, interface, and capabilities in case there is a recognized need for major rework or I'm not meeting some users' needs (other than those noted at the end).

git://git.infradead.org/users/eparis/notify.git fanotify-experimental

should have working code to test what I'm talking about.

What is fanotify? It is a new notification system with a limited set of events (open, close, read, write) in which a notification not only comes with metadata that describes what happened, it also comes with an open file descriptor to the object in question. fanotify will also allow the listener to make access decisions on open and read events. This allows the implementation of hierarchical storage management systems, on-access file scanning, or integrity checking.

fanotify comes in two flavors, 'directed' and 'global.' 'Directed' is like inotify or dnotify in that you register specific inodes of interest and only get events pertaining to those inodes. 'Global' means you are registering interest in event types system wide. In global mode the listener program can later exclude objects from future events.

fanotify kernel/userspace interaction is over a new socket protocol. A listener opens a new socket in the new PF_FANOTIFY family. The socket is then bound to an address using the following struct:

	struct fanotify_addr {
		sa_family_t family;
		__u32 priority;
		__u32 group_num;
		__u32 mask;
		__u32 f_flags;
		__u32 unused[16];
	} __attribute__((packed));

The priority field indicates the order in which fanotify listeners will get events. Since 2 fanotify listeners would 'hear' each other's events on the new fds they create, fanotify listeners will not hear events generated by other fanotify listeners with a lower priority number.

The group_num is not used at the moment, but the plan was to allow 2 processes to bind to the same fanotify group and share the load of processing events.
The f_flags field holds the flags which the fanotify listener wishes to use when opening its notification fds. On-access scanners would want to use O_RDONLY, whereas HSM systems would need to use O_WRONLY.

The mask indicates the events this group is interested in. It is the set of events of interest if FAN_GLOBAL_LISTENER is set at bind time. If FAN_GLOBAL_LISTENER is not set, this field is meaningless, as the registration of events on individual inodes will dictate the reception of events.

* FAN_ACCESS: every file access.
* FAN_MODIFY: file modifications.
* FAN_CLOSE: files are closed.
* FAN_OPEN: open() calls.
* FAN_ACCESS_PERM: like FAN_ACCESS, except that the process trying to access the file is put on hold while the fanotify client decides whether to allow the operation.
* FAN_OPEN_PERM: like FAN_OPEN, but with the permission check.
* FAN_EVENT_ON_CHILD: receive notification of events on inodes inside this subdirectory. (This is not a full recursive notification of all descendants, only direct children.)
* FAN_GLOBAL_LISTENER: notify for events on all files in the system.
* FAN_SURVIVE_MODIFY: special flag indicating that ignores should survive inode modification. Discussed below.

After the socket is bound, events are obtained using the read() syscall (recv* probably also works, I haven't tested). This will result in the buffer being filled with one or more events like this:

	struct fanotify_event_metadata {
		__u32 event_len;
		__s32 fd;
		__u32 mask;
		__u32 f_flags;
		__s32 pid;
		__s32 tgid;
		__u64 cookie;
	} __attribute__((packed));

fd specifies the new file descriptor that was created in the context of the listener (readlink on the matching /proc/self/fd entry will give you A pathname). mask indicates the event type (bitwise OR of the event types listed above). f_flags here is the f_flags the ORIGINAL process has the file open with. pid and tgid are from the original process. cookie is used when the listener needs to allow, deny, or delay the operation.
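A minimal sketch of how a listener might walk the buffer returned by read(), given the event layout above. It assumes that event_len is the byte distance from one event to the next, which the description implies but does not state outright. The helper fan_for_each_event() and the callback type are hypothetical names, and fixed-width stdint types stand in for the kernel's __u32/__s32/__u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field order as the proposed struct, with stdint types standing in
 * for the kernel's __u32/__s32/__u64. */
struct fanotify_event_metadata {
	uint32_t event_len;
	int32_t  fd;
	uint32_t mask;
	uint32_t f_flags;
	int32_t  pid;
	int32_t  tgid;
	uint64_t cookie;
} __attribute__((packed));

/* Hypothetical callback type; not part of the proposed interface. */
typedef void (*fan_event_cb)(const struct fanotify_event_metadata *ev,
			     void *ctx);

/*
 * Walk a buffer filled by read() and invoke cb once per complete event.
 * Returns the number of events visited.
 * ASSUMPTION: event_len is the offset from this event to the next one.
 */
static size_t fan_for_each_event(const void *buf, size_t len,
				 fan_event_cb cb, void *ctx)
{
	const unsigned char *p = buf;
	size_t off = 0, count = 0;

	while (len - off >= sizeof(struct fanotify_event_metadata)) {
		struct fanotify_event_metadata ev;

		/* Copy out so we never dereference an unaligned pointer. */
		memcpy(&ev, p + off, sizeof(ev));

		/* Stop on a truncated or nonsensical length. */
		if (ev.event_len < sizeof(ev) || ev.event_len > len - off)
			break;

		if (cb)
			cb(&ev, ctx);
		off += ev.event_len;
		count++;
	}
	return count;
}
```

A listener would call this on whatever read() returned from the bound socket and, inside the callback, use ev->fd to inspect the object and close it when done. Checking event_len against the remaining bytes keeps a short read from walking off the end of the buffer.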
If a FAN_ACCESS_PERM or FAN_OPEN_PERM event is received, the listener must send a response before the 5 second timeout. If no response is sent before the 5 second timeout, the original operation is allowed. If this happens too many times (10 in a row), the fanotify group is evicted from the kernel and will not get any new events.

Sending a response is done using the setsockopt() call with the socket option set to FANOTIFY_ACCESS_RESPONSE. The buffer should contain a structure like:

	struct fanotify_so_access {
		__u64 cookie;
		__u32 response;
	} __attribute__((packed));

where cookie is the cookie from the notification and response is one of:

	FAN_ALLOW: allow the original operation
	FAN_DENY: deny the original operation
	FAN_RESET_TIMEOUT: reset the timeout

The last main interface is the 'marking' of inodes. The purpose of inode marks differs between 'directed' and 'global' listeners. Directed fanotify listeners need to mark inodes of interest. They do that with setsockopt() as well, using the FANOTIFY_SET_MARK option with the buffer containing a structure like:

	struct fanotify_so_inode_mark {
		__s32 fd;
		__u32 mask;
		__u32 ignored_mask;
	} __attribute__((packed));

where fd is backed by the inode in question, mask is the set of events of interest (only used in directed mode), and ignored_mask is the mask of events which should be ignored. The ignored_mask is cleared every time an inode receives a modification event unless FAN_SURVIVE_MODIFY is also set.

The ignored_mask is mainly used for 2 purposes. Global listeners may simply have no interest in lots of events, so they should spam inodes with an ignored mask. The ignored mask is also used to 'cache' access decisions. If the listener sets FAN_ACCESS_PERM in the ignored mask, all access operations will be permitted without the call out to userspace. If the inode is modified, the ignored_mask will be cleared and userspace will again have to approve the access.
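As a rough illustration of the two setsockopt() payloads described above, the sketch below builds a permission response and a mark that caches an "allow" decision through ignored_mask. The numeric values of the FAN_* constants are placeholders invented for this example (the proposal names the constants but does not give their values), and the helper names are hypothetical. The actual setsockopt() call is left out because the socket level to pass is not stated here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* PLACEHOLDER values, assumed for illustration only. The real values
 * live in the fanotify-experimental tree. */
#define FAN_ACCESS_PERM     0x00020000u
#define FAN_SURVIVE_MODIFY  0x40000000u
#define FAN_ALLOW           0x01u
#define FAN_DENY            0x02u

/* Same field order as the proposed structs, using stdint types. */
struct fanotify_so_access {
	uint64_t cookie;
	uint32_t response;
} __attribute__((packed));

struct fanotify_so_inode_mark {
	int32_t  fd;
	uint32_t mask;
	uint32_t ignored_mask;
} __attribute__((packed));

/* Build the reply for one permission event, echoing back its cookie. */
static struct fanotify_so_access fan_make_response(uint64_t cookie, int allow)
{
	struct fanotify_so_access r;

	memset(&r, 0, sizeof(r));
	r.cookie = cookie;
	r.response = allow ? FAN_ALLOW : FAN_DENY;
	return r;
}

/*
 * Build a mark that caches an "allow" decision for access events on the
 * inode behind fd. With survive_modify set, the ignore is meant to persist
 * even after the inode is modified.
 */
static struct fanotify_so_inode_mark
fan_make_access_cache_mark(int32_t fd, int survive_modify)
{
	struct fanotify_so_inode_mark m;

	memset(&m, 0, sizeof(m));
	m.fd = fd;
	m.mask = 0;	/* mask is only used by directed listeners */
	m.ignored_mask = FAN_ACCESS_PERM;
	if (survive_modify)
		m.ignored_mask |= FAN_SURVIVE_MODIFY;
	return m;
}
```

A scanner that has just approved an access could send the response built from the event's cookie with FANOTIFY_ACCESS_RESPONSE, then send the mark with FANOTIFY_SET_MARK so later accesses to the same unmodified inode skip the round trip to userspace.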
If userspace REALLY doesn't ever care, it can use the special FAN_SURVIVE_MODIFY flag inside the ignored_mask.

The only other current interface is the ability to ignore events by superblock magic number. This makes it easy to ignore all events in /proc, which can be difficult to accomplish by firing FANOTIFY_SET_MARK with ignored_masks over and over as processes are created and destroyed.

***********

Future direction:

There are 2 things I'm interested in adding.

- Rename events. The updatedb/mlocate people are interested in fanotify as a means to not thrash the hard drive every night. They could instead update the db in real time as files are moved.

- Subtree notification. Currently, to watch only /home and all of its descendants, one must either register a directed watch on every directory or use a global listener. The global listener with ignored_mask is not as bad as it sounds in my testing, but decent subtree registration and notification would be a big win in a lot of people's minds.

***********

Please, complaints? shortcomings? design flaws? issues? failures? How can it be tweaked to suit your needs?

-Eric