[cc: linux-api]

On Wed, Dec 11, 2019 at 3:58 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Wed, Dec 11, 2019 at 12:06 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Wed 04-12-19 22:27:31, Amir Goldstein wrote:
> [...]
> > > The way to frame this correctly IMO is that fsnotify events let
> > > application know that "something has changed", without any ordering
> > > guaranty beyond "sometime before the event was read".
> > >
> > > So far, that "something" can be a file (by fd), an inode (by fid),
> > > more specifically a directory inode (by fid) where in an entry has
> > > changed.
> > >
> > > Adding filename info extends that concept to "something has changed
> > > in the namespace at" (by parent fid+name).
> > > All it means is that application should pay attention to that part of
> > > the namespace and perform a lookup to find out what has changed.
> > >
> > > Maybe the way to mitigate wrong assumptions about ordering and
> > > existence of the filename in the namespace is to omit the event type
> > > for "filename events", for example: { FAN_CHANGE, pfid, name }.
> >
> > So this event would effectively mean: In directory pfid, some filename
> > event has happened with name "name" - i.e. "name" was created (could
> > mean also mkdir), deleted, moved. Am I right?
>
> Exactly.
>
> > And the application would then
> > open_by_handle(2) + open_at(2) + fstat(2) the object pointed to by
>
> open_by_handle(2) + fstatat(2) to be exact.
>
> > (pfid, name) pair and copy whatever it finds to the other end (or
> > delete on the other end in case of ENOENT)?
>
> Basically, yes.
> Although a modern sync tool may also keep some local map of
> remote name -> local fid, to detect a local rename and try to perform a
> remote rename.
> >
> > After some thought, yes, I think this is difficult to misuse (or infer
> > some false guarantees out of it). As far as I was thinking it also
> > seems good enough to implement more efficient syncing of directories.
>
> Great, so I will work on the patches.
>

Hi Jan,

I have something working.

Patches:
https://github.com/amir73il/linux/commits/fanotify_name
Simple test:
https://github.com/amir73il/ltp/commits/fanotify_name

I will post the patches after I have a working demo, but in the meantime
here is the gist of the API from the commit log, in case you or anyone
has comments on the API.

Note that in the new event flavor, the event mask is given as input
(e.g. FAN_CREATE) to filter the type of reported events, but the event
types are hidden when the event is reported.
Besides the dirent event types, events "on child" (i.e. MODIFY) can also
be reported with name to a directory watcher.

For now, "on child" events cannot be requested for filesystem/mount
watch, but I think we should consider this possibility so I added a
check to return EINVAL if this combination is attempted.

Let me know what you think.

Thanks,
Amir.

commit 91e0af27ac329f279167e74761fb5303ebbc1c08
Author: Amir Goldstein <amir73il@xxxxxxxxx>
Date:   Mon Dec 16 08:39:21 2019 +0200

    fanotify: report name info with FAN_REPORT_FID_NAME

    With init flags FAN_REPORT_FID_NAME, report events with name in
    variable length fanotify_event_info record similar to how fid's are
    reported.  When events are reported with name, the reported fid
    identifies the directory and the name follows the fid.  The info
    record type for this event info is FAN_EVENT_INFO_TYPE_FID_NAME.

    There are several ways that an application can use this information:

    1. When watching a single directory, the name is always relative to
       the watched directory, so the application needs to fstatat(2) the
       name relative to the watched directory.

    2. When watching a set of directories, the application could keep a
       map of dirfd for all watched directories and hash the map by fid
       obtained with name_to_handle_at(2).  When getting a name event,
       the fid in the event info could be used to lookup the base dirfd
       in the map and then call fstatat(2) with that dirfd.

    3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
       directories, the application could use open_by_handle_at(2) with
       the fid in event info to obtain dirfd for the directory where
       event happened and call fstatat(2) with this dirfd.

    The last option scales better for a large number of watched
    directories.  The first two options may also become available to
    unprivileged fanotify watchers in the future, because they do not
    rely on open_by_handle_at(2), which requires the CAP_DAC_READ_SEARCH
    capability.

    Legacy inotify events are reported with name and event mask (e.g.
    "foo", FAN_CREATE | FAN_ONDIR).  That can lead users to the
    conclusion that there is *currently* an entry "foo" that is a
    sub-directory, when in fact "foo" may be negative or non-dir by the
    time user gets the event.

    To make it clear that the current state of the named entry is
    unknown, the new fanotify event intentionally hides this information
    and reports only the flag FAN_WITH_NAME in event mask.  This should
    make it harder for users to make wrong assumptions and write buggy
    applications.

    We reserve the combination of FAN_EVENT_ON_CHILD on a
    filesystem/mount mark and FAN_REPORT_NAME group for future use, so
    for now this combination is invalid.

    Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>

commit 76a509dbc06fd58ec6636484f87896044cd99022
Author: Amir Goldstein <amir73il@xxxxxxxxx>
Date:   Fri Dec 13 11:58:02 2019 +0200

    fanotify: implement basic FAN_REPORT_FID_NAME logic

    Dirent events will be reported in one of two flavors depending on
    fanotify init flags:

    1. Dir fid info + mask that includes the specific event types and
       optional FAN_ONDIR flag.
    2. Dir fid info + name + mask that includes only FAN_WITH_NAME flag.

    To request the second event flavor, user will need to set the
    FAN_REPORT_FID_NAME flags in fanotify_init().

    The first flavor is already supported since kernel v5.1 and is
    intended to be used for watching directories in "batch mode" - user
    is notified when directory is changed and re-scans the directory
    content in response.  This event flavor is stored more compactly in
    event queue, so it is optimal for workloads with frequent directory
    changes (e.g. many files created/deleted).

    The second event flavor is intended to be used for watching large
    directories, where the cost of re-scan of the directory on every
    change is considered too high.  The watcher getting the event with
    the directory fid and entry name is expected to call fstatat(2) to
    query the content of the entry after the change.

    Events "on child" will behave similarly to dirent events, with a
    small difference - the first event flavor without name reports the
    child fid.  The second flavor with name info reports the parent fid,
    because the name is relative to the parent directory.

    At the moment, event name info reporting is not implemented, so the
    FAN_REPORT_NAME flag is not yet valid as input to fanotify_init().

    Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
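
For illustration only, here is a rough sketch of the watcher-side step discussed in this thread: on a name event, look the name up relative to the directory with fstatat(2) and treat ENOENT as "delete on the other end". It does not use the proposed fanotify API at all. The names entry_state, classify_entry() and classify_entry_at_path() are hypothetical and exist only for this example, and opening the directory by path is an assumed stand-in for a dirfd that a real watcher would take from its map of watched directories or from open_by_handle_at(2).

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result type: what the watcher found when it looked up the name. */
enum entry_state {
    ENTRY_GONE,     /* name does not exist now: delete on the other end */
    ENTRY_DIR,      /* name is currently a directory */
    ENTRY_NON_DIR,  /* name is currently a regular file, symlink, etc. */
    ENTRY_ERROR     /* lookup failed for some other reason, see errno */
};

/*
 * The event only says "something changed at (dirfd, name)", so the current
 * state has to be queried.  AT_SYMLINK_NOFOLLOW makes the result describe
 * the directory entry itself rather than a symlink target.
 */
enum entry_state classify_entry(int dirfd, const char *name)
{
    struct stat st;

    assert(name != NULL);

    if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) == -1)
        return errno == ENOENT ? ENTRY_GONE : ENTRY_ERROR;

    return S_ISDIR(st.st_mode) ? ENTRY_DIR : ENTRY_NON_DIR;
}

/*
 * Assumed stand-in for obtaining the dirfd: open the directory by path.
 * If the directory itself has vanished, the entry is reported as gone.
 */
enum entry_state classify_entry_at_path(const char *dirpath, const char *name)
{
    enum entry_state state;
    int saved_errno;
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);

    if (dirfd == -1)
        return errno == ENOENT ? ENTRY_GONE : ENTRY_ERROR;

    state = classify_entry(dirfd, name);
    saved_errno = errno;    /* keep the lookup's errno across close() */
    close(dirfd);
    errno = saved_errno;
    return state;
}
```

A sync tool would call classify_entry() once per name event and then copy, create or delete the entry on the other end according to the result. Because the lookup happens after the event was read, the answer reflects the state at lookup time, which is exactly the "no ordering guarantee" framing above.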