On Sun, Jan 16, 2022 at 3:20 AM Ivan Delalande <colona@xxxxxxxxxx> wrote: > > On Sat, Jan 15, 2022 at 09:50:20PM +0200, Amir Goldstein wrote: > > On Sat, Jan 15, 2022 at 5:11 AM Ivan Delalande <colona@xxxxxxxxxx> wrote: > >> Sorry to bring this up so late but we might have found a regression > >> introduced by your "Sort out fsnotify_nameremove() mess" patch series > >> merged in 5.3 (116b9731ad76..7377f5bec133), and that can still be > >> reproduced on v5.16. > >> > >> Some of our processes use inotify to watch for IN_DELETE events (for > >> files on tmpfs mostly), and relied on the fact that once such events are > >> received, the files they refer to have actually been unlinked and can't > >> be open/read. So if and once open() succeeds then it is a new version of > >> the file that has been recreated with new content. > >> > >> This was true and working reliably before 5.3, but changed after > >> 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of > >> d_delete()") specifically. There is now a time window where a process > >> receiving one of those IN_DELETE events may still be able to open the > >> file and read its old content before it's really unlinked from the FS. > > > > This is a bit surprising to me. > > Do you have a reproducer? > > Yeah, I was using the following one to bisect this. It will print a > message every time it succeeds to read the file after receiving a > IN_DELETE event when run with something like `mkdir /tmp/foo; > ./indelchurn /tmp/foo`. It seems to hit pretty frequently and reliably > on various systems after 5.3, even for different #define-parameters. > I see yes, it's a race between fsnotify_unlink() and d_delete() fsnotify_unlink() in explicitly required to be called before d_delete(), so it has the d_inode information and that leaves a windows for opening the file from cached dentry before d_delete(). I would rather that we try to address this not as a regression until there is proof of more users that expect the behavior you mentioned. I would like to provide you an API to opt-in for this behavior, because fixing it for everyone may cause other workloads to break. Please test the attached patch on top of v5.16 and use IN_DELETE|IN_EXCL_UNLINK as the watch mask for testing. I am assuming that it would be possible for you to modify the application and add the IN_EXCL_UNLINK flag and that your application does not care about getting IN_OPEN events on unlinked files? My patch overloads the existing flag IN_EXCL_UNLINK with a new meaning. It's a bit of a hack and we can use some other flag if we need to but it actually makes some sense that an application that does not care for events on d_unlinked() files will be guaranteed to not get those events after getting an IN_DELETE event. It is another form of the race that you described. Will that solution work out for you? > >> I'm not very familiar with the VFS and fsnotify internals, would you > >> consider this a regression, or was there never any intentional guarantee > >> for that behavior and it's best we work around this change in userspace? > > > > It may be a regression if applications depend on behavior that changed, > > but if are open for changes in your application please describe in more details > > what it tries to accomplish using IN_DELETE and the ekernel your application > > is running on and then I may be able to recommend a more reliable method. > > I can discuss it with our team and get more details on this but it may > be pretty complicated and costly to change. My understanding is that > these watched files are used as ID/version references for in-memory > objects shared between multiple processes to synchronize state, and > resynchronize when there are crashes or restarts. So they can be > recreated in place with the same or a different content, and so their > inode number or mtime etc. may not be useable as additonnal check. > > I think we can also have a very large number of these objects and files > on some configurations, so waiting to see if we have a following > IN_CREATE, or adding more checks/synchronization logic will probably > have a significant impact at scale. > > And we're targeting 5.10 and building it with our own config. > I am not convinced that this is needed as a regression fix for stable kernels, but since you are building your own kernel I can help you do the backport - attached *untested* patch for v5.10 that also backport of "fsnotify: pass dentry instead of inode data". Thanks, Amir.
From 73c137d6a4cab445dec76c731b2c763d45af068e Mon Sep 17 00:00:00 2001 From: Amir Goldstein <amir73il@xxxxxxxxx> Date: Thu, 4 Feb 2021 09:11:11 +0200 Subject: [PATCH] inotify: optionally invalidate dcache before IN_DELETE event Define a new data type to pass for event - FSNOTIFY_EVENT_DENTRY. Use it to pass the dentry instead of d_inode, to allow d_drop() before IN_DELETE event. Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> --- fs/notify/fsnotify.c | 2 ++ include/linux/fsnotify.h | 4 ++-- include/linux/fsnotify_backend.h | 16 ++++++++++++++++ 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index 8d3ad5ef2925..55fb96c44449 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -260,6 +260,8 @@ static int fsnotify_handle_event(struct fsnotify_group *group, __u32 mask, child_mark = NULL; dir = NULL; name = NULL; + } else if (inode_mark->mask & FS_EXCL_UNLINK && mask & FS_DELETE) { + d_drop(fsnotify_data_dentry(data, data_type)); } ret = ops->handle_inode_event(inode_mark, mask, inode, dir, name); diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h index f8acddcf54fb..aa35f261cc3d 100644 --- a/include/linux/fsnotify.h +++ b/include/linux/fsnotify.h @@ -36,7 +36,7 @@ static inline void fsnotify_name(struct inode *dir, __u32 mask, static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry, __u32 mask) { - fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0); + fsnotify(mask, dentry, FSNOTIFY_EVENT_DENTRY, dir, &dentry->d_name, NULL, 0); } static inline void fsnotify_inode(struct inode *inode, __u32 mask) @@ -77,7 +77,7 @@ static inline int fsnotify_parent(struct dentry *dentry, __u32 mask, */ static inline void fsnotify_dentry(struct dentry *dentry, __u32 mask) { - fsnotify_parent(dentry, mask, d_inode(dentry), FSNOTIFY_EVENT_INODE); + fsnotify_parent(dentry, mask, dentry, FSNOTIFY_EVENT_DENTRY); } static inline int fsnotify_file(struct file *file, __u32 mask) diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index f8529a3a2923..c67d0d74254e 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -250,6 +250,7 @@ enum fsnotify_data_type { FSNOTIFY_EVENT_NONE, FSNOTIFY_EVENT_PATH, FSNOTIFY_EVENT_INODE, + FSNOTIFY_EVENT_DENTRY, }; static inline struct inode *fsnotify_data_inode(const void *data, int data_type) @@ -257,6 +258,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type) switch (data_type) { case FSNOTIFY_EVENT_INODE: return (struct inode *)data; + case FSNOTIFY_EVENT_DENTRY: + return d_inode(data); case FSNOTIFY_EVENT_PATH: return d_inode(((const struct path *)data)->dentry); default: @@ -264,6 +267,19 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type) } } +static inline struct dentry *fsnotify_data_dentry(const void *data, int data_type) +{ + switch (data_type) { + case FSNOTIFY_EVENT_DENTRY: + /* Non const is needed for dget() */ + return (struct dentry *)data; + case FSNOTIFY_EVENT_PATH: + return ((const struct path *)data)->dentry; + default: + return NULL; + } +} + static inline const struct path *fsnotify_data_path(const void *data, int data_type) { -- 2.16.5
From d4d91d5b46c13def48b36bb7734b5fca8e1c2127 Mon Sep 17 00:00:00 2001 From: Amir Goldstein <amir73il@xxxxxxxxx> Date: Sun, 16 Jan 2022 10:11:45 +0200 Subject: [PATCH] inotify: optionally invalidate dcache before IN_DELETE event Apparently, there are some applications that use IN_DELETE event as an invalidation mechanism and expect that if they try to open a file with the name reported with the delete event, that it should not contain the content of the deleted file. Commit 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()") moved the fsnotify delete hook before d_delete() intentionally, so fsnotify will have access to a positive dentry. This allowed a race where opening the deleted file via cached dentry is now possible after receiving the IN_DELETE event. In order to provide the "invalidate" functionality without changing behavior for other workload, overload the original meaning of the IN_EXCL_UNLINK flag with the additional meaning that the file should not be accessed via that name after receiving an IN_DELETE event and drop dentry from cache before IN_DELETE is reported. This will allow applications to opt-in for the "invalidate" behavior. Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/ Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> --- fs/notify/fsnotify.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index 4034ca566f95..86337220f25c 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -255,9 +255,21 @@ static int fsnotify_handle_inode_event(struct fsnotify_group *group, if (WARN_ON_ONCE(!inode && !dir)) return 0; - if ((inode_mark->mask & FS_EXCL_UNLINK) && - path && d_unlinked(path->dentry)) - return 0; + if (inode_mark->mask & FS_EXCL_UNLINK) { + if (path && d_unlinked(path->dentry)) + return 0; + + /* + * File may be unlinked in filesystem, but still accesible via + * dcache, so user may be able to observe and open the file + * after receiving an IN_DELETE event. + * Overload the original meaning of the IN_EXCL_UNLINK flag + * with the additional meaning that the file cannot be accessed + * via that name after receiving an IN_DELETE event. + */ + if (mask & FS_DELETE) + d_drop(fsnotify_data_dentry(data, data_type)); + } /* Check interest of this mark in case event was sent with two marks */ if (!(mask & inode_mark->mask & ALL_FSNOTIFY_EVENTS)) -- 2.34.1