On Fri, Jul 3, 2020 at 5:03 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Fri 12-06-20 12:33:24, Amir Goldstein wrote:
> > From: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> >
> > The fsnotify paths are trivial to hit even when there are no watchers and
> > they are surprisingly expensive. For example, every successful vfs_write()
> > hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
> > FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
> > As it stands, fsnotify_parent is a guaranteed functional call even if there
> > are no watchers and fsnotify() does a substantial amount of unnecessary
> > work before it checks if there are any watchers. A perf profile showed
> > that applying mnt->mnt_fsnotify_mask in fsnotify() was almost half of the
> > total samples taken in that function during a test. This patch rearranges
> > the fast paths to reduce the amount of work done when there are no
> > watchers.
> >
> > The test motivating this was "perf bench sched messaging --pipe". Despite
> > the fact the pipes are anonymous, fsnotify is still called a lot and
> > the overhead is noticeable even though it's completely pointless. It's
> > likely the overhead is negligible for real IO so this is an extreme
> > example. This is a comparison of hackbench using processes and pipes on
> > a 1-socket machine with 8 CPU threads without fanotify watchers.
> >
> >                             5.7.0                  5.7.0
> >                           vanilla      fastfsnotify-v1r1
> > Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*
> > Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)
> > Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)
> > Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)
> > Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)
> > Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)
> > Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*
> > Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)
> > Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)
> >
> >                        5.7.0       5.7.0
> >                      vanilla     fastfsnotify-v1r1
> > Duration User         157.05      152.79
> > Duration System      1279.98     1219.32
> > Duration Elapsed      182.81      174.52
> >
> > This is showing that the latencies are improved by roughly 2-9%. The
> > variability is not shown but some of these results are within the noise
> > as this workload heavily overloads the machine. That said, the system CPU
> > usage is reduced by quite a bit so it makes sense to avoid the overhead
> > even if it is a bit tricky to detect at times. A perf profile of just 1
> > group of tasks showed that 5.14% of samples taken were in either fsnotify()
> > or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
> > mostly function entry and the initial check for watchers. The check for
> > watchers is complicated enough that inlining it may be controversial.
> >
> > [Amir] Slightly simplify with mnt_or_sb_mask => marks_mask
> >
> > Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > ---
> >  fs/notify/fsnotify.c             | 27 +++++++++++++++------------
> >  include/linux/fsnotify.h         | 10 ++++++++++
> >  include/linux/fsnotify_backend.h |  4 ++--
> >  3 files changed, 27 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> > index 72d332ce8e12..d59a58d10b84 100644
> > --- a/fs/notify/fsnotify.c
> > +++ b/fs/notify/fsnotify.c
> > @@ -143,7 +143,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
> >  }
> >
> >  /* Notify this dentry's parent about a child's events. */
> > -int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> > +int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> >                      int data_type)
> >  {
> >         struct dentry *parent;
>
> Hum, should we actually remove the DCACHE_FSNOTIFY_PARENT_WATCHED check
> from here when it's moved to fsnotify_parent() inline helper?

No point. It is making a comeback in:
"fsnotify: send event with parent/name info to sb/mount/non-dir marks"

>
> > @@ -315,17 +315,11 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
> >         struct fsnotify_iter_info iter_info = {};
> >         struct super_block *sb = to_tell->i_sb;
> >         struct mount *mnt = NULL;
> > -       __u32 mnt_or_sb_mask = sb->s_fsnotify_mask;
> >         int ret = 0;
> > -       __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
> > +       __u32 test_mask, marks_mask;
> >
> > -       if (path) {
> > +       if (path)
> >                 mnt = real_mount(path->mnt);
> > -               mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> > -       }
> > -       /* An event "on child" is not intended for a mount/sb mark */
> > -       if (mask & FS_EVENT_ON_CHILD)
> > -               mnt_or_sb_mask = 0;
> >
> >         /*
> >          * Optimization: srcu_read_lock() has a memory barrier which can
> > @@ -337,13 +331,22 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
> >         if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
> >             (!mnt || !mnt->mnt_fsnotify_marks))
> >                 return 0;
> > +
> > +       /* An event "on child" is not intended for a mount/sb mark */
> > +       marks_mask = to_tell->i_fsnotify_mask;
> > +       if (!(mask & FS_EVENT_ON_CHILD)) {
> > +               marks_mask |= sb->s_fsnotify_mask;
> > +               if (mnt)
> > +                       marks_mask |= mnt->mnt_fsnotify_mask;
> > +       }
> > +
> >         /*
> >          * if this is a modify event we may need to clear the ignored masks
> >          * otherwise return if neither the inode nor the vfsmount/sb care about
> >          * this type of event.
> >          */
> > -       if (!(mask & FS_MODIFY) &&
> > -           !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)))
> > +       test_mask = (mask & ALL_FSNOTIFY_EVENTS);
> > +       if (!(mask & FS_MODIFY) && !(test_mask & marks_mask))
> >                 return 0;
>
> Otherwise the patch looks good. One observation though: The (mask &
> FS_MODIFY) check means that all vfs_write() calls end up going through the
> "slower" path iterating all mark types and checking whether there are marks
> anyway. That could be relatively simply optimized using a hidden mask flag
> like FS_ALWAYS_RECEIVE_MODIFY which would be set when there's some mark
> needing special handling of FS_MODIFY... Not sure if we care enough at this
> point...

Yeh, that sounds low-hanging.

Actually, I don't think we need to define a flag for that.
__fsnotify_recalc_mask() can add FS_MODIFY to the object's mask if needed.
I will take a look at that as part of the FS_PRE_MODIFY work.

But in general, we should fight the urge to optimize theoretical
performance issues...

Thanks,
Amir.
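For illustration, a minimal sketch of the __fsnotify_recalc_mask() idea above,
written against the 5.7-era fs/notify/mark.c. This is not the actual patch;
the surrounding helpers and field names (fsnotify_conn_mask_p(),
fsnotify_valid_obj_type(), FSNOTIFY_MARK_FLAG_ATTACHED, ignored_mask) are
taken from that tree and may differ in other kernel versions:

/*
 * Sketch only (not the actual change): fold FS_MODIFY into the recalculated
 * object mask whenever a mark carries an ignored mask, so that fsnotify()
 * would no longer need the unconditional (mask & FS_MODIFY) escape in its
 * fast path.
 */
static void __fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
{
	u32 new_mask = 0;
	struct fsnotify_mark *mark;

	assert_spin_locked(&conn->lock);
	/* We can get a detached connector here when the inode is being unlinked. */
	if (!fsnotify_valid_obj_type(conn->type))
		return;
	hlist_for_each_entry(mark, &conn->list, obj_list) {
		if (!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED))
			continue;
		new_mask |= mark->mask;
		/*
		 * A mark with an ignored mask still needs to see FS_MODIFY
		 * so that the ignored mask can be cleared on modification.
		 */
		if (mark->ignored_mask)
			new_mask |= FS_MODIFY;
	}
	*fsnotify_conn_mask_p(conn) = new_mask;
}

With something like that in place, the fast-path test in fsnotify() could in
principle be reduced to a single !(test_mask & marks_mask) check, since any
object that needs modify events for ignored-mask clearing would already have
FS_MODIFY folded into its recalculated mask.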