On Mon, Jun 8, 2020 at 5:05 PM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> The fsnotify paths are trivial to hit even when there are no watchers and
> they are surprisingly expensive. For example, every successful vfs_write()
> hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
> FMODE_NONOTIFY is set, which is an internal flag invisible to userspace.
> As it stands, fsnotify_parent is a guaranteed function call even if there
> are no watchers and fsnotify() does a substantial amount of unnecessary
> work before it checks if there are any watchers. A perf profile showed
> that applying mnt->mnt_fsnotify_mask in fsnotify() was almost half of the
> total samples taken in that function during a test. This patch rearranges
> the fast paths to reduce the amount of work done when there are no watchers.
>
> The test motivating this was "perf bench sched messaging --pipe". Despite
> the fact the pipes are anonymous, fsnotify is still called a lot and
> the overhead is noticeable even though it's completely pointless. It's
> likely the overhead is negligible for real IO so this is an extreme
> example. This is a comparison of hackbench using processes and pipes on
> a 1-socket machine with 8 CPU threads without fanotify watchers.
>
>                            5.7.0                  5.7.0
>                          vanilla      fastfsnotify-v1r1
> Amean     1     0.4837 (   0.00%)     0.4630 *   4.27%*
> Amean     3     1.5447 (   0.00%)     1.4557 (   5.76%)
> Amean     5     2.6037 (   0.00%)     2.4363 (   6.43%)
> Amean     7     3.5987 (   0.00%)     3.4757 (   3.42%)
> Amean     12    5.8267 (   0.00%)     5.6983 (   2.20%)
> Amean     18    8.4400 (   0.00%)     8.1327 (   3.64%)
> Amean     24   11.0187 (   0.00%)    10.0290 *   8.98%*
> Amean     30   13.1013 (   0.00%)    12.8510 (   1.91%)
> Amean     32   13.9190 (   0.00%)    13.2410 (   4.87%)
>
>                       5.7.0       5.7.0
>                     vanilla   fastfsnotify-v1r1
> Duration User        157.05      152.79
> Duration System     1279.98     1219.32
> Duration Elapsed     182.81      174.52
>
> This is showing that the latencies are improved by roughly 2-9%.
> The variability is not shown but some of these results are within the noise
> as this workload heavily overloads the machine. That said, the system CPU
> usage is reduced by quite a bit so it makes sense to avoid the overhead
> even if it is a bit tricky to detect at times. A perf profile of just 1
> group of tasks showed that 5.14% of samples taken were in either fsnotify()
> or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
> mostly function entry and the initial check for watchers. The check for
> watchers is complicated enough that inlining it may be controversial.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Hi Mel,

Thanks for looking into this.

> ---
>  fs/notify/fsnotify.c             | 25 +++++++++++++++----------
>  include/linux/fsnotify.h         | 10 ++++++++++
>  include/linux/fsnotify_backend.h |  4 ++--
>  3 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 72d332ce8e12..de7bbfd973c0 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -143,7 +143,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
>  }
>
>  /* Notify this dentry's parent about a child's events.
>  */
> -int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> +int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>                     int data_type)
>  {
>         struct dentry *parent;
> @@ -174,7 +174,7 @@ int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>
>         return ret;
>  }
> -EXPORT_SYMBOL_GPL(fsnotify_parent);
> +EXPORT_SYMBOL_GPL(__fsnotify_parent);
>
>  static int send_to_group(struct inode *to_tell,
>                          __u32 mask, const void *data,
> @@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
>         struct fsnotify_iter_info iter_info = {};
>         struct super_block *sb = to_tell->i_sb;
>         struct mount *mnt = NULL;
> -       __u32 mnt_or_sb_mask = sb->s_fsnotify_mask;
> +       __u32 mnt_or_sb_mask;
>         int ret = 0;
> -       __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
> +       __u32 test_mask;
>
> -       if (path) {
> +       if (path)
>                 mnt = real_mount(path->mnt);
> -               mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> -       }
> -       /* An event "on child" is not intended for a mount/sb mark */
> -       if (mask & FS_EVENT_ON_CHILD)
> -               mnt_or_sb_mask = 0;
>
>         /*
>          * Optimization: srcu_read_lock() has a memory barrier which can
> @@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
>         if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
>             (!mnt || !mnt->mnt_fsnotify_marks))
>                 return 0;
> +
> +       /* An event "on child" is not intended for a mount/sb mark */
> +       mnt_or_sb_mask = 0;
> +       if (!(mask & FS_EVENT_ON_CHILD)) {
> +               mnt_or_sb_mask = sb->s_fsnotify_mask;
> +               if (path)
> +                       mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> +       }
> +

Are you sure that loading ->_fsnotify_mask is so much more expensive than
only checking ->_fsnotify_marks? They are surely on the same cache line.

Isn't it possible that you just moved the penalty to the ->_fsnotify_marks
check with this change?

Did you test performance with just the fsnotify_parent() change alone?
In any case, this hunk seriously conflicts with a patch set I am working
on, so I would rather not merge this change as is. If you give me the
feedback that testing ->_fsnotify_marks before loading ->_fsnotify_mask
is beneficial on its own, then I will work this change into my series.

>         /*
>          * if this is a modify event we may need to clear the ignored masks
>          * otherwise return if neither the inode nor the vfsmount/sb care about
>          * this type of event.
>          */
> +       test_mask = (mask & ALL_FSNOTIFY_EVENTS);
>         if (!(mask & FS_MODIFY) &&
>             !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)))
>                 return 0;
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index 5ab28f6c7d26..508f6bb0b06b 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
>         fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0);
>  }
>
> +/* Notify this dentry's parent about a child's events. */
> +static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
> +                                 const void *data, int data_type)
> +{
> +       if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
> +               return 0;
> +
> +       return __fsnotify_parent(dentry, mask, data, data_type);
> +}
> +

This change looks good as is, but I'm afraid my series is going to make
it obsolete. It may very well be that my series will introduce more of a
performance penalty for your workload.

It would be very much appreciated if you could run a test for me. I will
gladly work some more optimizations into my series. You can find the
relevant part of my work at:

https://github.com/amir73il/linux/commits/fsnotify_name

What this work does essentially is two things:
1. Call the backend once instead of twice when both the inode and the
   parent are watching.
2. Snapshot the name and parent inode to pass to the backend not only
   when the parent is watching, but also when an sb/mnt mark exists
   which requests to get file names with events.
Compared to the existing implementation of fsnotify_parent(), my code
needs to also test bits in inode->i_fsnotify_mask,
inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask before the fast
path can be taken, so it's back to square one w.r.t. your optimizations.
I could add the same optimization as in fsnotify(), checking
->_fsnotify_marks before ->_fsnotify_mask, if you can provide me some
concrete numbers.

Thanks,
Amir.