Re: [PATCH] namei: clear nd->root.mnt before O_CREAT unlazy

Ian Kent <raven@xxxxxxxxxx> · Fri, 07 Jan 2022 07:46:10 +0800

On Wed, 2022-01-05 at 13:02 -0500, Brian Foster wrote:
> The unlazy sequence of an rcuwalk lookup occurs a bit earlier than
> normal for O_CREAT lookups (i.e. in open_last_lookups()). The create
> logic here historically invoked complete_walk(), which clears the
> nd->root.mnt pointer when appropriate before the unlazy.  This
> changed in commit 72287417abd1 ("open_last_lookups(): don't abuse
> complete_walk() when all we want is unlazy"), which refactored the
> create path to invoke unlazy_walk() and not consider nd->root.mnt.
> 
> This tweak negatively impacts performance on a concurrent
> open(O_CREAT) workload to multiple independent mounts beneath the
> root directory. This attributes to increased spinlock contention on
> the root dentry via legitimize_root(), to the point where the
> spinlock becomes the primary bottleneck over the directory inode
> rwsem of the individual submounts. For example, the completion rate
> of a 32k thread aim7 create/close benchmark that repeatedly passes
> O_CREAT to open preexisting files drops from over 700k "jobs per
> minute" to 30, increasing the overall test time from a few minutes
> to over an hour.
> 
> A similar, more simplified test to create a set of opener tasks
> across a set of submounts can demonstrate the problem more quickly.
> For example, consider sets of 100 open/close tasks each running
> against 64 independent filesystem mounts (i.e. 6400 tasks total),
> with each task completing 10k iterations before it exits. On an
> 80xcpu box running v5.16.0-rc2, this test completes in 50-55s. With
> this patch applied, the same test completes in 10-15s.
> 
> This is not the most realistic workload in the world as it factors
> out inode allocation in the filesystem. The contention can also be
> avoided by more selective use of O_CREAT or via use of relative
> pathnames. That said, this regression appears to be an unintentional
> side effect of code cleanup and might be unexpected for users.
> Restore original behavior prior to commit 72287417abd1 by factoring
> the rcu logic from complete_walk() into a new helper and invoke that
> from both places.
> 
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/namei.c | 37 +++++++++++++++++++++----------------
>  1 file changed, 21 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 1f9d2187c765..b32fcbc99929 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -856,6 +856,22 @@ static inline int d_revalidate(struct dentry
> *dentry, unsigned int flags)
>                 return 1;
>  }
>  
> +static inline bool complete_walk_rcu(struct nameidata *nd)
> +{
> +       if (nd->flags & LOOKUP_RCU) {
> +               /*
> +                * We don't want to zero nd->root for scoped-lookups
> or
> +                * externally-managed nd->root.
> +                */
> +               if (!(nd->state & ND_ROOT_PRESET))
> +                       if (!(nd->flags & LOOKUP_IS_SCOPED))
> +                               nd->root.mnt = NULL;
> +               nd->flags &= ~LOOKUP_CACHED;
> +               return try_to_unlazy(nd);
> +       }
> +       return true;
> +}
> +
>  /**
>   * complete_walk - successful completion of path walk
>   * @nd:  pointer nameidata
> @@ -871,18 +887,8 @@ static int complete_walk(struct nameidata *nd)
>         struct dentry *dentry = nd->path.dentry;
>         int status;
>  
> -       if (nd->flags & LOOKUP_RCU) {
> -               /*
> -                * We don't want to zero nd->root for scoped-lookups
> or
> -                * externally-managed nd->root.
> -                */
> -               if (!(nd->state & ND_ROOT_PRESET))
> -                       if (!(nd->flags & LOOKUP_IS_SCOPED))
> -                               nd->root.mnt = NULL;
> -               nd->flags &= ~LOOKUP_CACHED;
> -               if (!try_to_unlazy(nd))
> -                       return -ECHILD;
> -       }
> +       if (!complete_walk_rcu(nd))
> +               return -ECHILD;
>  
>         if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
>                 /*
> @@ -3325,10 +3331,9 @@ static const char *open_last_lookups(struct
> nameidata *nd,
>                 BUG_ON(nd->flags & LOOKUP_RCU);
>         } else {
>                 /* create side of things */
> -               if (nd->flags & LOOKUP_RCU) {
> -                       if (!try_to_unlazy(nd))
> -                               return ERR_PTR(-ECHILD);
> -               }
> +               if (!complete_walk_rcu(nd))
> +                       return ERR_PTR(-ECHILD);
> +
>                 audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
>                 /* trailing slashes? */
>                 if (unlikely(nd->last.name[nd->last.len]))

Looks good, assuming Al is ok with the re-factoring.
Reviewed-by: Ian Kent <raven@xxxxxxxxxx>

Ian