On Wed, 2022-01-05 at 13:02 -0500, Brian Foster wrote: > The unlazy sequence of an rcuwalk lookup occurs a bit earlier than > normal for O_CREAT lookups (i.e. in open_last_lookups()). The create > logic here historically invoked complete_walk(), which clears the > nd->root.mnt pointer when appropriate before the unlazy. This > changed in commit 72287417abd1 ("open_last_lookups(): don't abuse > complete_walk() when all we want is unlazy"), which refactored the > create path to invoke unlazy_walk() and not consider nd->root.mnt. > > This tweak negatively impacts performance on a concurrent > open(O_CREAT) workload to multiple independent mounts beneath the > root directory. This attributes to increased spinlock contention on > the root dentry via legitimize_root(), to the point where the > spinlock becomes the primary bottleneck over the directory inode > rwsem of the individual submounts. For example, the completion rate > of a 32k thread aim7 create/close benchmark that repeatedly passes > O_CREAT to open preexisting files drops from over 700k "jobs per > minute" to 30, increasing the overall test time from a few minutes > to over an hour. > > A similar, more simplified test to create a set of opener tasks > across a set of submounts can demonstrate the problem more quickly. > For example, consider sets of 100 open/close tasks each running > against 64 independent filesystem mounts (i.e. 6400 tasks total), > with each task completing 10k iterations before it exits. On an > 80xcpu box running v5.16.0-rc2, this test completes in 50-55s. With > this patch applied, the same test completes in 10-15s. > > This is not the most realistic workload in the world as it factors > out inode allocation in the filesystem. The contention can also be > avoided by more selective use of O_CREAT or via use of relative > pathnames. That said, this regression appears to be an unintentional > side effect of code cleanup and might be unexpected for users. > Restore original behavior prior to commit 72287417abd1 by factoring > the rcu logic from complete_walk() into a new helper and invoke that > from both places. > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > --- > fs/namei.c | 37 +++++++++++++++++++++---------------- > 1 file changed, 21 insertions(+), 16 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 1f9d2187c765..b32fcbc99929 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -856,6 +856,22 @@ static inline int d_revalidate(struct dentry > *dentry, unsigned int flags) > return 1; > } > > +static inline bool complete_walk_rcu(struct nameidata *nd) > +{ > + if (nd->flags & LOOKUP_RCU) { > + /* > + * We don't want to zero nd->root for scoped-lookups > or > + * externally-managed nd->root. > + */ > + if (!(nd->state & ND_ROOT_PRESET)) > + if (!(nd->flags & LOOKUP_IS_SCOPED)) > + nd->root.mnt = NULL; > + nd->flags &= ~LOOKUP_CACHED; > + return try_to_unlazy(nd); > + } > + return true; > +} > + > /** > * complete_walk - successful completion of path walk > * @nd: pointer nameidata > @@ -871,18 +887,8 @@ static int complete_walk(struct nameidata *nd) > struct dentry *dentry = nd->path.dentry; > int status; > > - if (nd->flags & LOOKUP_RCU) { > - /* > - * We don't want to zero nd->root for scoped-lookups > or > - * externally-managed nd->root. > - */ > - if (!(nd->state & ND_ROOT_PRESET)) > - if (!(nd->flags & LOOKUP_IS_SCOPED)) > - nd->root.mnt = NULL; > - nd->flags &= ~LOOKUP_CACHED; > - if (!try_to_unlazy(nd)) > - return -ECHILD; > - } > + if (!complete_walk_rcu(nd)) > + return -ECHILD; > > if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) { > /* > @@ -3325,10 +3331,9 @@ static const char *open_last_lookups(struct > nameidata *nd, > BUG_ON(nd->flags & LOOKUP_RCU); > } else { > /* create side of things */ > - if (nd->flags & LOOKUP_RCU) { > - if (!try_to_unlazy(nd)) > - return ERR_PTR(-ECHILD); > - } > + if (!complete_walk_rcu(nd)) > + return ERR_PTR(-ECHILD); > + > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > /* trailing slashes? */ > if (unlikely(nd->last.name[nd->last.len])) Looks good, assuming Al is ok with the re-factoring. Reviewed-by: Ian Kent <raven@xxxxxxxxxx> Ian