On Tue, Aug 6, 2024 at 4:32 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > Today, when opening a file we'll typically do a fast lookup, but if > O_CREAT is set, the kernel always takes the exclusive inode lock. I > assume this was done with the expectation that O_CREAT means that we > always expect to do the create, but that's often not the case. Many > programs set O_CREAT even in scenarios where the file already exists. > > This patch rearranges the pathwalk-for-open code to also attempt a > fast_lookup in certain O_CREAT cases. If a positive dentry is found, the > inode_lock can be avoided altogether, and if auditing isn't enabled, it > can stay in rcuwalk mode for the last step_into. > > One notable exception that is hopefully temporary: if we're doing an > rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the > dentry in that case is more expensive than taking the i_rwsem for now. > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > --- > Here's a revised patch that does a fast_lookup in the O_CREAT codepath > too. The main difference here is that if a positive dentry is found and > audit_dummy_context is true, then we keep the walk lazy for the last > component, which avoids having to take any locks on the parent (just > like with non-O_CREAT opens). > > The testcase below runs in about 18s on v6.10 (on an 80 CPU machine). > With this patch, it runs in about 1s: > I don't have an opinion on the patch. If your kernel does not use apparmor and the patch manages to dodge refing the parent, then indeed this should be fully deserialized just like non-creat opens. Instead of the hand-rolled benchmark may I interest you in using will-it-scale instead? Notably it reports the achieved rate once per second, so you can check if there is funky business going on between reruns, gives the cpu the time to kick off turbo boost if applicable etc. I would bench with that myself, but I temporarily don't have handy access to bigger hw. Even so, the below is completely optional and perhaps more of a suggestion for the future :) I hacked up the test case based on tests/open1.c. git clone https://github.com/antonblanchard/will-it-scale.git For example plop into tests/opencreate1.c && gmake && ./opencreate1_processes -t 70: #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <assert.h> #include <string.h> char *testcase_description = "Separate file open/close + O_CREAT"; #define template "/tmp/willitscale.XXXXXX" static char (*tmpfiles)[sizeof(template)]; static unsigned long local_nr_tasks; void testcase_prepare(unsigned long nr_tasks) { int i; tmpfiles = (char(*)[sizeof(template)])malloc(sizeof(template) * nr_tasks); assert(tmpfiles); for (i = 0; i < nr_tasks; i++) { strcpy(tmpfiles[i], template); char *tmpfile = tmpfiles[i]; int fd = mkstemp(tmpfile); assert(fd >= 0); close(fd); } local_nr_tasks = nr_tasks; } void testcase(unsigned long long *iterations, unsigned long nr) { char *tmpfile = tmpfiles[nr]; while (1) { int fd = open(tmpfile, O_RDWR | O_CREAT, 0600); assert(fd >= 0); close(fd); (*iterations)++; } } void testcase_cleanup(void) { int i; for (i = 0; i < local_nr_tasks; i++) { unlink(tmpfiles[i]); } free(tmpfiles); } > --- > fs/namei.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------- > 1 file changed, 53 insertions(+), 9 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 1e05a0f3f04d..2d716fb114c9 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3518,6 +3518,47 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > return ERR_PTR(error); > } > > +static inline bool trailing_slashes(struct nameidata *nd) > +{ > + return (bool)nd->last.name[nd->last.len]; > +} > + > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) > +{ > + struct dentry *dentry; > + > + if (open_flag & O_CREAT) { > + /* Don't bother on an O_EXCL create */ > + if (open_flag & O_EXCL) > + return NULL; > + > + /* > + * FIXME: If auditing is enabled, then we'll have to unlazy to > + * use the dentry. For now, don't do this, since it shifts > + * contention from parent's i_rwsem to its d_lockref spinlock. > + * Reconsider this once dentry refcounting handles heavy > + * contention better. > + */ > + if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context()) > + return NULL; > + } > + > + if (trailing_slashes(nd)) > + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > + > + dentry = lookup_fast(nd); > + > + if (open_flag & O_CREAT) { > + /* Discard negative dentries. Need inode_lock to do the create */ > + if (dentry && !dentry->d_inode) { > + if (!(nd->flags & LOOKUP_RCU)) > + dput(dentry); > + dentry = NULL; > + } > + } > + return dentry; > +} > + > static const char *open_last_lookups(struct nameidata *nd, > struct file *file, const struct open_flags *op) > { > @@ -3535,28 +3576,31 @@ static const char *open_last_lookups(struct nameidata *nd, > return handle_dots(nd, nd->last_type); > } > > + /* We _can_ be in RCU mode here */ > + dentry = lookup_fast_for_open(nd, open_flag); > + if (IS_ERR(dentry)) > + return ERR_CAST(dentry); > + > if (!(open_flag & O_CREAT)) { > - if (nd->last.name[nd->last.len]) > - nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > - /* we _can_ be in RCU mode here */ > - dentry = lookup_fast(nd); > - if (IS_ERR(dentry)) > - return ERR_CAST(dentry); > if (likely(dentry)) > goto finish_lookup; > > if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) > return ERR_PTR(-ECHILD); > } else { > - /* create side of things */ > if (nd->flags & LOOKUP_RCU) { > + /* can stay in rcuwalk if not auditing */ > + if (dentry && audit_dummy_context()) > + goto check_slashes; > if (!try_to_unlazy(nd)) > return ERR_PTR(-ECHILD); > } > audit_inode(nd->name, dir, AUDIT_INODE_PARENT); > - /* trailing slashes? */ > - if (unlikely(nd->last.name[nd->last.len])) > +check_slashes: > + if (trailing_slashes(nd)) > return ERR_PTR(-EISDIR); > + if (dentry) > + goto finish_lookup; > } > > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > > --- > base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd > change-id: 20240723-openfast-ac49a7b6ade2 > > Best regards, > -- > Jeff Layton <jlayton@xxxxxxxxxx> > -- Mateusz Guzik <mjguzik gmail.com>