On Thu, 06 Feb 2025, Christian Brauner wrote: > On Thu, Feb 06, 2025 at 04:42:38PM +1100, NeilBrown wrote: > > vfs_mkdir() does not guarantee to make the child dentry positive on > > success. It may leave it negative and then the caller needs to perform a > > lookup to find the target dentry. > > > > This patch introduced vfs_mkdir_return() which performs the lookup if > > needed so that this code is centralised. > > > > This prepares for a new inode operation which will perform mkdir and > > returns the correct dentry. > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx> > > --- > > fs/cachefiles/namei.c | 7 +--- > > fs/namei.c | 69 ++++++++++++++++++++++++++++++++++++++++ > > fs/nfsd/vfs.c | 21 ++---------- > > fs/overlayfs/dir.c | 33 +------------------ > > fs/overlayfs/overlayfs.h | 10 +++--- > > fs/overlayfs/super.c | 2 +- > > fs/smb/server/vfs.c | 24 +++----------- > > include/linux/fs.h | 2 ++ > > 8 files changed, 86 insertions(+), 82 deletions(-) > > > > diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c > > index 7cf59713f0f7..3c866c3b9534 100644 > > --- a/fs/cachefiles/namei.c > > +++ b/fs/cachefiles/namei.c > > @@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache, > > /* search the current directory for the element name */ > > inode_lock_nested(d_inode(dir), I_MUTEX_PARENT); > > > > -retry: > > ret = cachefiles_inject_read_error(); > > if (ret == 0) > > subdir = lookup_one_len(dirname, dir, strlen(dirname)); > > @@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache, > > goto mkdir_error; > > ret = cachefiles_inject_write_error(); > > if (ret == 0) > > - ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700); > > + ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700); > > if (ret < 0) { > > trace_cachefiles_vfs_error(NULL, d_inode(dir), ret, > > cachefiles_trace_mkdir_error); > > @@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache, > > } > > trace_cachefiles_mkdir(dir, subdir); > > > > - if (unlikely(d_unhashed(subdir))) { > > - cachefiles_put_directory(subdir); > > - goto retry; > > - } > > ASSERT(d_backing_inode(subdir)); > > > > _debug("mkdir -> %pd{ino=%lu}", > > diff --git a/fs/namei.c b/fs/namei.c > > index 3ab9440c5b93..d98caf36e867 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -4317,6 +4317,75 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, > > } > > EXPORT_SYMBOL(vfs_mkdir); > > > > +/** > > + * vfs_mkdir_return - create directory returning correct dentry > > + * @idmap: idmap of the mount the inode was found from > > + * @dir: inode of the parent directory > > + * @dentryp: pointer to dentry of the child directory > > + * @mode: mode of the child directory > > + * > > + * Create a directory. > > + * > > + * If the inode has been found through an idmapped mount the idmap of > > + * the vfsmount must be passed through @idmap. This function will then take > > + * care to map the inode according to @idmap before checking permissions. > > + * On non-idmapped mounts or if permission checking is to be performed on the > > + * raw inode simply pass @nop_mnt_idmap. > > + * > > + * The filesystem may not use the dentry that was passed in. In that case > > + * the passed-in dentry is put and a new one is placed in *@dentryp; > > + * So on successful return *@dentryp will always be positive. > > + */ > > +int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir, > > + struct dentry **dentryp, umode_t mode) > > +{ > > I think this is misnamed. Maybe vfs_mkdir_positive() is better here. > It also be nice to have a comment on vfs_mkdir() as well pointing out > that the returned dentry might be negative. While I'm not particularly fond of vfs_mkdir_return(), I don't see that vfs_mkdir_positive() is an improvement. I cannot find any relevant precedent in the kernel to guide. Most _return and _positive functions are for low-level counting primitives :-) I'm tempted to add another arg to vfs_mkdir() instead of adding a new function. That would solve one problem by introducing another: what arg? Maybe pass both a 'struct dentry *' and a 'struct dentry **' and if the latter is not NULL, it gets filled with the new dentry if there is one. > > And is there a particular reason to not have it return the new dentry? > That seems clearer than using the argument as a return value. If I did that then every caller would need to check if the return value was not IS_ERR_OR_NULL() and if so, dput() the original dentry and keep the new one - just like current callers of ->lookup need to. It seems cleaner to do that once in vfs_mkdir_return() rather than in all the callers. I guess we could *always* return the dentry on success and dput the old one if it was different or if there were an error. So dentry = vfs_mkdir_return(idmap, inode, dentry, mode) would be the common pattern. Would you be OK with that? > > > + struct dentry *dentry = *dentryp; > > + int error; > > + unsigned max_links = dir->i_sb->s_max_links; > > + > > + error = may_create(idmap, dir, dentry); > > + if (error) > > + return error; > > + > > + if (!dir->i_op->mkdir) > > + return -EPERM; > > + > > + mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0); > > + error = security_inode_mkdir(dir, dentry, mode); > > + if (error) > > + return error; > > + > > + if (max_links && dir->i_nlink >= max_links) > > + return -EMLINK; > > + > > + error = dir->i_op->mkdir(idmap, dir, dentry, mode); > > Why isn't this calling vfs_mkdir() and then only starts differing afterwards? Because once we introduce the new ->mkdir_async which returns a dentry the two functions start to diverge more. I could have a __vfs_mkdir() which does both and has a bool arg to tell it if we want the return value. That would avoid code duplication. > > > + if (!error) { > > + fsnotify_mkdir(dir, dentry); > > + if (unlikely(d_unhashed(dentry))) { > > + struct dentry *d; > > + /* Need a "const" pointer. We know d_name is const > > + * because we hold an exclusive lock on i_rwsem > > + * in d_parent. > > + */ > > + const struct qstr *d_name = (void*)&dentry->d_name; > > + d = lookup_dcache(d_name, dentry->d_parent, 0); > > + if (!d) > > + d = __lookup_slow(d_name, dentry->d_parent, 0); > > Quite a few caller's use lookup_one() here which calls > inode_permission() on @dir again. Are we guaranteed that the permission > check would always pass? I think they use lookup_one() because that was the easiest, not because they need all the functionality. If the process had permission to create a directory with a given name but now doesn't have permission to look up that same name, then something is weird. Maybe a race with permission changing could do that. But I think the process should have the right to hold the dentry that it has just successfully created. The lookup is hopefully just a work-around until the new improved interface is used by all relevant filesystems. Thanks, NeilBrown