On Sat, 27 Aug 2022, Al Viro wrote: > On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote: > > > +/** > > + * d_lock_update_nested - lock a dentry before updating > > + * @dentry: the dentry to be locked > > + * @base: the parent, or %NULL > > + * @name: the name in that parent, or %NULL > > + * @subclass: lockdep locking class. > > + * > > + * Lock a dentry in a directory on which a shared-lock may be held, and > > + * on which parallel updates are permitted. > > + * If the base and name are given, then on success the dentry will still > > + * have that base and name - it will not have raced with rename. > > + * On success, a positive dentry will still be hashed, ensuring there > > + * was no race with unlink. > > + * If they are not given, then on success the dentry will be negative, > > + * which again ensures no race with rename, or unlink. > > I'm not sure it's a good idea to have that in one function, TBH. > Looking at the callers, there are > * lookup_hash_update() > lookup_hash_update_len() > nfsd shite > filename_create_one() > filename_create_one_len() > nfsd shite > filename_create() > kern_path_create() > user_path_create() > do_mknodat() > do_mkdirat() > do_symlinkat() > do_linkat() > do_rmdir() > do_unlinkat() > * fuckloads of callers in lock_rename_lookup() > * d_lock_update() > atomic_open() > lookup_open() > > Only the last two can get NULL base or name. Already interesting, > isn't it? What's more, splitup between O_CREATE open() on one > side and everything else that might create, remove or rename on > the other looks really weird. Well, O_CREATE is a bit weird. But I can see that it would be cleaner to pass the dir/name into those two calls that currently get NULL - then remove the handling of NULL. Thanks. > > > + rcu_read_lock(); /* for d_same_name() */ > > + if (d_unhashed(dentry) && d_is_positive(dentry)) { > > + /* name was unlinked while we waited */ > > + ret = false; > > BTW, what happens if somebody has ->lookup() returning a positive > unhashed? Livelock on attempt to hit it with any directory-modifying > syscall? Right now such behaviour is permitted; I don't know if > anything in the tree actually does it, but at the very least it > would need to be documented. > > Note that *negative* unhashed is not just permitted, it's > actively used e.g. by autofs. That really needs to be well > commented... I hadn't thought that ->lookup() could return anything unhashed. I'll look into that - thanks. > > > + } else if (base && > > + (dentry->d_parent != base || > > + dentry->d_name.hash != name->hash || > > + !d_same_name(dentry, base, name))) { > > + /* dentry was renamed - possibly silly-rename */ > > + ret = false; > > + } else if (!base && d_is_positive(dentry)) { > > + ret = false; > > + } else { > > + dentry->d_flags |= DCACHE_PAR_UPDATE; > > + } > > > + * Parent directory has inode locked exclusive, or possibly shared if wq > > + * is given. In the later case the fs has explicitly allowed concurrent > > + * updates in this directory. This is the one and only case > > + * when ->lookup() may be called on a non in-lookup dentry. > > What Linus already said about wq... To add a reason he hadn't mentioned, > the length of call chain one needs to track to figure out whether it's > NULL or not is... excessive. And I don't mean just "greater than 0". > We have places like that, and sometimes we have to, but it's never a good > thing... > > > static struct dentry *__lookup_hash(const struct qstr *name, > > - struct dentry *base, unsigned int flags) > > + struct dentry *base, unsigned int flags, > > + wait_queue_head_t *wq) > > > - dentry = d_alloc(base, name); > > + if (wq) > > + dentry = d_alloc_parallel(base, name, wq); > > + else > > + dentry = d_alloc(base, name); > > if (unlikely(!dentry)) > > return ERR_PTR(-ENOMEM); > > + if (IS_ERR(dentry)) > > + return dentry; > > BTW, considering the fact that we have 12 callers of d_alloc() > (including this one) and 28 callers of its wrapper (d_alloc_name()), I > would seriously consider converting d_alloc() from "NULL or new dentry" > to "ERR_PTR(-ENOMEM) or new dentry". Especially since quite a few of > those callers will be happier that way. Grep and you'll see... As a > side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)). > > > +static struct dentry *lookup_hash_update( > > + const struct qstr *name, > > + struct dentry *base, unsigned int flags, > > + wait_queue_head_t *wq) > > +{ > > + struct dentry *dentry; > > + struct inode *dir = base->d_inode; > > + int err; > > + > > + if (wq && IS_PAR_UPDATE(dir)) > > + inode_lock_shared_nested(dir, I_MUTEX_PARENT); > > + else > > + inode_lock_nested(dir, I_MUTEX_PARENT); > > + > > +retry: > > + dentry = __lookup_hash(name, base, flags, wq); > > + if (IS_ERR(dentry)) { > > + err = PTR_ERR(dentry); > > + goto out_err; > > + } > > + if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) { > > + /* > > + * Failed to get lock due to race with unlink or rename > > + * - try again > > + */ > > + d_lookup_done(dentry); > > When would we get out of __lookup_hash() with in-lookup dentry? > Confused... Whenever wq is passed in and ->lookup() decides, based on the flags, to do nothing. NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET > > > +struct dentry *lookup_hash_update_len(const char *name, int nlen, > > + struct path *path, unsigned int flags, > > const struct path *, damnit... Sorry.... > > > + wait_queue_head_t *wq) > > +{ > > + struct qstr this; > > + int err = lookup_one_common(mnt_user_ns(path->mnt), name, > > + path->dentry, nlen, &this); > > + if (err) > > + return ERR_PTR(err); > > + return lookup_hash_update(&this, path->dentry, flags, wq); > > +} > > +EXPORT_SYMBOL(lookup_hash_update_len); > > Frankly, the calling conventions of the "..._one_len" family is something > I've kept regretting for a long time. Oh, well... > > > +static void done_path_update(struct path *path, struct dentry *dentry, > > + bool with_wq) > > +{ > > + struct inode *dir = path->dentry->d_inode; > > + > > + d_lookup_done(dentry); > > + d_unlock_update(dentry); > > + if (IS_PAR_UPDATE(dir) && with_wq) > > + inode_unlock_shared(dir); > > + else > > + inode_unlock(dir); > > +} > > const struct path *, again... > > > @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > > dentry = res; > > } > > } > > + /* > > + * If dentry is negative and this is a create we need to get > > + * DCACHE_PAR_UPDATE. > > + */ > > + if ((open_flag & O_CREAT) && !dentry->d_inode) > > + have_par_update = d_lock_update(dentry, NULL, NULL); > > > > /* Negative dentry, just create the file */ > > if (!dentry->d_inode && (open_flag & O_CREAT)) { > > Fold the above here, please. What's more, losing the flag would've > made it much easier to follow... Yes, that can certainly be tided up - thanks. > > > @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > > error = create_error; > > goto out_dput; > > } > > + if (have_par_update) > > + d_unlock_update(dentry); > > return dentry; > > > > out_dput: > > + if (have_par_update) > > + d_unlock_update(dentry); > > > > @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname, > > struct path *path, unsigned int lookup_flags) > > BTW, there's 9 callers of that sucker in the entire tree, along with > 3 callers of user_path_create() and 14 callers of done_path_create(). > Not a big deal to add the wq in those, especially since it can be easily > split into preparatory patch (with wq passed, but being unused). > > > -void done_path_create(struct path *path, struct dentry *dentry) > > +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq) > > Why "with_wq" and not the wq itself? > I did that at first. However when I did silly rename in NFS I found that I wanted to call the 'done' thing when I didn't still have the wq. ...which might mean I have a bug. And the done_path_create_wq() doesn't do anything with the wq so ... I'll look at that again - thanks. > > - * The caller must hold dir->i_mutex. > > + * The caller must either hold a write-lock on dir->i_rwsem, or > > + * a have atomically set DCACHE_PAR_UPDATE, or both. > > ??? That's a hangover from an earlier version where I didn't set DCACHE_PAR_UPDATE when we had an exclusive lock. Will fix. > > > + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the > > + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry. > > The latter happens unconditionally here, unless I'm misreading the code (as well > as your cover letter). It does *NOT* happen on rename(), though, contrary to > the same. And while your later patch adds it in lock_rename_lookup(), existing > lock_rename() callers do not get that at all. Likely to be a problem... Between the patch were DCACHE_PAR_UPDATE is added and the patch were lock_rename_lookup() is added, all filesystems use exclusive locks and none of them check DCACHE_PAR_UPDATE. So how can there be a problem? > > > --- a/include/linux/dcache.h > > +++ b/include/linux/dcache.h > > @@ -13,7 +13,9 @@ > > #include <linux/rcupdate.h> > > #include <linux/lockref.h> > > #include <linux/stringhash.h> > > +#include <linux/sched.h> > > Bloody hell, man... > I wonder what that was for .... removed. > > +static inline void d_unlock_update(struct dentry *dentry) > > +{ > > + if (IS_ERR_OR_NULL(dentry)) > > + return; > > Do explain... When could we ever get NULL or ERR_PTR() passed to that? Another hangover from earlier iterations - removed. Thanks. > > > BTW, I would seriously look into splitting the "let's add a helper > that combines locking parent with __lookup_hash()" into a preliminary > patch. Would be easier to follow. > Will look into that. Thanks a lot for the thorough review! NeilBrown