Hi, I dropped this real_lookup() problem for a while, but it just bit us again and I think we have a cleaner solution. (BTW thanks for looking at this originally, Andreas!) On Fri, 22 Aug 2008, Andreas Dilger wrote: > On Aug 21, 2008 16:48 -0700, Sage Weil wrote: > > The d_revalidate() behavior in real_lookup() currently bails out with > > -ENOENT when it encounters a race with i_mutex that confuses it. > > Specifically: > > > > - in do_lookup(), if no dentry is found (or do_revalidate() fails), we > > call real_lookup(). > > - real_lookup() takes the dir i_mutex. > > - if the dentry now exists (is found by d_lookup()), real_lookup() drops > > i_mutex, and calls do_revalidate() (again). > > - if do_revalidate() fails now, we return -ENOENT. > > > > This "two chances or -ENOENT" strategy strikes me as fundamentally racy. > > If, while we wait for i_mutex, something else hashes the dentry, but for > > one reason or another leaves it in a state in which d_revalidate will > > fail, we get -ENOENT instead of a fresh call to lookup. For (reasonably > > long) timeout-based revalidation schemes, that doesn't really happen, but > > if your timeout is sufficiently short and i_mutex is contended, or you are > > explicitly invalidating a dentry (e.g. due to a callback from the server), > > it can happen more frequently. > > Yes, we had all sorts of problems with this code with Lustre, because it > falls into the category of "server revokes lock rapidly on contended > directory". [...] > > - A new ->d_revalidate_locked() method that can be called with the dir > > i_mutex held would allow real_lookup() to decide whether to call > > ->lookup() while avoiding any additional race/starvaction issues. > > We did that also, moving looping inside the mutex_lock(), and moving > mutex_unlock() until after the revalidate. After some code inspection, > the only filesystem that touched i_mutex inside revalidate was devfs > (at the time), but that isn't in the tree anymore. I'm not sure if > other filesystems have changed since that time to use i_mutex there. We did a bit more inspection, and noticed that d_revalidate is _already_ called with i_mutex held, in lots of places. do_filp_open(), lookup_create(), do_unlinkat(), do_rmdir(), and possibly others all take i_mutex, and then -> lookup_hash -> __lookup_hash -> cached_lookup -> do_revalidate Assuming that this is all okay, I think the cleaner solution is to just do_revalidate in real_lookup with i_mutex still held, and avoid any subsequent races altogether. This simplifies the function a bit and gets rid of the nasty -ENOENT behavior. We have a reliable test case (on ceph) and this fixes it. Does this look reasonable? If so, it would be great to get a fix for this upstream one way or another. It is highly unlikely to bite existing in tree file systems like NFS with the spurious -ENOENT, but it's certainly possible, and should be fixed regardless. Thanks- sage Signed-off-by: Yehuda Sadeh <yehuda@xxxxxxxxxxxx> Signed-off-by: Sage Weil <sage@xxxxxxxxxxxx> --- diff --git a/fs/namei.c b/fs/namei.c index 7a7949c..698b36c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -475,6 +475,7 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s { struct dentry * result; struct inode *dir = parent->d_inode; + struct dentry *dentry; mutex_lock(&dir->i_mutex); /* @@ -492,38 +493,41 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup */ result = d_lookup(parent, name); - if (!result) { - struct dentry *dentry; - - /* Don't create child dentry for a dead directory. */ - result = ERR_PTR(-ENOENT); - if (IS_DEADDIR(dir)) - goto out_unlock; - - dentry = d_alloc(parent, name); - result = ERR_PTR(-ENOMEM); - if (dentry) { - result = dir->i_op->lookup(dir, dentry, nd); + if (result) { + /* + * Uhhuh! Nasty case: the cache was re-populated while + * we waited on the mutex. Need to revalidate, this + * time with i_mutex held to avoid any subsequent + * race. + */ + if (result->d_op && result->d_op->d_revalidate) { + result = do_revalidate(result, nd); if (result) - dput(dentry); - else - result = dentry; + goto out_unlock; + /* + * Not only did we race on the dentry, but it was + * left behind invalid. We'll just do the lookup. + */ } -out_unlock: - mutex_unlock(&dir->i_mutex); - return result; } - /* - * Uhhuh! Nasty case: the cache was re-populated while - * we waited on the semaphore. Need to revalidate. - */ - mutex_unlock(&dir->i_mutex); - if (result->d_op && result->d_op->d_revalidate) { - result = do_revalidate(result, nd); - if (!result) - result = ERR_PTR(-ENOENT); + /* Don't create child dentry for a dead directory. */ + result = ERR_PTR(-ENOENT); + if (IS_DEADDIR(dir)) + goto out_unlock; + + dentry = d_alloc(parent, name); + result = ERR_PTR(-ENOMEM); + if (dentry) { + result = dir->i_op->lookup(dir, dentry, nd); + if (result) + dput(dentry); + else + result = dentry; } +out_unlock: + mutex_unlock(&dir->i_mutex); + return result; } -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html