Re: d_revalidate and real_lookup woes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

I dropped this real_lookup() problem for a while, but it just bit us again 
and I think we have a cleaner solution.  (BTW thanks for looking at this 
originally, Andreas!)

On Fri, 22 Aug 2008, Andreas Dilger wrote:
> On Aug 21, 2008  16:48 -0700, Sage Weil wrote:
> > The d_revalidate() behavior in real_lookup() currently bails out with 
> > -ENOENT when it encounters a race with i_mutex that confuses it.  
> > Specifically:
> > 
> > - in do_lookup(), if no dentry is found (or do_revalidate() fails), we 
> > call real_lookup().
> > - real_lookup() takes the dir i_mutex.
> > - if the dentry now exists (is found by d_lookup()), real_lookup() drops 
> > i_mutex, and calls do_revalidate() (again).
> > - if do_revalidate() fails now, we return -ENOENT.
> > 
> > This "two chances or -ENOENT" strategy strikes me as fundamentally racy.  
> > If, while we wait for i_mutex, something else hashes the dentry, but for 
> > one reason or another leaves it in a state in which d_revalidate will 
> > fail, we get -ENOENT instead of a fresh call to lookup.  For (reasonably 
> > long) timeout-based revalidation schemes, that doesn't really happen, but 
> > if your timeout is sufficiently short and i_mutex is contended, or you are 
> > explicitly invalidating a dentry (e.g. due to a callback from the server), 
> > it can happen more frequently. 
> 
> Yes, we had all sorts of problems with this code with Lustre, because it
> falls into the category of "server revokes lock rapidly on contended
> directory".

[...]

> >  - A new ->d_revalidate_locked() method that can be called with the dir 
> > i_mutex held would allow real_lookup() to decide whether to call 
> > ->lookup() while avoiding any additional race/starvaction issues.
> 
> We did that also, moving looping inside the mutex_lock(), and moving
> mutex_unlock() until after the revalidate.  After some code inspection,
> the only filesystem that touched i_mutex inside revalidate was devfs
> (at the time), but that isn't in the tree anymore.  I'm not sure if
> other filesystems have changed since that time to use i_mutex there.

We did a bit more inspection, and noticed that d_revalidate is _already_ 
called with i_mutex held, in lots of places.  do_filp_open(), 
lookup_create(), do_unlinkat(), do_rmdir(), and possibly others all take 
i_mutex, and then

-> lookup_hash
	-> __lookup_hash
		-> cached_lookup
			-> do_revalidate

Assuming that this is all okay, I think the cleaner solution is to just 
do_revalidate in real_lookup with i_mutex still held, and avoid any 
subsequent races altogether.  This simplifies the function a bit and gets 
rid of the nasty -ENOENT behavior.  We have a reliable test case (on ceph) 
and this fixes it.

Does this look reasonable?  If so, it would be great to get a fix for this 
upstream one way or another.  It is highly unlikely to bite existing in 
tree file systems like NFS with the spurious -ENOENT, but it's certainly 
possible, and should be fixed regardless.

Thanks-
sage


Signed-off-by: Yehuda Sadeh <yehuda@xxxxxxxxxxxx>
Signed-off-by: Sage Weil <sage@xxxxxxxxxxxx>
---

diff --git a/fs/namei.c b/fs/namei.c
index 7a7949c..698b36c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -475,6 +475,7 @@ static struct dentry * real_lookup(struct dentry * 
parent, struct qstr * name, s
 {
        struct dentry * result;
        struct inode *dir = parent->d_inode;
+       struct dentry *dentry;
 
        mutex_lock(&dir->i_mutex);
        /*
@@ -492,38 +493,41 @@ static struct dentry * real_lookup(struct dentry * 
parent, struct qstr * name, s
         * so doing d_lookup() (with seqlock), instead of lockfree 
__d_lookup
         */
        result = d_lookup(parent, name);
-       if (!result) {
-               struct dentry *dentry;
-
-               /* Don't create child dentry for a dead directory. */
-               result = ERR_PTR(-ENOENT);
-               if (IS_DEADDIR(dir))
-                       goto out_unlock;
-
-               dentry = d_alloc(parent, name);
-               result = ERR_PTR(-ENOMEM);
-               if (dentry) {
-                       result = dir->i_op->lookup(dir, dentry, nd);
+       if (result) {
+               /*
+                * Uhhuh! Nasty case: the cache was re-populated while
+                * we waited on the mutex. Need to revalidate, this
+                * time with i_mutex held to avoid any subsequent
+                * race.
+                */
+               if (result->d_op && result->d_op->d_revalidate) {
+                       result = do_revalidate(result, nd);
                        if (result)
-                               dput(dentry);
-                       else
-                               result = dentry;
+                               goto out_unlock;
+                       /*
+                        * Not only did we race on the dentry, but it was
+                        * left behind invalid. We'll just do the lookup.
+                        */
                }
-out_unlock:
-               mutex_unlock(&dir->i_mutex);
-               return result;
        }
 
-       /*
-        * Uhhuh! Nasty case: the cache was re-populated while
-        * we waited on the semaphore. Need to revalidate.
-        */
-       mutex_unlock(&dir->i_mutex);
-       if (result->d_op && result->d_op->d_revalidate) {
-               result = do_revalidate(result, nd);
-               if (!result)
-                       result = ERR_PTR(-ENOENT);
+       /* Don't create child dentry for a dead directory. */
+       result = ERR_PTR(-ENOENT);
+       if (IS_DEADDIR(dir))
+               goto out_unlock;
+
+       dentry = d_alloc(parent, name);
+       result = ERR_PTR(-ENOMEM);
+       if (dentry) {
+               result = dir->i_op->lookup(dir, dentry, nd);
+               if (result)
+                       dput(dentry);
+               else
+                       result = dentry;
        }
+out_unlock:
+       mutex_unlock(&dir->i_mutex);
+
        return result;
 }



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux