Re: d_revalidate and real_lookup woes

Andreas Dilger <adilger@xxxxxxx> · Fri, 22 Aug 2008 17:18:19 -0600

On Aug 21, 2008  16:48 -0700, Sage Weil wrote:
> I came across two somewhat related issues with d_revalidate, and wanted to 
> find out if these were either known issues (possibly even addressed by 
> Al's planned dcache changes?) or just the product of my own confusion..
> 
> The d_revalidate() behavior in real_lookup() currently bails out with 
> -ENOENT when it encounters a race with i_mutex that confuses it.  
> Specifically:
> 
> - in do_lookup(), if no dentry is found (or do_revalidate() fails), we 
> call real_lookup().
> - real_lookup() takes the dir i_mutex.
> - if the dentry now exists (is found by d_lookup()), real_lookup() drops 
> i_mutex, and calls do_revalidate() (again).
> - if do_revalidate() fails now, we return -ENOENT.
> 
> This "two chances or -ENOENT" strategy strikes me as fundamentally racy.  
> If, while we wait for i_mutex, something else hashes the dentry, but for 
> one reason or another leaves it in a state in which d_revalidate will 
> fail, we get -ENOENT instead of a fresh call to lookup.  For (reasonably 
> long) timeout-based revalidation schemes, that doesn't really happen, but 
> if your timeout is sufficiently short and i_mutex is contended, or you are 
> explicitly invalidating a dentry (e.g. due to a callback from the server), 
> it can happen more frequently. 

Yes, we had all sorts of problems with this code with Lustre, because it
falls into the category of "server revokes lock rapidly on contended
directory".

> It also looks like we can get into trouble if a ->d_revalidate() doesn't 
> d_drop() when it fails: do_revalidate() will try to d_invalidate(), but 
> that can -EBUSY if the dentry has non-prunable children, and 
> do_revalidate() silently drops that error condition.  If I'm reading this 
> correctly, that means if we ever get a hashed dentry with children for 
> which ->d_revalidate() fails (and doesn't d_drop()), we will get -ENOENT 
> forever without ever calling ->lookup().

Hmm, this might even be a bug we see occasionally that hasn't been
investigated deeply enough...

> - I looked a bit at whether simply looping in real_lookup() would address 
> the issue (i.e. retake i_mutex, d_lookup(), call ->lookup() as needed).  
> Starvation issues aside, that can loop forever if it gets a 
> bad-but-hashed dentry (as described above).

We put in an arbitrary limit of 10 retries in real_lookup() before
returning -ESTALE.

> - Instead of returning -ENOENT (which seems kind of unlikely to ever be 
> the "right" answer), real_lookup() could call ->lookup(), even though the 
> dentry is hashed.  Assuming that's even allowed (?), that might lead to 
> more ->lookup() calls than we need...
> 
>  - A new ->d_revalidate_locked() method that can be called with the dir 
> i_mutex held would allow real_lookup() to decide whether to call 
> ->lookup() while avoiding any additional race/starvaction issues.

We did that also, moving looping inside the mutex_lock(), and moving
mutex_unlock() until after the revalidate.  After some code inspection,
the only filesystem that touched i_mutex inside revalidate was devfs
(at the time), but that isn't in the tree anymore.  I'm not sure if
other filesystems have changed since that time to use i_mutex there.

A patch against the SLES10 kernel is below.

diff -urNp linux-2.6.16.21-0.8.orig/fs/namei.c linux-2.6.16.21-0.8/fs/namei.c

--- linux-2.6.16.21-0.8.orig/fs/namei.c	2006-10-04 02:18:11.000000000 +0300
+++ linux-2.6.16.21-0.8/fs/namei.c	2007-01-29 18:21:10.000000000 +0200
@@ -440,8 +451,12 @@ static struct dentry * real_lookup(struc
 {
 	struct dentry * result;
 	struct inode *dir = parent->d_inode;
+	int counter = 0;
 
 	mutex_lock(&dir->i_mutex);
+again:
+	counter++;
+
 	/*
 	 * First re-do the cached lookup just in case it was created
 	 * while we waited for the directory semaphore..
@@ -475,13 +490,16 @@ static struct dentry * real_lookup(struc
 	 * Uhhuh! Nasty case: the cache was re-populated while
 	 * we waited on the semaphore. Need to revalidate.
 	 */
-	mutex_unlock(&dir->i_mutex);
 	if (result->d_op && result->d_op->d_revalidate) {
 		if (!result->d_op->d_revalidate(result, nd) && !d_invalidate(result)) {
 			dput(result);
-			result = ERR_PTR(-ENOENT);
+			if (counter > 10)
+				result = ERR_PTR(-ESTALE);
+			if (!IS_ERR(result))
+				goto again;
 		}
 	}
+	mutex_unlock(&dir->i_mutex);
 	return result;
 }
 
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html