On Aug 21, 2008 16:48 -0700, Sage Weil wrote: > I came across two somewhat related issues with d_revalidate, and wanted to > find out if these were either known issues (possibly even addressed by > Al's planned dcache changes?) or just the product of my own confusion.. > > The d_revalidate() behavior in real_lookup() currently bails out with > -ENOENT when it encounters a race with i_mutex that confuses it. > Specifically: > > - in do_lookup(), if no dentry is found (or do_revalidate() fails), we > call real_lookup(). > - real_lookup() takes the dir i_mutex. > - if the dentry now exists (is found by d_lookup()), real_lookup() drops > i_mutex, and calls do_revalidate() (again). > - if do_revalidate() fails now, we return -ENOENT. > > This "two chances or -ENOENT" strategy strikes me as fundamentally racy. > If, while we wait for i_mutex, something else hashes the dentry, but for > one reason or another leaves it in a state in which d_revalidate will > fail, we get -ENOENT instead of a fresh call to lookup. For (reasonably > long) timeout-based revalidation schemes, that doesn't really happen, but > if your timeout is sufficiently short and i_mutex is contended, or you are > explicitly invalidating a dentry (e.g. due to a callback from the server), > it can happen more frequently. Yes, we had all sorts of problems with this code with Lustre, because it falls into the category of "server revokes lock rapidly on contended directory". > It also looks like we can get into trouble if a ->d_revalidate() doesn't > d_drop() when it fails: do_revalidate() will try to d_invalidate(), but > that can -EBUSY if the dentry has non-prunable children, and > do_revalidate() silently drops that error condition. If I'm reading this > correctly, that means if we ever get a hashed dentry with children for > which ->d_revalidate() fails (and doesn't d_drop()), we will get -ENOENT > forever without ever calling ->lookup(). Hmm, this might even be a bug we see occasionally that hasn't been investigated deeply enough... > - I looked a bit at whether simply looping in real_lookup() would address > the issue (i.e. retake i_mutex, d_lookup(), call ->lookup() as needed). > Starvation issues aside, that can loop forever if it gets a > bad-but-hashed dentry (as described above). We put in an arbitrary limit of 10 retries in real_lookup() before returning -ESTALE. > - Instead of returning -ENOENT (which seems kind of unlikely to ever be > the "right" answer), real_lookup() could call ->lookup(), even though the > dentry is hashed. Assuming that's even allowed (?), that might lead to > more ->lookup() calls than we need... > > - A new ->d_revalidate_locked() method that can be called with the dir > i_mutex held would allow real_lookup() to decide whether to call > ->lookup() while avoiding any additional race/starvaction issues. We did that also, moving looping inside the mutex_lock(), and moving mutex_unlock() until after the revalidate. After some code inspection, the only filesystem that touched i_mutex inside revalidate was devfs (at the time), but that isn't in the tree anymore. I'm not sure if other filesystems have changed since that time to use i_mutex there. A patch against the SLES10 kernel is below. diff -urNp linux-2.6.16.21-0.8.orig/fs/namei.c linux-2.6.16.21-0.8/fs/namei.c --- linux-2.6.16.21-0.8.orig/fs/namei.c 2006-10-04 02:18:11.000000000 +0300 +++ linux-2.6.16.21-0.8/fs/namei.c 2007-01-29 18:21:10.000000000 +0200 @@ -440,8 +451,12 @@ static struct dentry * real_lookup(struc { struct dentry * result; struct inode *dir = parent->d_inode; + int counter = 0; mutex_lock(&dir->i_mutex); +again: + counter++; + /* * First re-do the cached lookup just in case it was created * while we waited for the directory semaphore.. @@ -475,13 +490,16 @@ static struct dentry * real_lookup(struc * Uhhuh! Nasty case: the cache was re-populated while * we waited on the semaphore. Need to revalidate. */ - mutex_unlock(&dir->i_mutex); if (result->d_op && result->d_op->d_revalidate) { if (!result->d_op->d_revalidate(result, nd) && !d_invalidate(result)) { dput(result); - result = ERR_PTR(-ENOENT); + if (counter > 10) + result = ERR_PTR(-ESTALE); + if (!IS_ERR(result)) + goto again; } } + mutex_unlock(&dir->i_mutex); return result; } Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html