In house, we developed our own variation of Theodore Ts'o and Robert Love's open-by-inode patch for the ext3 file system. The open-by-inode patch allows user space to open a file by its inode number through a special directory directly under the mount point called ".inode" (e.g. "<>/.inode/12345"). I also adapted the concept to the NFS file system for open-by-file-handle, using a magic directory ".file_handle".

When the open-by-* extensions were used under load, we quickly noticed a performance problem in the VFS layer. We tracked it down to processes blocking while taking the dir->i_mutex lock in real_lookup(). That makes sense, since the magic directories are constantly hammered with opens for "new" files that aren't yet in the dentry cache. We create normal dentry entries for the special files once they are opened, but there are so many unique files in the file systems that dentry cache hits are low enough for real_lookup() to be called often enough to be a real problem.

As a hack around this bottleneck, the developer of the ext3 extension created 100 ".inodeXXX" directories that behave identically, and in his application he spreads the open-by-inode requests sequentially among them. That dilutes the directory lock contention in real_lookup() enough for his situation. Depending on the number of disks configured, the ".inodeXXX" change gained him a considerable 5X-10X overall improvement in his application.

For the NFS case I took a different approach: I side-step the lock. I don't think the parent directory lock taken in real_lookup() is particularly useful for special-needs cases like these magic directories. The parent directory lock protects against the parent directory being renamed (moved) or deleted while the lookup is in progress. (Is that correct? What else does it do?) What I did instead is set a flag in the magic directory's inode's "i_flags"; then in real_lookup() I test for that flag and use it to bypass taking the dir->i_mutex lock.
Rather than create a new flag, I hijacked S_IMMUTABLE, since the magic directory is, conceptually, immutable. This solved the lock contention for my NFS work, gaining me about a 2.5X improvement.

I consider what's been done so far just a quick-and-dirty way to work around the problem. What I'd like to discuss is whether there is any practical way to design a real change to the VFS layer, acceptable to the community, that can either bypass the dir->i_mutex lock for special cases like those above or design the bottleneck out entirely. Or is there a radically different approach to implementing open-by-* functionality that would avoid the real_lookup() bottleneck? Any suggestions?

For illustrative purposes, below is my real_lookup() patch. I noticed that the S_IMMUTABLE flag is already set on some directories by the kernel in the "/proc" fs code. Is there any particular reason this was done? Would my patch cause any bad side effects for the /proc file system, or would it actually help its performance by bypassing the dir->i_mutex lock for those directories?

Quentin

--- linux-2.6.32.5/fs/namei.c	2010-01-22 17:23:21.000000000 -0600
+++ linux-2.6.32.5-immdir/fs/namei.c	2010-01-24 23:10:43.000000000 -0600
@@ -478,8 +478,13 @@ static struct dentry * real_lookup(struc
 {
 	struct dentry * result;
 	struct inode *dir = parent->d_inode;
+	int dir_locked = 0;
+
+	if (likely(!IS_IMMUTABLE(dir))) {
+		mutex_lock(&dir->i_mutex);
+		dir_locked = 1;
+	}
 
-	mutex_lock(&dir->i_mutex);
 	/*
 	 * First re-do the cached lookup just in case it was created
 	 * while we waited for the directory semaphore..
@@ -513,7 +518,8 @@ static struct dentry * real_lookup(struc
 		result = dentry;
 	}
 out_unlock:
-	mutex_unlock(&dir->i_mutex);
+	if (likely(dir_locked))
+		mutex_unlock(&dir->i_mutex);
 	return result;
 }
 
@@ -521,7 +527,8 @@ out_unlock:
 	 * Uhhuh! Nasty case: the cache was re-populated while
 	 * we waited on the semaphore. Need to revalidate.
 	 */
-	mutex_unlock(&dir->i_mutex);
+	if (likely(dir_locked))
+		mutex_unlock(&dir->i_mutex);
 	if (result->d_op && result->d_op->d_revalidate) {
 		result = do_revalidate(result, nd);
 		if (!result)

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html