Re: real_lookup() lock contention causes poor performance

On 2010-01-25, at 10:09, Quentin Barnes wrote:
In house, we developed our own variation of Theodore Ts'o and
Robert Love's open-by-inode patch for the ext3 file system.  The
open-by-inode patch allows user space to open a file by its inode
number under a special directory directly under the mount point
called ".inode" (e.g.: "<>/.inode/12345").  I also adapted the
concept for the NFS file system for open-by-file-handle using a
magic directory ".file_handle".

We have had something similar for Lustre patched into ext3/ext4 for a long time - "__iopen__/{inum}", which depends on a mount option "iopen" to allow this functionality, and "iopen_nopriv" to allow access to non-root users.

The problem we had in the past is that this needed a bunch of changes in ext3/4 and the VFS in order to handle subtle races in the dcache. We're moving away from changing ext3/4, and instead changing the API to use export methods (i.e. ext4_get_dentry()) on each filesystem, which should simplify this a lot, though I'm not sure it will solve your scalability problem.
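
For reference, the export-method route would look very roughly like
the following.  This is only a sketch of the idea, not the actual
Lustre code: ->fh_to_dentry() is the existing exportfs method, while
the helper name and the FILEID_INO32_GEN choice here are illustrative
assumptions.

/* Rough sketch only - resolve a magic-directory name through the
 * filesystem's existing export operations rather than fs-private code. */
#include <linux/exportfs.h>
#include <linux/err.h>
#include <linux/errno.h>

static struct dentry *iopen_fid_to_dentry(struct super_block *sb,
					   struct fid *fid, int fh_len)
{
	const struct export_operations *ops = sb->s_export_op;

	if (!ops || !ops->fh_to_dentry)
		return ERR_PTR(-EOPNOTSUPP);

	/* 32-bit inode number + generation, as in an NFS file handle */
	return ops->fh_to_dentry(sb, fid, fh_len, FILEID_INO32_GEN);
}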

With the open-by-* extensions when used under load, we quickly
noticed a performance problem in the VFS layer.  We tracked it
down to processes blocking on the taking of the dir->i_mutex lock
in real_lookup().  That makes sense since the magic directories
constantly get hammered with opens for "new" files that aren't
yet in the dentry cache.  We create normal dentry entries for the
special files once opened, but there are so many unique files
in the file systems that dentry cache hits are low enough that
real_lookup() gets called often enough to be a real problem.

As a hack for this bottleneck, the developer of the ext3 extension
created 100 ".inodeXXX" directories that behave identically.
Then, in an app, he spreads the open-by-inode requests among them
sequentially.  That dilutes the directory lock contention in
real_lookup() enough for his situation.
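
As a hypothetical sketch of that workaround, an application could
spread the opens round-robin across the cloned directories so that no
single directory's i_mutex becomes the hot lock (the directory naming
and count below are assumptions):

#include <stdio.h>
#include <fcntl.h>
#include <limits.h>
#include <inttypes.h>

#define NR_INODE_DIRS	100

int open_by_inode_spread(const char *mnt, uint64_t ino)
{
	static unsigned int next;	/* round-robin cursor (not thread-safe) */
	char path[PATH_MAX];
	unsigned int dir = next++ % NR_INODE_DIRS;

	/* e.g. "/mnt/data/.inode042/12345" */
	snprintf(path, sizeof(path), "%s/.inode%03u/%" PRIu64, mnt, dir, ino);
	return open(path, O_RDONLY);
}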

We have a patch that avoids this problem, called dynlocks (previously pdirops) that allows locking only the hash of the filename instead of the whole directory. For in-cache lookups this is perfectly sufficient (chance of hash collision is vanishingly small), but an additional patch is needed at the ext3/4 level to allow parallel lookup/add/remove in an htree directory. Since the number of threads in the kernel at one time is relatively small, this is currently only a linked list of entries, but it would be trivial to also put the locks into a hashtable with 2*num_online_cpus() buckets and avoid most hash chains.
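
To make that concrete, the core of the idea is roughly the following.
This is an illustration of the concept only, not the actual dynlocks
code; the fixed-size bucket table here stands in for the linked list
(or hashtable) of held entries described above.

#include <linux/mutex.h>
#include <linux/dcache.h>

#define DL_NR_BUCKETS	64	/* e.g. ~2 * num_online_cpus() */

struct dynlock {
	struct mutex	buckets[DL_NR_BUCKETS];
};

/* Serialize on the hash of the name being looked up, not the directory. */
static void dynlock_lock(struct dynlock *dl, const struct qstr *name)
{
	mutex_lock(&dl->buckets[name->hash % DL_NR_BUCKETS]);
}

static void dynlock_unlock(struct dynlock *dl, const struct qstr *name)
{
	mutex_unlock(&dl->buckets[name->hash % DL_NR_BUCKETS]);
}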

This was implemented by having a per-filesystem ->dir_lock() and ->dir_unlock() method, defaulting to ->mutex_lock() and ->mutex_unlock() if unspecified.
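
In other words, the namei.c callers would go through something like
the following instead of taking i_mutex directly.  The ->dir_lock()
and ->dir_unlock() members do not exist in mainline inode_operations;
this only illustrates the shape of the change.

#include <linux/fs.h>

static void vfs_dir_lock(struct inode *dir, const struct qstr *name)
{
	if (dir->i_op->dir_lock)
		dir->i_op->dir_lock(dir, name);	/* e.g. dynlock on the name hash */
	else
		mutex_lock(&dir->i_mutex);
}

static void vfs_dir_unlock(struct inode *dir, const struct qstr *name)
{
	if (dir->i_op->dir_unlock)
		dir->i_op->dir_unlock(dir, name);
	else
		mutex_unlock(&dir->i_mutex);
}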

For the NFS case, I took a different approach: I side-step the
lock.  I don't think the parent directory lock in real_lookup() of
the magic directory for these special needs cases is particularly
useful.  The parent directory lock protects against the parent
directory being renamed (moved) or deleted while the lookup is in
progress.  (Is that correct?  What else does it do?)  What I did
instead is set a flag in the magic directory's inode's "i_flags".
Then, in real_lookup(), I test for that flag and use it to bypass
taking the dir->i_mutex lock.  Rather than create a new flag, I hijacked
S_IMMUTABLE, since the magic directory is, in a way, conceptually
immutable.  This solved the lock contention for my NFS work, gaining
me about a 2.5X improvement.

The problem with using S_IMMUTABLE is that it can be changed from userspace, and if that happens in the middle of a lookup you will get imbalanced up/down calls, and normal directories may become corrupt and/or unusable.

I consider what's been done so far just a quick and dirty way to
work around a problem.  What I'd like to discuss, though, is whether
there is any practical way to design a real change to the VFS layer,
acceptable to the community, that can either bypass the dir->i_mutex
lock for special cases like those mentioned above or just design out
the bottleneck.  Or is there a radically different approach for
implementing open-by-* functionality that would bypass the
real_lookup() bottleneck?  Any suggestions?


For illustrative purposes, below is my real_lookup() patch.  I
noticed that the S_IMMUTABLE flag is already being set on some
directories by the kernel in the "/proc" fs code.  Any particular
reason this was done?  Would my patch cause any bad side-effects
for the /proc file system, or would it actually help its performance
by bypassing the dir->i_mutex lock for those directories?

Quentin


--- linux-2.6.32.5/fs/namei.c	2010-01-22 17:23:21.000000000 -0600
+++ linux-2.6.32.5-immdir/fs/namei.c 2010-01-24 23:10:43.000000000 -0600
@@ -478,8 +478,13 @@ static struct dentry * real_lookup(struc
{
	struct dentry * result;
	struct inode *dir = parent->d_inode;
+	int dir_locked = 0;
+
+	if (likely(!IS_IMMUTABLE(dir))) {
+		mutex_lock(&dir->i_mutex);
+		dir_locked = 1;
+	}

-	mutex_lock(&dir->i_mutex);
	/*
	 * First re-do the cached lookup just in case it was created
	 * while we waited for the directory semaphore..
@@ -513,7 +518,8 @@ static struct dentry * real_lookup(struc
				result = dentry;
		}
out_unlock:
-		mutex_unlock(&dir->i_mutex);
+		if (likely(dir_locked))
+			mutex_unlock(&dir->i_mutex);
		return result;
	}

@@ -521,7 +527,8 @@ out_unlock:
	 * Uhhuh! Nasty case: the cache was re-populated while
	 * we waited on the semaphore. Need to revalidate.
	 */
-	mutex_unlock(&dir->i_mutex);
+	if (likely(dir_locked))
+		mutex_unlock(&dir->i_mutex);
	if (result->d_op && result->d_op->d_revalidate) {
		result = do_revalidate(result, nd);
		if (!result)
Quentin



Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

