In house, we developed our own variation of Theodore Ts'o and Robert Love's open-by-inode patch for the ext3 file system. The open-by-inode patch allows user space to open a file by its inode number through a special directory directly under the mount point called ".inode" (e.g. "<>/.inode/12345"). I also adapted the concept to the NFS file system for open-by-file-handle, using a magic directory ".file_handle".

When the open-by-* extensions were used under load, we quickly noticed a performance problem in the VFS layer. We tracked it down to processes blocking while taking the dir->i_mutex lock in real_lookup(). That makes sense, since the magic directories are constantly hammered with opens for "new" files that aren't yet in the dentry cache. We create normal dentry entries for the special files once they are opened, but there are so many unique files in the file systems that dentry cache hits are low enough for real_lookup() to be called often enough to be a real problem.

As a hack around this bottleneck, the developer of the ext3 extension created 100 ".inodeXXX" directories that behave identically, and in his application he spreads the open-by-inode requests sequentially among them. That dilutes the directory lock contention in real_lookup() enough for his situation. Depending on the number of disks configured, the ".inodeXXX" change gained him a considerable 5X-10X overall improvement in his application.

For the NFS case I took a different approach: I side-step the lock. I don't think the parent directory lock taken in real_lookup() is particularly useful for special-needs cases like these magic directories. The parent directory lock protects against the parent directory being renamed (moved) or deleted while the lookup is in progress. (Is that correct? What else does it do?) What I did instead is set a flag in the magic directory's inode's "i_flags"; then in real_lookup() I test for that flag and use it to bypass taking the dir->i_mutex lock.
Rather than create a new flag, I hijacked S_IMMUTABLE, since the magic directory is, conceptually, immutable. This solved the lock contention for my NFS work, gaining me about a 2.5X improvement.

I consider what's been done so far just a quick-and-dirty way to work around the problem. What I'd like to discuss is whether there is any practical way to design a real change to the VFS layer, acceptable to the community, that can either bypass the dir->i_mutex lock for special cases like those above or design the bottleneck out entirely. Or is there a radically different approach to implementing open-by-* functionality that would avoid the real_lookup() bottleneck? Any suggestions?

For illustrative purposes, below is my real_lookup() patch. I noticed that the S_IMMUTABLE flag is already set on some directories by the kernel in the "/proc" fs code. Is there any particular reason this was done? Would my patch cause any bad side effects for the /proc file system, or would it actually help its performance by bypassing the dir->i_mutex lock for those directories?

Quentin

--- linux-2.6.32.5/fs/namei.c	2010-01-22 17:23:21.000000000 -0600
+++ linux-2.6.32.5-immdir/fs/namei.c	2010-01-24 23:10:43.000000000 -0600
@@ -478,8 +478,13 @@ static struct dentry * real_lookup(struc
 {
 	struct dentry * result;
 	struct inode *dir = parent->d_inode;
+	int dir_locked = 0;
+
+	if (likely(!IS_IMMUTABLE(dir))) {
+		mutex_lock(&dir->i_mutex);
+		dir_locked = 1;
+	}
 
-	mutex_lock(&dir->i_mutex);
 	/*
 	 * First re-do the cached lookup just in case it was created
 	 * while we waited for the directory semaphore..
@@ -513,7 +518,8 @@ static struct dentry * real_lookup(struc
 		result = dentry;
 	}
 out_unlock:
-	mutex_unlock(&dir->i_mutex);
+	if (likely(dir_locked))
+		mutex_unlock(&dir->i_mutex);
 	return result;
 }
 
@@ -521,7 +527,8 @@ out_unlock:
 	 * Uhhuh! Nasty case: the cache was re-populated while
 	 * we waited on the semaphore. Need to revalidate.
 	 */
-	mutex_unlock(&dir->i_mutex);
+	if (likely(dir_locked))
+		mutex_unlock(&dir->i_mutex);
 	if (result->d_op && result->d_op->d_revalidate) {
 		result = do_revalidate(result, nd);
 		if (!result)

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html