On Aug 04, 2008 11:03 +1000, Neil Brown wrote: > So, given that background, it is possible to see some more possible > solutions (other than the ones already mentioned). > > - drop the internal lock across filldir. > It could be seen a impolite to hold any locks across a callback > that are not documented as being held. > This would put an extra burden on the filesystem, but it shouldn't > be a particularly big burden. > A filesystem needs to be able to find the 'next' entry from a > sequential 'seek' pointer so that is the most state that needs to > be preserved. It might be convenient to be able to keep more state > (pointers into data structures etc). All this can be achieved with > fairly standard practice: > a/ have a 'version' number per inode which is updated when any > internal restructure happens. > b/ when calling filldir, record the version number, drop the lock > call filldir, reclaim the lock, check the version number > c/ if the version number matches (as it mostly will) just keep > going. If it doesn't jump back to the top of readdir where > we decode the current seek address. > > Some filesystems have problems implementing seekdir/telldir so they > might not be totally happy here. I have little sympathy for such > filesystems and feel the burden should be on them to make it work. > > - use i_mutex to protect internal changes too, and drop i_mutex while > doing internal restructuring. This would need some VFS changes so > that dropping i_mutex would be safe. It would require some way to > lock an individual dentry. Probably you would lock it under > i_mutex by setting a flag bit, wait for the flag on some inode-wide > waitqueue, and drop the lock by clearing the flag and waking the > waitqueue. And you are never allowed to look at ->d_inode if the > lock flag is set. When we were working on scaling the performance of concurrent operations in a single directory we added hashed dentry locks instead of using i_mutex (well, i_sem in those days) to lock the whole directory. To make the change manageable we replaced direct i_sem locking on the directory inode with ->lock_dir() and ->unlock_dir() methods, defaulting to just down() and up() on i_sem, but replacing this with a per-entry lock on the child dentry hash. This allowed Lustre servers to create/lookup/rename/remove many entries in a single directory concurrently, and I think this same approach could be useful in this case also. This allows filesystems that need it to bypass i_mutex if they need their own brand of locking, while leaving the majority of filesystems untouched. It also has the benefit that filesystems that need improved multi-threaded performance in a single directory (e.g. JFFS2, XFS, or HPC or MTA workloads) have the ability to do it. There is definitely some work needed internal to the filesystem to take advantage of this increased parallelism, and we did implement such changes for ext3+htree directories, adding internal locking on each leaf block that scaled the concurrency with the size of the directory. Alas, we don't have any up-to-date kernel patches for this, though the VFS patch was posted to LKML back in Feb 2005 as "RFC: pdirops: vfs patch" http://www.mail-archive.com/linux-fsdevel@xxxxxxxxxxxxxxx/msg01617.html We have better dynamic locking code today (one of Jan's objections about that patch), but the VFS part of the patch is no longer maintained. The ext3+htree patch was also posted "[RFC] parallel directory operations" http://www.ussg.iu.edu/hypermail/linux/kernel/0307.1/att-0041/03-ext3-pdirops.patch Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html