On Aug 04, 2008  11:03 +1000, Neil Brown wrote:
> So, given that background, it is possible to see some more possible
> solutions (other than the ones already mentioned).
>
>  - drop the internal lock across filldir.
>    It could be seen as impolite to hold any locks across a callback
>    that are not documented as being held.
>    This would put an extra burden on the filesystem, but it shouldn't
>    be a particularly big burden.
>    A filesystem needs to be able to find the 'next' entry from a
>    sequential 'seek' pointer, so that is the most state that needs to
>    be preserved.  It might be convenient to be able to keep more state
>    (pointers into data structures etc).  All this can be achieved with
>    fairly standard practice:
>      a/ have a 'version' number per inode which is updated when any
>         internal restructure happens.
>      b/ when calling filldir, record the version number, drop the lock,
>         call filldir, reclaim the lock, check the version number
>      c/ if the version number matches (as it mostly will) just keep
>         going.  If it doesn't, jump back to the top of readdir where
>         we decode the current seek address.
>
>    Some filesystems have problems implementing seekdir/telldir, so they
>    might not be totally happy here.  I have little sympathy for such
>    filesystems and feel the burden should be on them to make it work.
>
>  - use i_mutex to protect internal changes too, and drop i_mutex while
>    doing internal restructuring.  This would need some VFS changes so
>    that dropping i_mutex would be safe.  It would require some way to
>    lock an individual dentry.  Probably you would lock it under
>    i_mutex by setting a flag bit, wait for the flag on some inode-wide
>    waitqueue, and drop the lock by clearing the flag and waking the
>    waitqueue.  And you are never allowed to look at ->d_inode if the
>    lock flag is set.

When we were working on scaling the performance of concurrent operations in a single directory, we added hashed dentry locks instead of using i_mutex (well, i_sem in those days) to lock the whole directory.  To make the change manageable we replaced the direct i_sem locking on the directory inode with ->lock_dir() and ->unlock_dir() methods, defaulting to just down() and up() on i_sem, but allowing the filesystem to override this with a per-entry lock hashed on the child dentry.  This allowed Lustre servers to create/lookup/rename/remove many entries in a single directory concurrently, and I think the same approach could be useful in this case as well.

This lets filesystems that need their own brand of locking bypass i_mutex, while leaving the majority of filesystems untouched.  It also has the benefit that filesystems that need improved multi-threaded performance in a single directory (e.g. JFFS2 or XFS, or anything serving HPC or MTA workloads) have the ability to implement it.

There is definitely some work needed inside the filesystem to take advantage of the increased parallelism.  We did implement such changes for ext3+htree directories, adding internal locking on each leaf block, so the concurrency scaled with the size of the directory.

Alas, we don't have any up-to-date kernel patches for this.  The VFS patch was posted to LKML back in Feb 2005 as "RFC: pdirops: vfs patch":
http://www.mail-archive.com/linux-fsdevel@xxxxxxxxxxxxxxx/msg01617.html
We have better dynamic locking code today (one of Jan's objections to that patch), but the VFS part of the patch is no longer maintained.
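I don't have the patch handy, but the VFS side was roughly shaped like the sketch below.  The ->lock_dir()/->unlock_dir() names match what I described above; the helper names, the inode_operations members, and the exact hand-off to the filesystem are from memory, so treat this as an illustration rather than the actual patch:

	/*
	 * Rough sketch of the pdirops-style hooks, from memory -- not the
	 * code that was actually posted.  The VFS would call these helpers
	 * wherever it previously took i_sem/i_mutex on the directory for a
	 * namespace operation on one child.
	 */
	static inline void vfs_lock_dir(struct inode *dir, struct dentry *child)
	{
		if (dir->i_op->lock_dir)
			/* e.g. take a lock hashed on child->d_name so that
			 * operations on unrelated names proceed in parallel,
			 * instead of serializing the whole directory */
			dir->i_op->lock_dir(dir, child);
		else
			mutex_lock(&dir->i_mutex);  /* default: what down(&dir->i_sem) was in 2005 */
	}

	static inline void vfs_unlock_dir(struct inode *dir, struct dentry *child)
	{
		if (dir->i_op->unlock_dir)
			dir->i_op->unlock_dir(dir, child);
		else
			mutex_unlock(&dir->i_mutex);
	}

A filesystem that doesn't care simply leaves the methods NULL and gets exactly the old single-lock behaviour, which is why the majority of filesystems are untouched.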
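For what it's worth, Neil's version-number scheme above would look something like the following inside a filesystem's ->readdir.  Every foofs_* name, plus the i_dir_lock and i_dir_version fields, is invented for the example; the point is only that the fs-internal lock is dropped around filldir and the walk restarts from f_pos if the directory was restructured in the meantime:

	/* Sketch of Neil's suggestions (a)-(c); all foofs_* names are made up. */
	static int foofs_readdir(struct file *filp, void *dirent, filldir_t filldir)
	{
		struct inode *dir = filp->f_path.dentry->d_inode;
		struct foofs_inode_info *fi = FOOFS_I(dir);
		struct foofs_dirent *de;
		char name[FOOFS_NAME_MAX];
		loff_t pos, next_pos;
		u64 version, ino;
		unsigned char dtype;
		int namelen;

	restart:
		mutex_lock(&fi->i_dir_lock);		/* fs-internal lock, not i_mutex */
		de = foofs_entry_at(dir, filp->f_pos);	/* decode the seek address */
		while (de != NULL) {
			/* copy out what filldir needs: 'de' may be freed or
			 * moved once we drop our lock */
			namelen = de->namelen;
			memcpy(name, de->name, namelen);
			ino = de->ino;
			dtype = de->file_type;
			pos = foofs_pos_of(de);		/* seekdir/telldir-stable cookies */
			next_pos = foofs_pos_after(de);

			version = fi->i_dir_version;	/* (a)+(b) remember the tree state */
			mutex_unlock(&fi->i_dir_lock);	/* don't hold it across the callback */

			if (filldir(dirent, name, namelen, pos, ino, dtype))
				return 0;		/* caller's buffer is full */
			filp->f_pos = next_pos;

			mutex_lock(&fi->i_dir_lock);
			if (fi->i_dir_version == version) {
				de = foofs_next(dir, de);  /* (c) nothing moved, keep going */
				continue;
			}
			/* (c) the tree was restructured while we were out, so
			 * 'de' is stale: go back and re-decode the new f_pos */
			mutex_unlock(&fi->i_dir_lock);
			goto restart;
		}
		mutex_unlock(&fi->i_dir_lock);
		return 0;
	}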
The ext3+htree patch was also posted as "[RFC] parallel directory operations":
http://www.ussg.iu.edu/hypermail/linux/kernel/0307.1/att-0041/03-ext3-pdirops.patch

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.