Re: topics for the file system mini-summit

Andreas Dilger <adilger@xxxxxxxxxxxxx> · Wed, 31 May 2006 23:36:28 -0600

On May 31, 2006  19:19 -0700, Valerie Henson wrote:
> I don't think a block group is a good enough fault isolation domain -
> think hard links.  What I think we need is normal file system
> structures when you are referencing stuff inside your fault isolation
> domain, and something more complicated if you have to reference stuff
> outside.  One of Arjan's ideas involves something we're calling
> continuation inodes - if the file's data is stored in multiple
> domains, it has a separate continuation inode in each domain, and each
> continuation inode has all the information necessary to run a full
> fsck on the data inside that domain.  Similarly, if a directory has a
> hard link to a file outside its domain, we'll have to allocate a
> continuation inode and dir entry block in the domain containing the
> file.  The idea is that you can run fsck on a domain without having to
> go look outside that domain.  You may have to clean up a few things in
> other domains, but they are easy to find and don't require an fsck in
> other domains.

This sounds very much like the approach Lustre has taken for clustered
metadata servers (CMD), which was developed as an advanced prototype
last year, and is being reimplemented for production now.

In "regular" (non-CMD) Lustre there is a single metadata target (MDT)
which holds all of the namespace (directories, filenames, inodes), and
the inodes have EA metadata that tells users of those files which other
storage targets (OSTs) hold the file data (RAID 0 stripe currently).
OSTs are completely self-contained ext3 filesystems, as is the MDT.

In the prototype CMD Lustre there are multiple metadata targets that
make up a single namespace.  Generally, each directory and the inodes
therein are kept on a single MDT, but a directory can contain a
cross-MDT inode reference in three cases: large directories (> 64k
entries, which are split across MDTs by the hash of the filename),
hard links, and renames.

The cross-MDT reference is implemented by storing a special dirent
in the directory which tells the caller which other MDT actually has
the inode.  The remote inode itself is held in a private "MDT object"
directory on that MDT so that it has a local filesystem reference and
can be looked up by a special filename derived from the inode number.
I believe the source MDT is also recorded (either in the filename or
in the private directory) to keep the link count correct.
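
As a rough illustration of the lookup path described above, here is a
minimal Python sketch.  It is not Lustre code: the class names, the
object_name() format ("<source MDT>:<inode number>") and the in-memory
dictionaries are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class LocalDirent:
    ino: int            # inode lives on the same MDT as the directory


@dataclass
class RemoteDirent:
    mdt: int            # index of the MDT that actually holds the inode
    ino: int


@dataclass
class MDT:
    index: int
    nlink: Dict[int, int] = field(default_factory=dict)   # ino -> link count
    objdir: Dict[str, int] = field(default_factory=dict)  # private "MDT object" directory
    dirs: Dict[int, Dict[str, Union[LocalDirent, RemoteDirent]]] = field(
        default_factory=dict)                              # dir ino -> entries


def object_name(src_mdt: int, ino: int) -> str:
    # Hypothetical naming scheme: source MDT index plus inode number.
    return f"{src_mdt}:{ino}"


def link_remote(src: MDT, dir_ino: int, name: str, dst: MDT, ino: int) -> None:
    """Make `name` in a directory on `src` refer to inode `ino` on `dst`."""
    dst.objdir[object_name(src.index, ino)] = ino   # local reference on dst
    dst.nlink[ino] = dst.nlink.get(ino, 0) + 1
    src.dirs[dir_ino][name] = RemoteDirent(dst.index, ino)


def lookup(mdts: List[MDT], mdt: int, dir_ino: int, name: str) -> Tuple[int, int]:
    """Return (MDT index, inode number) for `name`, following a remote dirent."""
    ent = mdts[mdt].dirs[dir_ino][name]
    if isinstance(ent, LocalDirent):
        return mdt, ent.ino
    # Cross-MDT case: resolve through the object directory on the other MDT.
    return ent.mdt, mdts[ent.mdt].objdir[object_name(mdt, ent.ino)]
```

The point of the sketch is only that the inode on the second MDT always
has a reference that is local to its own filesystem, so each MDT can be
checked without consulting the others.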

This allows each MDT filesystem to be internally consistent, and the
cross-MDT dirents are treated by e2fsck much the same as symlinks in
the sense that a dangling reference is non-fatal.  There is (or at
least was a design for the CMD prototype) a second-stage tool which
would get a list of cross-MDT references that it could correlate with
the MDT object directory inodes on the other MDTs and fix up refcounts
or orphaned inodes.
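
To make the second-stage pass concrete, here is a small Python sketch
of the correlation step.  The input formats (a list of reference tuples
and a dictionary of recorded link counts) and the function name are
assumptions for illustration, not the design of the actual tool.

```python
from collections import Counter


def correlate(refs, objects):
    """Cross-check cross-MDT dirents against MDT object directories.

    refs:    iterable of (src_mdt, dst_mdt, ino) tuples, one per cross-MDT
             dirent found while scanning each MDT.
    objects: dict mapping (dst_mdt, ino) -> link count recorded for the
             inode held in that MDT's object directory.

    Returns (dangling, orphans, fixes):
      dangling: targets that are referenced but have no object
      orphans:  objects that no dirent references
      fixes:    targets whose recorded count differs from the observed
                count, mapped to the observed count
    """
    seen = Counter((dst, ino) for _src, dst, ino in refs)
    dangling = sorted(k for k in seen if k not in objects)
    orphans = sorted(k for k in objects if k not in seen)
    fixes = {k: n for k, n in seen.items()
             if k in objects and objects[k] != n}
    return dangling, orphans, fixes
```

For example, with references [(0, 1, 7), (2, 1, 7), (0, 1, 9)] and
objects {(1, 7): 1, (1, 8): 1}, the sketch reports (1, 9) as dangling,
(1, 8) as an orphan, and a corrected count of 2 for (1, 7).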

In the case of "split directories", which are implemented in order to
load-balance metadata operations across multiple MDTs, there was also a
need to migrate directory entries to other MDTs when the directory
splits.  That is done only once, when the directory grows beyond 64k
entries, in order to limit the number of cross-MDT entries in the
directory and to get the added parallelism as soon as possible.  After
the initial split, new dirents and their inodes are created together
within a single MDT, though there are several directory "stripes" on
multiple MDTs running in parallel.
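
A minimal Python sketch of the one-time split follows.  It is an
illustration only: zlib.crc32 stands in for whatever filename hash is
really used, the class and method names are invented, and the
directory is assumed to start out on MDT 0.

```python
import zlib


class SplitDir:
    """One directory that splits once across n_mdts stripes."""

    def __init__(self, n_mdts, threshold=64 * 1024):
        self.n_mdts = n_mdts
        self.threshold = threshold          # split after this many entries
        self.split = False
        # stripes[i] maps name -> (MDT holding the inode, inode number)
        self.stripes = [{} for _ in range(n_mdts)]

    def _hash(self, name):
        # Stand-in hash; the real hash function is not specified here.
        return zlib.crc32(name.encode("utf-8")) % self.n_mdts

    def stripe_of(self, name):
        return self._hash(name) if self.split else 0

    def add(self, name, ino):
        s = self.stripe_of(name)
        # The inode is created on the same MDT as its dirent.
        self.stripes[s][name] = (s, ino)
        if not self.split and len(self.stripes[0]) > self.threshold:
            self._do_split()

    def _do_split(self):
        old, self.stripes[0] = self.stripes[0], {}
        self.split = True
        for name, ref in old.items():
            # Only the dirent migrates; the inode stays on MDT 0, so a
            # migrated entry on another stripe is a cross-MDT reference.
            self.stripes[self._hash(name)][name] = ref

    def lookup(self, name):
        return self.stripes[self.stripe_of(name)].get(name)

    def cross_mdt_entries(self):
        return sum(1 for s, stripe in enumerate(self.stripes)
                   for mdt, _ino in stripe.values() if mdt != s)
```

In this model the number of cross-MDT entries is fixed at split time
and does not grow afterwards, since every later create puts the dirent
and the inode on the same stripe.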

The same methods used to do dirent migration were also used for handling
renames across directories on multiple MDTs.  The basics are that the
target filesystem needs to export separate primitives for creating and
deleting an inode and for adding and removing a directory entry (the
Linux VFS bundles these together).
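
The following Python sketch shows what such a split interface might
look like and how a cross-MDT rename could be composed from it.  The
Target class, its method names, and the add-before-remove ordering are
assumptions for the example; nothing here models transactions or
recovery.

```python
class Target:
    """One MDT exposing inode and dirent operations separately."""

    def __init__(self, index):
        self.index = index
        self.next_ino = 1
        self.inodes = {}    # ino -> link count
        self.dirs = {}      # directory ino -> {name: (MDT index, ino)}

    # Inode primitives
    def create_inode(self):
        ino = self.next_ino
        self.next_ino += 1
        self.inodes[ino] = 0
        return ino

    def destroy_inode(self, ino):
        if self.inodes[ino] != 0:
            raise ValueError("inode still has links")
        del self.inodes[ino]

    # Directory entry primitives
    def add_dirent(self, dir_ino, name, ref):
        entries = self.dirs[dir_ino]
        if name in entries:
            raise FileExistsError(name)
        entries[name] = ref

    def remove_dirent(self, dir_ino, name):
        return self.dirs[dir_ino].pop(name)

    def mkdir(self):
        ino = self.create_inode()
        self.dirs[ino] = {}
        return ino


def create_file(t, dir_ino, name):
    # The usual bundled create, built from the two separate primitives.
    ino = t.create_inode()
    t.inodes[ino] += 1
    t.add_dirent(dir_ino, name, (t.index, ino))
    return ino


def rename(src, src_dir, old, dst, dst_dir, new):
    # Only dirents move; the inode stays on the MDT that created it.
    ref = src.dirs[src_dir][old]
    dst.add_dirent(dst_dir, new, ref)   # add first so the inode is never unreferenced
    src.remove_dirent(src_dir, old)
```

After a rename from a directory on one target to a directory on
another, the new dirent carries the index of the original target, which
is the same kind of cross-MDT reference used for dirent migration.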

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
