On May 31, 2006 19:19 -0700, Valerie Henson wrote:
> I don't think a block group is a good enough fault isolation domain -
> think hard links. What I think we need is normal file system
> structures when you are referencing stuff inside your fault isolation
> domain, and something more complicated if you have to reference stuff
> outside. One of Arjan's ideas involves something we're calling
> continuation inodes - if the file's data is stored in multiple
> domains, it has a separate continuation inode in each domain, and each
> continuation inode has all the information necessary to run a full
> fsck on the data inside that domain. Similarly, if a directory has a
> hard link to a file outside its domain, we'll have to allocate a
> continuation inode and dir entry block in the domain containing the
> file. The idea is that you can run fsck on a domain without having to
> go look outside that domain. You may have to clean up a few things in
> other domains, but they are easy to find and don't require an fsck in
> other domains.

This sounds very much like the approach Lustre has taken for clustered
metadata servers (CMD), which was developed as an advanced prototype
last year, and is being reimplemented for production now.

In "regular" (non-CMD) Lustre there is a single metadata target (MDT)
which holds all of the namespace (directories, filenames, inodes), and
the inodes have EA metadata that tells users of those files which other
storage targets (OSTs) hold the file data (RAID 0 stripe currently).
OSTs are completely self-contained ext3 filesystems, as is the MDT.

In the prototype CMD Lustre there are multiple metadata targets that
make up a single namespace. Generally, each directory and the inodes
therein are kept on a single MDT but in the case of large directories
(> 64k entries, which are split across MDTs by the hash of the
filename), hard links, or renames it is possible to have a cross-MDT
inode reference in a directory.
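To make the hash-based split concrete, here is a minimal sketch of how a
filename might be mapped to an MDT. It is only an illustration under
stated assumptions, not Lustre's code: the names mdt_for_name and
is_split are hypothetical, CRC32 stands in for whatever hash the real
implementation uses, and "64k" is read as 64 * 1024.

```python
import zlib

# Assumption: "64k entries" is interpreted as 64 * 1024.
SPLIT_THRESHOLD = 64 * 1024


def is_split(entry_count: int) -> bool:
    """A directory counts as split only once it grows beyond the threshold."""
    return entry_count > SPLIT_THRESHOLD


def mdt_for_name(name: str, home_mdt: int, n_mdts: int, entry_count: int) -> int:
    """Return the index of the MDT that would hold the dirent for `name`.

    Hypothetical helper: small directories stay entirely on their home MDT;
    split directories spread entries by a hash of the filename.
    """
    if n_mdts <= 0:
        raise ValueError("need at least one MDT")
    if not is_split(entry_count):
        return home_mdt
    # CRC32 is a stand-in hash chosen for illustration only.
    return zlib.crc32(name.encode("utf-8")) % n_mdts
```

In this sketch the placement of an entry in a split directory depends
only on the filename and the number of MDTs, so any client can compute
which MDT to ask without consulting the directory's home MDT first.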

The cross-MDT reference is implemented by storing a special dirent in
the directory which tells the caller which other MDT actually has the
inode. The remote inode itself is held in a private "MDT object"
directory so that it has a local filesystem reference and can be looked
up by a special filename that is derived from the inode number and, I
believe, the source MDT (either in the filename or the private
directory), to keep the link count correct.

This allows each MDT filesystem to be internally consistent, and the
cross-MDT dirents are treated by e2fsck much the same as symlinks in the
sense that a dangling reference is non-fatal. There is (or at least
there was a design for, in the CMD prototype) a second-stage tool which
would get a list of cross-MDT references that it could correlate with
the MDT object directory inodes on the other MDTs and fix up refcounts
or orphaned inodes.

In the case of "split directories", which are implemented in order to
load-balance metadata operations across multiple MDTs, there was also a
need to migrate directory entries to other MDTs when the directory
splits. That was only done once, when the directory grows beyond 64k
entries, in order to limit the number of cross-MDT entries in the
directory and to get the added parallelism involved as soon as possible.
After the initial split, new dirents and their inodes are created
together within a single MDT, though there are several directory
"stripes" on multiple MDTs running in parallel.

The same methods used to do dirent migration were also used for handling
renames across directories on multiple MDTs. The basics are that there
need to be separate target filesystem primitives exported for creating
and deleting an inode, and for adding or removing an entry from a
directory (which are bundled together in the Linux VFS).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.