Hi all, Since everybody seems to be having fun building new filesystems these days, I thought I should join the party. Tux3 is the spiritual and moral successor of Tux2, the most famous filesystem that was never released.[1] In the ten years since Tux2 was prototyped on Linux 2.2.13 we have all learned a thing or two about filesystem design. Tux3 is a write anywhere, atomic commit, btree based versioning filesystem. As part of this work, the venerable HTree design used in Ext3 and Lustre is getting a rev to better support NFS and possibly become more efficient. The main purpose of Tux3 is to embody my new ideas on storage data versioning. The secondary goal is to provide a more efficient snapshotting and replication method for the Zumastor NAS project, and a tertiary goal is to be better than ZFS. Tux3 is big endian as the Great Penguin intended. General Description In broad outline, Tux3 is a conventional node/file/directory design with wrinkles. A Tux3 inode table is a btree with versioned attributes at the leaves. A file is an inode attribute that is a btree with versioned extents at the leaves. Directory indexes are mapped into directory file blocks as with HTree. Free space is mapped by a btree with extents at the leaves. The interesting part is the way inode attributes and file extents are versioned. Unlike the currently fashionable recursive copy on write designs with one tree root per version[2], Tux3 stores all its versioning information in the leaves of btrees using the versioned pointer algorithms described in detail here: http://lwn.net/Articles/288896/ This method promises a significant shrinkage of metadata for heavily versioned filesystems as compared to ZFS and Btrfs. The distinction between Tux3 style versioning and WAFL style versioning used by ZFS and Btrfs is analogous to the distinction between delta and weave encoding for version control systems. In fact Tux3's pointer versioning algorithms were derived from a binary weave technique I worked on earlier for version control back in the days when we were all racing for the glory of replacing the proprietary Bitkeeper system for kernel version control.[3] Filesystem Characteristics and Limits * Versioning of individual files, directories or entire filesystem * Remote replication of single files, directories or entire filesystem * All versions (aka snapshots) writable * 2^60 maximum file size * 2^60 maximum volume size * 2^48 maximum versions * 2^48 maximum inodes * Variable sized, dynamically allocated inodes * New versioning method (Versioned pointers) - versioned extents for file/directory data - versioned standard attributes (e.g. mode, uid, mtime, size) - versioned extended attributes (including immediate file data) * New atomic update method (Forward logging) * New physically stable directory index (PHTree) * Btree backpointers for robust fsck Forward Logging Atomic update in Tux3 is via a new method called forward logging that avoids the double writes of conventional journalling and the recursive copy behavior of copy on write. Each update transaction is a series of data blocks followed by a commit block. Each commit block either stores the location where the next commit block will be if it is known, otherwise it is the end of a chain of commits. The start of each chain of commits is referenced from the superblock. Commit data may be logical or physical. Logical updates are first applied to the target structure in memory (usually a inode table block or file index block) then the changed blocks are logged physically before being either applied to the final destination or implicitly merged with the destination via logical logging. This multi level logging sounds like a lot of extra writing, but it is not because the logical updates are more compact than physical block updates and many of them can be logged with a periodic rollup pass to perform the physical or higher level logical updates. A typical write transaction therefore looks like a single data extent followed by one commit block, which can be written anywhere on the disk. This is about as efficient as it is possible for an atomic update to be. Fragmention Control Versioning filesystems on rotating media are prone to fragmentation. With a write-anywhere strategy, we have a great deal of latitude in choosing where to write, but translating that ability into minimzing read seeks is far from easy. For metadata, we can fall back to update in place using the forward logging method acting as a "write twice" journal. Because the metadata is small (at least when the filesystem is not heavily versioned) sufficient space of the single commit block required to logically log a metadata update will normally be available near the original location and the update in place fallback will not often be needed. When there is no choice but to write new data far away from the original location, a method called "write bouncing" is to be used, where the allocation target is redirected according to a generating function to a new target zone some distance away from the original one. If that zone is too crowded, then the next candidate zone further away will be checked and so on, until the entire volume has been checked for available space. (Analogous to a quadratic hash.) What this means is, even though seeking to changed blocks of a large file is not entirely avoided, at least it can be kept down to a few seeks in most cases. File readahead will help a lot here, because a number of outlying extents can be picked up in one pass. Physical readahead would help even more, to deal with cross-file fragmentation in a directory. With flash storage, seek time is essentially zero and transfer bandwidth is the dominant issue, at which Tux3 should excel. Inode attributes An inode is a variable sized item indexed by its inode number in the inode btree. It consists of a list of attributes blocks with standard attributes are grouped together according their frequency of updating, and extended attributes. Each standard attribute block carries a version label at which the attribute was last changed. Extended attributes have the same structure as files, and in fact file data is just an extended attribute. Extended attributes are not versioned in the inode, but at the index leaf blocks. The atime attribute is handled separately from the inode table to avoid polluting the inode table with versions generated by mere filesystem reads. Unversioned attributes: Block count - Block sharing makes it difficult to calculate so just give the total block count for the data attribute btree Standard attribute block: mode uid gid Write attribute block: size - Update with every extending write or truncate mtime - update with every change Data attribute: Either immediate file data or root of a btree index with versioned extents at the leaves Immediate data attributes: immediate file data symlink version link (see below) Unversioned reference to versioned attributes: xattrs - version:atom:datalen:data file/directory data Versioned link count: The inode can be freed when link counts of all versions are zero None of the above: atime - Update with every read - separate versioned btree Note: an inode is never reused unless it is free in all versions. Atom table Extended attributes are tagged with attribute "atoms" held in a global, unversioned atom table, to translate attribute names into compact numbers. Directory Index The directory index scheme for Tux3 is PHTree, which is a (P)physically stable variant of HTree, obtained by inserting a new layer of index blocks between the index nodes and the dirent blocks, the "terminal" index blocks. Each terminal index block is populated with [hash, block] pairs, each of which indicates that there is a dirent in <block> with hash <hash>. Thus there are two "leaf" layers in a PHTree: 1) the terminal nodes of the index btree and 2) the directory data blocks containing dirents. This requires one extra lookup per operation versus HTree, which is regretable, but it solves the physical stability problem that has caused so much grief in the past with NFS support. With PHTree, dirent blocks are never split and dirents are never moved, allowing the logical file offset to serve as a telldir/seekdir position just as it does for primitive filesystems like Ext2 and UFS, on which the Posix semantics of directory traversal are sadly based. There are other advantages to physical stability of dirents besides supporting brain damaged NFS directory traversal semantics: because dirents and inodes are initially allocated in the same order, a traversal of the leaf blocks in physical order to perform deletes etc, will tend to access the inodes in ascending order, which reduces cache thrashing of the inode table, a problem that has been observed in practice with schemes like HTree that traverse directories in hash order. Because leaf blocks in PHTree are typically full instead of 75% full as in HTree, the space usage ends up about the same. PHTree does btree node merging on delete (which HTree does not) so fragmentation of the hash key space is not a problem and a slightly less uniform but far more efficient hash function can be used, which should deliver noticeably better performance. HTree always allocates new directories entries into btree leaf nodes, splitting them if necessary, so it does not have to worry about free space management at all. PHTree does however, since gaps in the dirent blocks left by entry deletions have to be recycled. A linear scan for free space would be far too inefficient, so instead, PHTree uses a lazy method of recording the maximum sized dirent available in each directory block. The actual largest free dirent may be smaller, and this will be detected when a search fails, causing the lazy max to be updated. We can safely skip searching for free space in any block for which the lazy max is less than the needed size. One byte is sufficient for the lazy max, so one 4K block is sufficient to keep track of 2^12 * 2^12 bytes worth of directory blocks, a 16 meg directory with about half a million entries. For larger directories this structure becomes a radix tree, with lazy max recorded also at each index pointer for quick free space searching without having to examine every lazy map. Like HTree, a PHTree is a btree embedded in the logical blocks of a file. Just like a file, a directory is read into a page cache mapping as its blocks are accessed. Except for cache misses, the highly efficient page cache radix tree mechanism is used to resolve btree pointers, avoiding many file metadata accesses. A second consequence of storing directory indexes in files is that the same versioning mechanism that versions a file also versions a directory, so nothing special needs to be done to support namespace versioning in Tux3. Scaling to large number of versions What happens as number of versions becomes very large is something of a worry. Then a lot of metadata for unrelated versions may have to be loaded, searched and edited. A relatively balanced symmetric version tree can be broken up into a number of subtrees. Sibling subtrees cannot possibly affect each other. O(log(subtrees)) subtrees need to be loaded and operated on for any given version. What about scaling of completely linear version chain? Then data is heavily inherited and thus compressed. What if data is heavily versioned and therefore not inherited much? Then we should store elements in stable sort order and binsearch, which works well in this case because not many parents have to be searched. I have convinced myself that scaling to arbitrary numbers of versions will cause worst case slowdown of no more than O(log(versions)) using the method described above. When I say "large number of versions" I mean "more than a few hundred", so it is not an pressing issue for today's versioning filesystem applications, which mainly use their versioning capability to implement backup and replication. Filesystem expansion and shrinking (What could possibly go wrong?) Multiple Filesystems sharing the same Volume This is just a matter of providing multiple inode btrees sharing the same free tree. Not much of a challenge, and somebody may have a need for it. Is there really any advantage if the volume manager and filesystem already support on-demand expansion and shrinking? Quotas (Have not thought about it yet. Quotas should be comprehensive, fine grained and deadlock free.) New user interfaces for Version Control The standard method for accessing a particular version of a volume is via a version option on the mount command. But it is also possible to access file versions via several other methods, including a new variant of the open syscall with a version tag parameter. "Version transport" allows the currently mounted version to be changed to some other. All open files will continue to access the version under which they were opened, but newly opened files will have the new version. This is the "Git Cache" feature. Tux3 introduces the idea of a version link, similar to a symlink, but carrying a version tag to allow the named file to be opened for some other version than the currently mounted version. Like symlinks, it is not required that the referenced object be valid. Version links do not introduce any new inter-version consistency requirement, and are therefore robust. Unlike symlinks, version links are not followed by default. This makes it easy to implement a Netapp-like feature of a hidden .snapshot subdirectory in each directory through which periodic snapshots can be accessed by a user. Summary of data structures Superblock Only fixed fs attributes and pointers to metablocks Metablocks Like traditional superblocks, but containing only variable data Distributed across volume, all read on start Contain variable fields, e.g., forward logs Inode table Versioned standard attributes Versioned extended attributes Versioned data attribute Nonversioned file btree root Atime table Btree tree versioned at the terminal index nodes leaf blocks are arrays of 32 bit atimes Free tree Btree with extents at the leaves subtree free space hints at the nodes Atom table A btree much like a directory mapping attribute names to internal attribute codes (atoms). Maybe it should just be a directory like any other? Forward log commit block Hash of transaction data Magic Seq Rollup - which previous log entries to ignore Data blocks Directory Index Embedded in logical blocks of directory file, therefore automatically versioned Implementation Implementation work has begun. Much of the work consists of cutting and pasted bits of code I have developed over the years, for example, bits of HTree and ddsnap. The immediate goal is to produce a working prototype that cuts a lot of corners, for example block pointers instead of extents, allocation bitmap instead of free extent tree, linear search instead of indexed, and no atomic commit at all. Just enough to prove out the versioning algorithms and develop new user interfaces for version control. The biggest single piece of prototype work remaining to go from a simplified prototype to the filesystem described here is to extend the versioned pointer algorithms to handle versioned extents, a challenging bit of hacking indeed. Transaction management at the VFS method level is also expected to be a major cause headaches just as it was for Ext3. Btree methods are pretty much under control. The Tux3 project home is here: http://tux3.org/ A mailing list is here: http://tux3.org/cgi-bin/mailman/listinfo/tux3 All interested parties welcome. Hackers especially welcome. Prototype code proving the versioning algorithms is here: http://tux3.org/source/version.c A Mercurial tree is coming soon. Regards, Daniel [1] For the whole story: google "evil+patents+sighted" [2] Copy on write versioning, which I had a hand in inventing. [3] Linus won. A major design element of Git (the directory manifest) was due to me, and of course Graydon Hoare (google Quicksort) deserves more credit than anyone. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html