Tux3 has grown simpler as it has aged. Back in august when Tux3 was two weeks old, I described what seemed to be a fairly simple filesystem structure: http://kerneltrap.org/Linux/Tux3_Hierarchical_Structure It turned out that a lot of fluff could still be cut. The volume table was dropped for reasons I will get into below, and two internal structures, the version table and allocation bitmap, were recast as normal files. We now have: Superblock Inode table node* Inode table leaf* Inode* Inode attribute*! Data btree attribute Data btree node* Data btree leaf* Extent*! Versioned elements are marked by !, repeated elements by *. Now Tux3 is just a btree of btrees, an inode table btree from which file btrees descend. Here it is in pictures, courtesy of Hirofumi Ogawa's brilliant graphical Tux3 structure dumper: http://tux3.org/images/tux3.structure.png http://tux3.org/images/tux3.structure.svg The only structural elaborations still due to arrive are the log chain for atomic commit and "metablocks" into which frequently updated superblock fields will be moved, giving a form that is likely to serve well for a long time: Superblock Metablock* log chain => log block => log block... Inode table node* Inode table leaf* Inode* Inode attribute*! Data btree attribute Data btree node* Data btree leaf* Extent*! To be sure, one could simplify even further by having just a single btree that includes file data in a single unified key space as Hammer does. But the btree of btrees arrangement maps better to Linux's cache design: we cache inodes, so file operations normally deal with a very shallow btree referenced by a cached inode instead of a btree deep enough to map every object on an entire volume. We are also able to use smaller btree keys for both inode table and files, making the btrees shallower yet. So to me, what we have arrived at feels like the correct balance of "as simple as possible, but no simpler". We now use regular files to represent a number of structural elements: * Allocation map * Version table * Atime table * Xattr atom table Regular files have a number of benefits of "through coded" filesystem structures, including no special purpose code to write and maintain, cache index tree maintained by the VFS, no special allocation handling, no extra filesystem checking and less to worry about when we get to fancy things like filesystem shrinking. Using a regular file as the allocation map seemed kind of unusual at first, but now seems completely natural. Recursions came up, but they would also have come up if the allocation map were a separate structure. Any allocation map that is itself dynamically allocated will have allocation recursions. We use cache and logging techniques to solve these, that are also needed for atomic commit and performance reasons, so no special hack was required. Most of the original design survived intact, even in detail. The variable inode attribute approach worked out well. Inode attributes are serially encoded with a four bit tag, enough to cover all standard inode attributes and a few new ones. This has yielded an exceptionally compact inode represention: currently 54 bytes, and further shrinkage is possible. Variable length attributes are supported, which are currently used for extended attributes and later will be used to encode small files. Most of the structural complexity in Tux3 is located in the btree leaves, particularly file data btree leaves, our dleaf blocks. These use a compact format that is easy to read but tricky to edit, and became more tricky when we added extents. Adding versions to the extents will make that code trickier yet... a lot more tricky. In classic Unix style, we fought this complexity by introducing a stream abstraction: when IO gets out of control, turn it into a stream. Now the high level code is easy to work with and still pretty close to the metal. So what happened to the volume table? I initially thought that multiple subvolumes per physical volume was a great idea, as it would require little more effort than adding a table of tree roots sharing a single allocation map. The hidden cost is sync: suppose you have one big, busy volume, and one small, lightly used volume, and you sync the small one. You don't expect to wait while the big volume syncs as well, do you? But that is exactly what will happen if they share the same allocation map, which has to be synced to disk, and because it must represent a consistent state for all subvolumes, all subvolumes have to be synced too. There are various ways to fix this. For example, each subvolume could have its own allocation space, and for efficiency, allocate space on the volume in chunks. Gee, that sounds like LVM, doesn't it? In the end, I decided that subvolumes are nothing more than LVM volumes, and that is how they should be implemented. If there are issues with LVM and LVM integration with filesystems, that's what we should be fixing, not trying to turn our filesystem into a volume manager. More details here: http://kerneltrap.org/mailarchive/tux3/2008/8/30/3134124 "Feature interaction between multiple volumes and atomic update" Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html