Hi filesystem aficionados, The Large Hadron Collider no longer has a monopoly on smashing atoms, now that Tux3 has weighed in with a bit of its own atom smashing. Probably the coolest idea that has seen the light of day so far in Tux3 is the concept of xattr atoms: numeric tokens that stand for xattr names. In contrast to pretty much every other filesystem out there, Tux3 looks up an xattr in two stages: 1) Look up the attribute name in the atom table 2) Search for the atom in the inode's xattr cache The sticky issue here is that there is no limit on the number of different xattr names that are possible, which means that there is no limit on the size of the atom table. The fact that the number of distinct names used in practice tends to be very small is, unfortunately, insufficient justification to ignore the fact that nothing prevents the atom table from growing without bound. Unless preventative measures are taken, any unprivileged user could carry out a denial of service attack simply by creating and deleting random xattrs on a file until the entire disk volume is filled with unused xattr names. Addressing this problem required a lot of soul searching and wild design banter. Finally we convinced ourselves that the only way to avoid the alligators in that marsh would be to implement full blown persistent reference counting for xattr names. A reverse mapping table would also be necessary in order to be able to implement listxattr efficiently. Though these things sound scary and complex, it ended up being implemented in a really small amount of nicely efficient code. Here is the main design thread: http://kerneltrap.org/mailarchive/tux3/2008/9/12/3274834/thread "More xattr design details" and here we obsess over the tradeoffs at a level of detail I will spare you from here: http://tux3.org/pipermail/tux3/2008-September/000143.html "The long and short of extended attributes" What is missing from those threads is the end result, alluded to here: http://tux3.org/pipermail/tux3/2008-September/000186.html "Atom refcounting redux" And which I can state more succinctly now that the job is done: the entire extended attribute support for Tux3 is coded in less than 500 lines, except for two pieces that will land later: * A hashed cache of xattr names sitting in front of the atom table. * Transaction support for atom table, reverse map and refcount updates. These two pieces should be about 300 lines at most, or 800 lines total for xattr support, of which maybe 40% is devoted to resolving and refcounting xattr names. The effect of the the hash table will be to tractor beam the xattr lookup performance up from dirop speed (microseconds) to memhash speed (nanoseconds). Transaction support will make all the updates including refcounting and reverse mapping atomic. With the help of Tux3's forward logging accelerator, the overhead for updating refcounts should be masked nicely by the unavoidable cost of getting the xattr data on to disk. In the rosiest scenario, we end up winning by transferring less xattr data to/from disk due to atoms being much smaller than typical xattr names. There is also one really esoteric hack going on to reduce the overhead of logging refcounts: each 32 bit count is divided into two 16 bit halves, with all the low halves stored on one set of map pages, and the high halves stored on a different set. Thus, when many xattrs are being updated at once there will be few carries into the high order pages and disk bandwidth required to update batches of refcounts will be reduced by about half. On top of that, we do not update the refcounts directly, but rather log refcount deltas logically and roll them up into the disk tables at relatively long intervals. Such log entries are tucked into a portion of the transaction commit block in such a way that we get the extra logging more or less for free, by using otherwise idle commit block space. I strongly suspect that Tux3 extended attributes will benchmark as well as existing implementations that just store xattr names directly and repetitively while requiring significantly less space to do it. If this proves out in practice (the test is not all that far in the future) then we are going to be happy about it, and most probably extend the ideas further. Tux3 atom support is not just about deduplicating xattr names: it is about deduplication in general. If it works for xattr names then it will work for bodies as well, and most likely file data too. If the implementation of this xattr atom idea had turned out to be a big rambling mess then I would most likely have just sighed, flushed it, and gone back to tried and true methods. But it did not turn out to be a big mess, quite the contrary. This was accomplished by reusing an existing component: Tux3 atoms are resolved via standard directory lookup (ext2_find_entry) which serves as a placeholder in Tux3 for a proper directory index that will arrive later. Instead of storing an inode number in the directory, we store an atom number. Another complexity-reduction measure involved mapping three atom-related tables into the same file: 1) Atom lookup table (Ext2 directory) 2) Atom refcount table 3) Atom reverse map table (for listxattr) The latter two tables are mapped at block 2^28 in the atom directory file, which has no particular adverse impact on a btree but may stress the kernel's page cache radix tree somewhat, which consequently is forced to be five levels deep as opposed to one or two for ordinary, non-sparse directory files. I doubt this will show up as a performance artifact, but if it does there are several ways to fix it, for example: * Fiddle the page cache radix tree to have a single level of btree index at the top level for sparse files. * Keep an array of direct pointers to the atom table when it is small, which it nearly always will be. * Store the three tables in three separate inodes, big deal. Coincidentally, I learned from plumbing the Btrfs repo that Josef Bacik similarly repurposed a directory mechanism in the direction of xattr support, but in a distinctly different way: Josef stores xattr bodies in a variant of the Btrfs directory (if I read it correctly) indexed by xattr name and inode. On the other hand, Tux3's approach just puts atom numbers in a single, global directory. Atom numbers happen to resemble inode numbers so strongly that ext2_find_entry just worked for this without alteration. We might even use the Ext2 "type" byte field, intended as an ls accelerate, for some purpose or other. Or maybe not! After all, the Ext2 directory scheme is just a placeholder in Tux3, until PHTree lands as described in the original design note: http://tux3.org/design.html "PHTree" It turns out that classic HTree is better for indexing atoms than the cute new PHTree design because there is no directory traversal and thus no physical stability requirement. Also there is no requirement for backward compatibility with Ext2 leaf format, so naturally we will use a more efficient format with a mini-btree mapped into the leaf block for that little extra performance edge and, we hope, nearly the same data density. Of course, the best thing about the Tux3 atom smashing is that we do not require low-temperature superconductors, so it should be considerably more reliable. Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html