After a couple of weeks of painstakingly slow progress, Tux3 is now close enough to having extents that I feel it is time to sit back, limber up the proverbial knuckle joints for some heavy typing, and launch yet another Tux3 design missive into the wild blue yonder.

What is all this noise about extents, and why is everybody doing it? And what is an extent, anyway? Traditional filesystems record a pointer to each individual disk block, whereas modern extent-based filesystems record pointers to runs of disk blocks, each such extent having a starting address and a count of the data blocks that store all or part of some file. A file may consist of many extents, and may be sparse, which requires that logical addresses also be stored in order to know the position of each extent within the file. An extent can therefore be thought of as a triple {logical address, physical address, block count} or, alternatively, as a pair {physical address, block count} indexed by a logical address. Tux3 uses the latter approach, while Ext4 uses the former.

Extents themselves are simple things: just add a count to a block pointer and you have an extent, which is exactly what we did in Tux3. The extent count is packed into six unused bits of a 64-bit physical block pointer. "For free," you could say. On the other hand, the algorithms that handle extents are far from simple. It is amazing how much complexity jumped out of the woodwork when that innocent little count field appeared. For one thing, extents are variable-sized logical objects, so you cannot store them in a radix tree as with the classic direct/indirect/double/triple scheme Ext2 inherited from UFS. A considerably more complex b-tree scheme is required. Well, we had the b-trees anyway for other reasons (variable-sized inodes, multiple block pointers per logical address), but that is just one of the issues. Extents introduce a large number of new boundary conditions into file allocation and access code.
To illustrate, consider the complexity of rewriting a region of a file on top of some number of pre-existing extents, which can have gaps between them. We want to overwrite some extents and allocate new extents to fill in the gaps. Maybe we want to enlarge some existing extents where possible, or delete some of the pre-existing extents in favor of a larger, newly allocated contiguous region. Maybe the beginning or end of the region we are writing lands in the middle of a pre-existing extent, and we have to handle that. This is a whole lot more detail to worry about than just writing a block at a time, no? Indeed it is.

So why would we want to put ourselves through such pain? The answer is: extents yield huge performance benefits. Recently, somebody said that extents are just a performance hack, and that is true, but then cache is just a performance hack too. Extents are simply the biggest performance optimization to come along for filesystems since... buffer cache. We get all of these benefits from extents:

 * Allocate an entire range of blocks at once - fewer trips into the allocator.
 * Each extent is allocated contiguously - fragmentation is reduced without guesswork.
 * Assign disk addresses to an entire range of a file write at once - fewer walks through the file index blocks.
 * At sync time, form fewer, larger bio transactions that can be merged more efficiently by the disk elevator and require less handling in the block IO path than individual blocks.
 * Fewer edits of file metadata - don't trash the cache.
 * More compact representation on disk - less metadata to load into memory and write back out to disk.
 * Lower ratio of metadata blocks to data blocks - less seeking on read.
 * More compact representation in memory - less cache pressure.
 * Extent data is handled in batches at each step (allocation, index editing, IO submission, freeing), so code execution is well localized - better use of the processor's execution cache.
To quantify some of these benefits, consider that a block-oriented filesystem like Ext3 must read one index block for each thousand data blocks, that is, for each 4 MB of data, which takes about 16 milliseconds to read. If an extra seek is required to pick up the next index block, and then another to get back to where the next data will be read, that is maybe 12 ms on top of the 16 ms linear transfer time, a 75% performance penalty. Tux3 indexes about 85 MB of data per index block, which translates into about 3% seek overhead for a large linear read.

An even more dramatic improvement can be expected with file deletion. Ext3 has to read one metadata block per 4 MB of deleted data in order to find the blocks that should be freed. This overhead is about 1.5 seconds per gigabyte, or 25 minutes per terabyte. Tux3 will reduce that to about 0.07 seconds per gigabyte, or about 1.2 minutes per terabyte, using extents. In both cases, further savings can be had using larger extents, but Tux3's modest 64-block extent limit already promises upwards of 95% of the benefit that is to be had. That said, Tux3 may well add an optimization to support larger extents later. This is not very hard, it is just more code, but for the time being we have a better use in mind for those bits we saved (versioning).

As mentioned above, the one big drawback of extents is that they are hard to code and get right. That is why they have only slowly appeared in our grassroots open source filesystems on Linux. Big iron Unix vendors have known about the benefits of extents for decades, and have had the workloads to showcase them. Well, now we have those loads on Linux too. Time to get our collective act together.

I did not get into the details of Tux3's extent design because this post is already long enough. But they are interesting, at least I think so, and deserve a post of their own. Please stay tuned. And please keep in mind: Halloween is fast approaching, and therefore so is the Tux3 Halloween Cabal party.
Regards,

Daniel