On Jun 11, 2006  18:00 +0200, Arjan van de Ven wrote:
> On Fri, 2006-06-09 at 21:44 +0100, Alan Cox wrote:
> > OTOH the number of complaints about this is minimal, people want to go
> > forwards in a controlled manner not backwards.
>
> well... they want to be able to go "a little bit" backwards; say one
> version of an OS (6 months). Eg the scenario that ought to work is "go
> to newer version, hate it, go back". But yes that's a limited time to go
> back, not the "go back to 2.2" kind of "go back".

Interestingly, one of the reasons we want(ed) to get the extents code into
the ext3 mainline ASAP is that this would allow it to be available for the
"go back" phase when (in a couple of years) you NEED to have support for
gigantic block devices and have no choice but use this code to update.  For
today it would only be used by people who really want to use it.

On Jun 11, 2006  18:02 +0200, Arjan van de Ven wrote:
> On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
> > PRECISELY.  So you should stop modifying a filesystem whose design is
> > admittedly _not_ modern!
> >
> > ext3 is already essentially xiafs-on-life-support, when you consider
> > today's large storage systems and today's filesystem technology.  Just
> > look at the ugly hacks needed to support expanding an ext3 filesystem
> > online.
>
> actually I think I disagree with you. One thing I've noticed over the
> years is that ext2 layout has one thing going for it: it is simple and
> robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
> direct/indirect block based" may be better. It seems that once you go
> into tree space (and I would call htree a borderline thing there) you
> get both really complex code and fragile behavior all over (mostly in
> terms of "when something goes wrong")

You're correct in calling htree a borderline case, because the directory
metadata is still accessible in a "linear" manner if the tree is corrupted
for some reason.
I've recently been thinking of making the structure even more robust by
encoding a singly- or doubly-linked list into the directory leaf blocks.

However, the direct/indirect block tree is the most fragile part of
ext2/ext3.  It also has the bad effect that corruption in the file indirect
tree can easily amplify into widespread filesystem corruption, because
wrongly freeing an indirect block and reallocating it will potentially cause
1024 more blocks to be freed when that indirect block is unlinked, etc.
This is also the slowest part of e2fsck checking if it detects corruption
(duplication) in the block allocation.

When we had very small filesystems it was easy to tell if an indirect block
was corrupt, because the valid block numbers made up only a small fraction
of the 2^32 possible block numbers.  However, with large filesystems valid
block numbers make up a large fraction of the 2^32 block number space.  As
we get to 16TB filesystems (2^32 blocks of 4kB) it is impossible to tell
when an indirect block is filled with garbage and when it is valid.

One of the features of the extent format is that, firstly, it has a magic
number in each "indirect" block (called an extent index block).  Secondly,
there is enough redundancy that it allows internal validation of the extent
data (e.g. that extents are sequentially increasing logical offsets, that
the parent's logical offset is correctly "encompassing" all of the leaf's
logical offsets).

Finally, one of the features that has been designed into the extent format
(though not yet implemented) is that it is possible to add a checksum to
each extent index to verify the metadata more strongly.  There will also be
space to have a back-pointer to the parent inode for validation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
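
For illustration only, the internal validation described above (a magic
number, sequentially increasing logical offsets, and leaf offsets contained
in the parent's range) could be sketched roughly as follows.  This is not
the ext3 extents code: the names Extent, LeafBlock, check_leaf and the
constant MAGIC are hypothetical, and the on-disk layout is reduced to a
simple in-memory model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical magic value for this sketch; an assumption, not a statement
# about the real on-disk format.
MAGIC = 0xF30A


@dataclass
class Extent:
    logical: int   # first logical block of the file covered by this extent
    length: int    # number of contiguous blocks
    physical: int  # first physical block on disk


@dataclass
class LeafBlock:
    magic: int
    extents: List[Extent]


def check_leaf(leaf: LeafBlock, parent_start: int, parent_end: int,
               fs_blocks: int) -> List[str]:
    """Return a list of problems; an empty list means the leaf looks sane.

    parent_start/parent_end is the logical range [start, end) that the
    parent index entry claims this leaf covers (hypothetical model).
    """
    if leaf.magic != MAGIC:
        # Without the magic number nothing else in the block can be trusted.
        return ["bad magic"]

    problems = []
    prev_end = None
    for i, ext in enumerate(leaf.extents):
        if ext.length <= 0:
            problems.append(f"extent {i}: empty")
            continue
        end = ext.logical + ext.length
        # The parent's logical range must encompass every extent in the leaf.
        if ext.logical < parent_start or end > parent_end:
            problems.append(f"extent {i}: outside parent range")
        # Logical offsets must be sequentially increasing, with no overlap.
        if prev_end is not None and ext.logical < prev_end:
            problems.append(f"extent {i}: overlaps or precedes previous extent")
        # Physical blocks must lie inside the filesystem.
        if ext.physical + ext.length > fs_blocks:
            problems.append(f"extent {i}: beyond end of filesystem")
        prev_end = end
    return problems
```

The last check alone is all that a plain indirect block offers, and on a
filesystem with 2^32 blocks it rejects almost nothing; the magic, ordering
and parent-range checks are what let a garbage block be recognised.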