On Tue, Sep 18, 2007 at 06:06:52PM -0700, Linus Torvalds wrote: > > especially as the Linux > > kernel limitations in this area are well known. There's no "16K mess" > > that SGI is trying to clean up here (and SGI have offered both IA64 and > > x86_64 systems for some time now, so not sure how you came up with that > > whacko theory). > > Well, if that is the case, then I vote that we drop the whole patch-series > entirely. It clearly has no reason for existing at all. > > There is *no* valid reason for 16kB blocksizes unless you have legacy > issues. Ok, let's step back for a moment and look at a basic, fundamental constraint of disks - seek capacity. A decade ago, a terabyte of filesystem had 30 disks behind it - a seek capacity of about 6000 seeks/s. Nowdays, that's a single disk with a seek capacity of about 200/s. We're going *rapidly* backwards in terms of seek capacity per terabyte of storage. Now fill that terabyte of storage and index it in the most efficient way - let's say btrees are used because lots of filesystems use them. Hence the depth of the tree is roughly O((log n)/m) where m is a factor of the btree block size. Effectively, btree depth = seek count on lookup of any object. When the filesystem had a capacity of 6,000 seeks/s, we didn't really care if the indexes used 4k blocks or not - the storage subsystem had an excess of seek capacity to deal with less-than-optimal indexing. Now we have over an order of magnitude less seeks to expend in index operations *for the same amount of data* so we are really starting to care about minimising the number of seeks in our indexing mechanisms and allocations. We can play tricks in index compaction to reduce the number of interior nodes of the tree (like hashed indexing in the XFS ext3 htree directories) but that still only gets us so far in reducing seeks and doesn't help at all for tree traversals. That leaves us with the btree block size as the only factor we can further vary to reduce the depth of the tree. i.e. "m". So we want to increase the filesystem block size it improve the efficiency of our indexing. That improvement in efficiency translates directly into better performance on seek constrained storage subsystems. The problem is this: to alter the fundamental block size of the filesystem we also need to alter the data block size and that is exactly the piece that linux does not support right now. So while we have the capability to use large block sizes in certain filesystems, we can't use that capability until the data path supports it. To summarise, large block size support in the filesystem is not about "legacy" issues. It's about trying to cope with the rapid expansion of storage capabilities of modern hardware where we have to index much, much more data with a corresponding decrease in the seek capability of the hardware. > So get your stories straight, people. Ok, so let's set the record straight. There were 3 justifications for using *large pages* to *support* large filesystem block sizes The justifications for the variable order page cache with large pages were: 1. little code change needed in the filesystems -> still true 2. Increased I/O sizes on 4k page machines (the "SCSI controller problem") -> redundant thanks to Jens Axboe's quick work 3. avoiding the need for vmap() as it has great overhead and does not scale -> Nick is starting to work on that and has already had good results. Everyone seems to be focussing on #2 as the entire justification for large block sizes in filesystems and that this is an "SGI" problem. Nothing could be further from the truth - the truth is that large pages solved multiple problems in one go. We now have a different, better solution #2, so please, please stop using that as some justification for claiming filesystems don't need large block sizes. However, all this doesn't change the fact that we have a major storage scalability crunch coming in the next few years. Disk capacity is likely to continue to double every 12 months for the next 3 or 4 years. Large block size support is only one mechanism we need to help cope with this trend. The variable order page cache with large pages was a means to an end - it's not the only solution to this problem and I'm extremely happy to see that there is progress on multiple fronts. That's the strength of the Linux community showing through. In the end, I really don't care how we end up supporting large filesystem block sizes in the page cache - all I care about is that we end up supporting it as efficiently and generically as we possibly can. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html