On Fri, 29 Dec 2006, Andrew Morton wrote: > > Adam Richter spent considerable time a few years ago trying to make the > mpage code go direct-to-BIO in all cases and we eventually gave up. The > conceptual layering of page<->blocks<->bio is pretty clean, and it is hard > and ugly to fully optimise away the "block" bit in the middle. Using the buffer cache as a translation layer to the physical address is fine. That's what _any_ block device will do. I'm not at all sayign that "buffer heads must go away". They work fine. What I'm saying is that - if you index by buffer heads, you're screwed. - if you do IO by starting at buffer heads, you're screwed. Both indexing and writeback decisions should be done at the page cache layer. Then, when you actually need to do IO, you look at the buffers. But you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance. So by all means keep the buffer heads as a way to keep the "virtual->physical" translation. It's what they were designed for. But they were _originally_ also designed for "lookup" and "driving the start of IO", and that is wrong, and has been wrong for a long time now, because - lookup based on physical address is fundamentally slow and inefficient. You have to look up the virtual->physical translation somewhere else, so it's by design an unnecessary indirection _and_ that "somewere else" is also by definition filesystem-specific, so you can't do any of these things at the VFS layer. Ergo: anything that needs to look up the physical address in order to find the buffer head is BROKEN in this day and age. We look up the _virtual_ page cache page, and then we can trivially find the buffer heads within that page thanks to page->buffers. Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. - starting IO based on the physical entity is insane. It's insane exactly _because_ the VM doesn't actually think in physical addresses, or in buffer-sized blocks. The VM only really knows about whole pages, and all the VM decisions fundamentally have to be page-based. We don't ever "free a buffer". We free a whole page, and as such, doing writeback based on buffers is pointless, because it doesn't actually say anything about the "page state" which is what the VM tracks. But neither of these means that "buffer_head" itself has to go away. They both really boil down to the same thing: you should never KEY things by the buffer head. All actions should be based on virtual indexes as far as at all humanly possible. Once you do lookup and locking and writeback _starting_ from the page, it's then easy to look up the actual buffer head within the page, and use that as a way to do the actual _IO_ on the physical address. So the buffer heads still exist in ext2, for example, but they don't drive the show quite as much. (They still do in some areas: the allocation bitmaps, the xattr code etc. But as long as none of those have big VM footprints, and as long as no _common_ operations really care deeply, and as long as those data structures never need to be touched by the VM or VFS layer, nobody will ever really care). The directory case comes up just because "readdir()" actually is very common, and sometimes very slow. And it can have a big VM working set footprint ("find"), so trying to be page-based actually really helps, because it all drives things like writeback on the _right_ issues, and we can do things like LRU's and writeback decisions on the level that really matters. I actually suspect that the inode tables could benefit from being in the page cache too (although I think that the inode buffer address is actually "physical", so there's no indirection for inode tables, which means that the virtual vs physical addressing doesn't matter). For directories, there definitely is a big cost to continually doing the virtual->physical translation all the time. Linus - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html