On Tue, 28 Jan 2014, Linus Torvalds wrote:

> On Tue, Jan 28, 2014 at 3:11 PM, Mikulas Patocka
> <mikulas@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > HPFS needs to load 4 consecutive 512-byte sectors when accessing the
> > directory nodes or bitmaps. We can't switch to 2048-byte block size
> > because files are allocated in the units of 512-byte sectors.
>
> Bah, this is untrue.
>
> Well, it's true that you cannot *switch* to another size, but the
> buffer head layer should be perfectly happy with mixed sizes within a
> device, even if nobody happens to do it.
>
> Just allocate a whole page, and make *that* page use 2048-byte buffers.
>
> So you should be perfectly able to just do
>
>         struct buffer_head *bh = __bread(dev, nr, 2048);
>
> which gets and reads a single 2048-byte buffer head.
>
> Now, the problem is that because nobody actually does this, I bet we
> have bugs in this area, and some path ends up using
> bd_inode->i_blkbits instead of the passed-in size. A very quick look
> finds __find_get_block -> __find_get_block_slow() looking bad, for
> example.
>
> But I also bet that that should be easy to fix. In fact, I think the
> only reason we use "i_blkbits" there is because it avoids a division
> (and nobody had a *reason* to do it), but since this is the "we have
> to do IO" path, just passing in the size and then using a
> "sector_div()" is a no-brainer from a performance standpoint, I
> think. So fixing that problem looks like a couple of lines.
>
> Now, another issue is that with multiple block sizes, it's up to the
> filesystem to then guarantee that there isn't aliasing between two
> physical blocks (eg say "2048b sized block at offset 1" vs "512b
> buffer-head at offset 4"). But if the aliasing is fairly naturally
> avoided at the FS level (and if this is done only for particular
> parts of the filesystem, that should be very natural), that shouldn't
> be a problem either.
>
> So I'd actually much rather see us taking advantage of multiple
> buffer sizes and use a *native* 2k buffer-head when it makes sense,
> than this odd "let's allocate them, and then maybe they are all
> properly aligned anyway" kind of voodoo programming.
>
> Would you be willing to try?
>
>                 Linus

The page cache doesn't handle buffers of different sizes on one page.
HPFS has some 2kB structures (dnodes, bitmaps) and some 512-byte
structures (fnodes, anodes), so we can have a 4kB page that contains
one 2kB dnode and four 512-byte anodes or fnodes. That layout is
impossible to create with create_empty_buffers. (And you can't access
the 512-byte structures through 2kB buffers, because that could trash
file data lying in the same 2kB area.)

__find_get_block_slow walks all the buffers on the page and finds the
one that matches; that is fixable. But I have no clue how to replace
create_empty_buffers. How much other code depends on the fact that the
buffers on a page completely cover it and all have the same size?

There is a general problem - HPFS has a very small user base, and no
one is really stress-testing it. I'm not stress-testing it either; I
have some data on HPFS, but I'm not running any intensive parallel
workload on it. So even if we hacked the page cache to support
different-size buffers on a single page, there would likely be bugs in
it, and no one would find them for a long time because of the tiny
HPFS user base. I think it would be better to try this approach first
on some widely used filesystem; once it is debugged, I can switch HPFS
over to it.
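For concreteness, here is roughly what Linus is proposing, written as a
hypothetical HPFS helper (hpfs_bread_dnode is a made-up name, not
existing code, and it assumes the lookup paths are fixed to honor the
passed-in size):

	/* Hypothetical helper, not current HPFS code: read one dnode
	 * as a single native 2048-byte buffer head. Assumes
	 * __find_get_block_slow() and friends use the passed-in size
	 * instead of bd_inode->i_blkbits. */
	static struct buffer_head *hpfs_bread_dnode(struct super_block *s,
						    unsigned secno)
	{
		/* __bread() takes the block number in units of "size",
		 * so convert the 512-byte sector number to a 2048-byte
		 * block number; dnodes are 2048-byte aligned on HPFS,
		 * so the shift loses nothing. */
		return __bread(s->s_bdev, (sector_t)secno >> 2, 2048);
	}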
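And the __find_get_block_slow change he describes would, as far as I
can see, mostly be a matter of computing the page index from the
caller's size instead of from i_blkbits. A sketch (block_to_page_index
is a made-up name, only to show the computation):

	/* block * size is the byte offset of the buffer on the
	 * device; shifting by PAGE_CACHE_SHIFT gives the page that
	 * holds it. The multiplication also avoids the division that
	 * using i_blkbits was there to avoid. */
	static pgoff_t block_to_page_index(sector_t block, unsigned size)
	{
		return (pgoff_t)((block * size) >> PAGE_CACHE_SHIFT);
	}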
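To show why create_empty_buffers can't express the HPFS layout: it
takes a single block size and covers the whole page with equal-sized
buffers, so for a 4kB page the only choices are, for example:

	create_empty_buffers(page, 512, 0);	/* eight 512-byte buffers */

or

	create_empty_buffers(page, 2048, 0);	/* two 2048-byte buffers */

whereas the mixed page described above would need one 2048-byte buffer
followed by four 512-byte buffers, and there is no interface for that.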
Mikulas