On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > Interesting. The code *has* the block size (it's in "bh->b_size"), but > it actually then uses the inode blocksize instead, and verifies the > two against each other. It could just have used the block size > directly (and then used the inode i_blkbits only when no buffers > existed), avoiding that dependency entirely.. Looking more at this code, that really would be the nicest solution. There's two cases for the whole get_block() thing: - filesystems. The block size will not change randomly, and "get_block()" seriously depends on the block size. - the raw device. The block size *will* change, but to simplify the problem, "get_block()" is a 1:1 mapping, so it doesn't even care about the block size because it will always return "bh->b_blocknr = nr". So we *could* just say that all the fs/buffer.c code should use "inode->i_blkbits" for creating buffers (because that's the size new buffers should always use), but use "bh->b_size" for any *existing* buffer use. And looking at it, it's even simple. Except for one *very* annoying thing: several users really don't want the size of the buffer, they really do want the *shift* of the buffer size. In fact, that single issue seems to be the reason why "inode->i_blkbits" is really used in fs/buffer.c. Otherwise it would be fairly trivial to just make the pattern be just a simple if (!page_has_buffers(page)) create_empty_buffers(page, 1 << inode->i_blkbits, 0); head = page_buffers(page); blocksize = head->b_size; and just use the blocksize that way, without any other games. All done, no silly WARN_ON() to verify against some global block-size, and the fs/buffer.c code would be perfectly simple, and would have no problem at all with multiple different blocksizes in different pages (the page lock serializes the buffers and thus the blocksize at the per-page level). But the fact that the code wants to do things like block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); seriously seems to be the main thing that keeps us using 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly enough to hurt (not everywhere, but on some machines). Very annoying. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html