On 2011-02-25, at 2:15 AM, Rogier Wolff wrote:
> I must say I haven't read all of the large amounts of text in this
> discussion.

We don't write it to be read, just for fun :-).

> But what I understand is that you're suggesting that we implement
> larger blocksizes on the device, while we have to maintain towards the
> rest of the kernel that the blocksize is no larger than 4k, because
> the kernel can't handle that.
>
> Part of the reasoning why this should be like this comes from the
> assumption that each block group has just one block's worth of bitmap.
> That is IMHO the "outdated" assumption that needs to go.

What you are suggesting is a feature called "flex_bg", which is already implemented in ext4, and is why I referenced it in my email.

> Then, especially on filesystems where many large files live, we can
> emulate the "larger blocksize" at the filesystem level: we always
> allocate 256 blocks in one go! This is something that can be
> dynamically adjusted: you might stop doing this for the last 10% of
> free disk space.

That's exactly what I wrote.

> Now, you might say: how does this help with the performance problems
> mentioned in the introduction? Well, reading 16 block bitmaps from 16
> block groups will cost a modern hard drive on average 16 * (7ms avg
> seek + 4.1ms avg rotational latency + 0.04ms transfer time), or about
> 178ms.

That is the time to load the bitmaps on a non-flex_bg filesystem, which is the default for ext3-formatted filesystems.

> Reading 16 block bitmaps from ONE block group will cost a modern
> hard drive on average: 7ms avg seek + 4.1ms rot + 16*0.04ms xfer =
> 11.2ms. That is an improvement of a factor of over 15...

That is possible with flex_bg and a flex_bg factor of 16. That said, I don't think the kernel explicitly fetches all 16 bitmaps today, though it may get the benefit of a track cache on the disk.

I think the correct number above is actually 11.8ms, not 11.2ms.
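The access-time arithmetic above can be sanity-checked with a few lines of Python, using the same drive figures quoted in the thread (7 ms average seek, 4.1 ms average rotational latency, 0.04 ms per-bitmap transfer; these are illustrative 2011-era numbers, not measurements):

```python
# Back-of-the-envelope disk access times for reading 16 block bitmaps,
# using the figures quoted in the thread.
SEEK_MS = 7.0    # average seek time
ROT_MS = 4.1     # average rotational latency
XFER_MS = 0.04   # transfer time for one 4 KiB bitmap block
NBITMAPS = 16

# Non-flex_bg layout: each bitmap lives in its own block group, so
# every read pays a full seek plus rotational latency.
scattered = NBITMAPS * (SEEK_MS + ROT_MS + XFER_MS)

# flex_bg with a factor of 16: all 16 bitmaps are stored contiguously,
# so a single seek is amortized over all of them.
packed = SEEK_MS + ROT_MS + NBITMAPS * XFER_MS

print(f"scattered bitmaps: {scattered:.2f} ms")  # ~178 ms
print(f"packed bitmaps:    {packed:.2f} ms")     # ~11.7 ms
print(f"speedup:           {scattered / packed:.1f}x")
```

This gives 178.24 ms for the scattered case (hence "about 178 ms" rather than 170 ms) and 11.74 ms for the packed case, a factor of roughly 15, matching the conclusion in the quoted text.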
In comparison, Ted's proposal would have an average access time of 7ms avg seek + 4.1ms rot + 0.04ms xfer = 11.14ms, which is not a significant savings.

> Now, whenever you allocate blocks for a file, just zap 256 bits at
> once! Again, the overhead of handling 255 more bits in memory is
> trivial.
>
> I now see that Andreas already suggested something similar but still
> different.

I'm not quite sure how your proposal is different, once you understand what a flex_bg is.

> Anyway, advantages that I see:
>
> - the performance benefits sought for.
>
> - a more sensible number of block groups on filesystems (my 3T
>   filesystem has 21000 block groups!)
>
> - the option of storing lots of small files without having to make
>   a fs-creation-time choice.
>
> - the option of improving defrag to "make things perfect". (The
>   allocation strategy may be: big files go in big-files-only block
>   groups and their tails go in small-files-only block groups. Or, if
>   you think big files may grow, tails go in big-files-only block
>   groups. Whatever you choose, defrag may clean up a frag point
>   and/or some unallocated space when, after a while, it's clear that
>   a big file will no longer grow and is just an archive.)
>
> Roger.
>
> On Fri, Feb 25, 2011 at 01:21:58AM -0700, Andreas Dilger wrote:
>> On 2011-02-24, at 7:56 PM, Theodore Ts'o wrote:
>>> = Problem statement =
>
> --
> ** R.E.Wolff@xxxxxxxxxxxx ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> ** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing.
> --------- Adapted from lxrbot FAQ

Cheers,
Andreas
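As a footnote on the "my 3T filesystem has 21000 block groups" remark above: the count follows directly from the one-bitmap-block-per-group layout. A quick sketch (assuming 4 KiB blocks and a nominal decimal-terabyte drive; the exact count depends on the drive's actual usable size):

```python
# With 4 KiB blocks, one block of bitmap covers 4096 * 8 = 32768
# blocks, so each block group spans 32768 * 4096 bytes = 128 MiB.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = BLOCK_SIZE * 8           # 32768 blocks per group
GROUP_BYTES = BLOCKS_PER_GROUP * BLOCK_SIZE # 128 MiB per group

fs_bytes = 3 * 10**12  # a nominal "3 TB" drive (decimal terabytes)
groups = fs_bytes // GROUP_BYTES
print(groups)  # ~22000 groups, the same ballpark as the 21000 quoted
```

A larger flex_bg factor does not change this count (the groups still exist on disk); it only packs their metadata together so that reading many bitmaps costs one seek instead of many.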