On Mon, Aug 11, 2014 at 02:32:58PM -0400, Theodore Ts'o wrote:
> On Mon, Aug 11, 2014 at 11:05:09AM -0700, Darrick J. Wong wrote:
> >
> > Using the bitmap turns out to be pretty quick (~130us to start RA for 4 groups
> > vs. ~70us per group if I issue the RA directly).  Each fadvise call seems to
> > cost us ~1ms, so I'll keep using the bitmap to minimize the number of fadvise
> > calls, since it's also a lot less code.
>
> 4 groups?  Since the default flex_bg size is 16 block groups, I would
> have expected that you would want to start RA every 16 groups.

I was expecting 16 groups (32M readahead) to win, but as the observations in my
spreadsheet show, 2MB tends to win.  I _think_ the reason is that if we
encounter indirect map blocks or ETB blocks, they tend to be fairly close to
the file blocks in the block group, and if we're trying to do a large readahead
at the same time, we end up with a largeish seek penalty (half the flexbg on
average) for every ETB/map block.

I figured out what was going on with the 1TB SSD -- it has a huge RAM cache big
enough to store most of the metadata.  At that point, reads are essentially
free, but readahead costs us ~1ms per fadvise call.  If you use a RA buffer
that's big enough that there aren't many fadvise calls then you still come out
ahead (ditto if you shove the RA into a separate thread) but otherwise the
fadvise calls add up, badly.

Actually, I'd considered using a default of flexbg_size * itable_size, but (a)
the USB results are pretty bad for 32M v. 2M, and (b) I was thinking that 2MB
of readahead might be small enough that we could enable it by default without
having to worry about the mal-effects of parallel e2fsck runs.

A logical next step might be to do ETB/block map readahead, but let's keep it
simple for now.

I should have time to update the spreadsheet to reflect performance of the new
bitmap code while I go mess with fixing the jbd2 problems.
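To make the call-count tradeoff concrete, here is a rough C sketch of the arithmetic. It is not the e2fsck readahead code: the helper names (ra_call_count, ra_overhead_us, ra_issue) are made up for illustration, and the fixed per-call cost is an assumption taken from the ~1ms figure above. The only real API used is posix_fadvise() with POSIX_FADV_WILLNEED.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Hypothetical helper: how many fadvise calls it takes to cover
 * total_bytes of metadata using a readahead window of window_bytes. */
uint64_t ra_call_count(uint64_t total_bytes, uint64_t window_bytes)
{
	assert(window_bytes > 0);
	/* Round up so a partial trailing window still counts as a call. */
	return (total_bytes + window_bytes - 1) / window_bytes;
}

/* Hypothetical cost model: total fadvise overhead in microseconds,
 * ASSUMING every call costs the same per_call_us (e.g. ~1000us). */
uint64_t ra_overhead_us(uint64_t total_bytes, uint64_t window_bytes,
			uint64_t per_call_us)
{
	return ra_call_count(total_bytes, window_bytes) * per_call_us;
}

/* Hypothetical helper: hint the kernel to read [start, start + len)
 * one window at a time.  Returns the number of fadvise calls issued,
 * or -1 if the arguments are bad or any call fails. */
long ra_issue(int fd, off_t start, off_t len, off_t window)
{
	off_t end = start + len;
	long calls = 0;

	if (window <= 0 || len < 0)
		return -1;

	for (off_t off = start; off < end; off += window) {
		off_t chunk = (end - off < window) ? end - off : window;

		/* posix_fadvise returns 0 on success, an errno value on failure. */
		if (posix_fadvise(fd, off, chunk, POSIX_FADV_WILLNEED) != 0)
			return -1;
		calls++;
	}
	return calls;
}
```

Under that model, covering 32MB of inode tables with a 2MB window takes 16 calls (about 16ms of call overhead at 1ms each), while a single 32MB window takes one call.  That only captures the per-call cost; it says nothing about the seek penalty that seems to hurt the larger window on rotating and USB media.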

> (And BTW, I've been wondering whether we should increase the flex_bg
> size for bigger file systems.  By the time we get to 4TB disks, Having
> a flex_bg every 2GB seems a little small.)

:)

--D

>
> - Ted