On Mon, Aug 11, 2014 at 04:10:30PM -0400, Theodore Ts'o wrote:
> On Mon, Aug 11, 2014 at 11:55:32AM -0700, Darrick J. Wong wrote:
> > I was expecting 16 groups (32M readahead) to win, but as the observations in my
> > spreadsheet show, 2MB tends to win.  I _think_ the reason is that if we
> > encounter indirect map blocks or ETB blocks, they tend to be fairly close to
> > the file blocks in the block group, and if we're trying to do a large readahead
> > at the same time, we end up with a largeish seek penalty (half the flexbg on
> > average) for every ETB/map block.
>
> Hmm, that might be an argument for not trying to increase the flex_bg
> size, since we want to keep seek distances within a flex_bg to be
> dominated by settling time, and not by the track-to-track
> acceleration/coasting/deceleration time.

It might not be too horrible of a regression, since the distance between tracks
has gotten shorter and cylinders themselves have gotten bigger.  I suppose
you'd have to test a variety of flexbg sizes against a disk from, say, 5 years
ago.  If you know the size of the files you'll be storing at mkfs time (such as
with the mk_hugefiles.c options) then increasing flexbg size is probably ok to
avoid fragmenting.

But yes, I was sort of enjoying how stuff within a flexbg gets (sort of) faster
as disks get bigger. :)

> > I figured out what was going on with the 1TB SSD -- it has a huge RAM cache big
> > enough to store most of the metadata.  At that point, reads are essentially
> > free, but readahead costs us ~1ms per fadvise call.
>
> Do we understand why fadvise() takes 1ms?  Is that something we can fix?
>
> And readahead(2) was even worse, right?

From the readahead(2) manpage:

"readahead() blocks until the specified data has been read."

The fadvise time is pretty consistently 1ms, but with readahead you have to
wait for it to read everything off the disk.
That's fine for threaded readahead, but for our single-thread readahead it's
not much better than regular blocking reads.  Letting the kernel do the
readahead in the background is way faster.

I don't know why fadvise takes so long.  I'll ftrace it to see where it goes.

--D

>
> - Ted
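
For anyone who wants to reproduce the per-call cost discussed above, here is a
rough sketch, not the code used for the measurements in this thread.  It
assumes Linux and Python 3 (os.posix_fadvise is not available everywhere).
The compare() helper and the result keys are made up for illustration.  The
2MB window is simply the size that tended to win in the spreadsheet.  It times
a pass of POSIX_FADV_WILLNEED hints, which ask the kernel to read in the
background, and then a pass of ordinary blocking reads over the same ranges.

```python
import os
import tempfile
import time

CHUNK = 2 * 1024 * 1024  # assumed window: the 2MB size that tended to win


def compare(path, chunk=CHUNK):
    """Time WILLNEED hints, then blocking reads, over one file.

    Hypothetical helper for illustration only.
    """
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    advise_total = 0.0
    read_total = 0.0
    bytes_read = 0
    calls = 0
    try:
        # Pass 1: hint each window; the call is expected to return without
        # waiting for the data, so this measures per-call overhead.
        for offset in range(0, size, chunk):
            length = min(chunk, size - offset)
            start = time.perf_counter()
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            advise_total += time.perf_counter() - start
            calls += 1
        # Pass 2: blocking reads of the same windows.  These may already be
        # cached because of pass 1, so this is not a cold-cache comparison.
        for offset in range(0, size, chunk):
            length = min(chunk, size - offset)
            start = time.perf_counter()
            data = os.pread(fd, length, offset)
            read_total += time.perf_counter() - start
            bytes_read += len(data)
    finally:
        os.close(fd)
    return {
        "calls": calls,
        "advise_total": advise_total,
        "read_total": read_total,
        "bytes_read": bytes_read,
    }


if __name__ == "__main__":
    # Demo on a scratch file; a freshly written file is usually still in the
    # page cache, so point compare() at a real cold file for useful numbers.
    with tempfile.NamedTemporaryFile(delete=False) as scratch:
        scratch.write(b"\0" * (5 * 1024 * 1024))
    try:
        result = compare(scratch.name)
        per_call_ms = 1000.0 * result["advise_total"] / result["calls"]
        print("fadvise calls: %d, avg %.3f ms each" % (result["calls"], per_call_ms))
        print("blocking reads: %.3f ms total" % (1000.0 * result["read_total"]))
    finally:
        os.unlink(scratch.name)
```

To get numbers that mean anything, run it against a file that is not cached
(for example after dropping caches) and on the device you care about.  Python
has no wrapper for readahead(2), so this sketch does not cover that call.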