Jan Kara <jack@xxxxxxx> writes:

> On Tue 24-01-12 15:13:40, Jeff Moyer wrote:
>> Jan Kara <jack@xxxxxxx> writes:
>>
>> > On Tue 24-01-12 14:14:14, Jeff Moyer wrote:
>> >> Chris Mason <chris.mason@xxxxxxxxxx> writes:
>> >>
>> >> >> All three filesystems use the generic mpages code for reads, so
>> >> >> they all get the same (bad) I/O patterns. Looks like we need to
>> >> >> fix this up ASAP.
>> >> >
>> >> > Can you easily run btrfs through the same rig? We don't use
>> >> > mpages and I'm curious.
>> >>
>> >> The readahead code was to blame here. I wonder if we can change the
>> >> logic there to not break larger I/Os down into smaller ones.
>> >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K
>> >> I/Os when read_ahead_kb is 128KB. Is there any heuristic you could
>> >> apply to avoid breaking larger I/Os up like this? Does that make
>> >> sense?
>> > Well, not breaking up I/Os would be fairly simple, as
>> > ondemand_readahead() already knows how much we want to read. We just
>> > trim the submitted I/O to read_ahead_kb artificially. That is done
>> > so that you don't thrash the page cache (possibly evicting pages
>> > that have not yet been copied to userspace) when several processes
>> > are doing large reads.
>>
>> Do you really think applications issue large reads and then don't use
>> the data? I mean, I've seen some bad programming, so I can believe
>> that would be the case. Still, I'd like to think it doesn't happen.
>> ;-)
> No, I meant a cache thrashing problem. Suppose we always read ahead as
> much as the user asks, and there are, say, 100 processes each wanting
> to read 4 MB. Then you need to find 400 MB in the page cache so that
> all the reads can fit. If you don't have that much, reads for process
> 50 may evict pages we already preread for process 1 before process 1
> got CPU time to copy the data to its userspace buffer. So that
> readahead was wasted.

Yeah, you're right, cache thrashing is an issue. In my tests, I didn't
actually see the *initial* read come through as a full 1MB I/O, though.
That seems odd to me.

>> > Maybe 128 KB is too small a default these days, but OTOH no one
>> > prevents you from raising it (e.g. SLES uses 1 MB as the default).
>>
>> For some reason, I thought it had been bumped to 512KB by default.
>> Must be that overactive imagination I have... Anyway, if all of the
>> distros start bumping the default, don't you think it's time to
>> consider bumping it upstream, too? I thought a lot of work was put
>> into not being too aggressive on readahead, so the downside of having
>> a larger read_ahead_kb setting should be fairly small.
> Yeah, I believe 512KB should be pretty safe these days, except in the
> embedded world. OTOH the average desktop user doesn't really care, so
> it's mostly servers with beefy storage that do... (Note that, as I
> wrote, we raised read_ahead_kb for SLES but not for openSUSE or SLED,
> the desktop enterprise distro.)

Fair enough.

Cheers,
Jeff
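P.S. For anyone wanting to reproduce the splitting described above, a
minimal way to watch the request sizes at the block layer is blktrace.
The device name here is just an example; point it at whatever device
holds the test file:

    # trace block-layer requests while the big read runs
    blktrace -d /dev/sdb -o - | blkparse -i - > trace.txt &
    dd if=file of=/dev/null bs=1M

    # with read_ahead_kb=128, the read entries in trace.txt should show
    # 256-sector (128KB) requests rather than 2048-sector (1MB) ones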
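Jan's thrashing scenario can be approximated with a sketch like the one
below (100 concurrent sequential readers over distinct files). Whether
pages actually get evicted before a reader consumes them depends on how
much memory the machine has relative to the combined readahead windows:

    # assumes file1..file100 already exist, each at least 4MB
    for i in $(seq 1 100); do
        dd if=file$i of=/dev/null bs=4M &
    done
    wait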
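For reference, the per-device knob being discussed lives in sysfs and
can also be driven through blockdev, which counts in 512-byte sectors.
Again, sdb is only an example device:

    cat /sys/block/sdb/queue/read_ahead_kb    # usually 128 by default
    echo 1024 > /sys/block/sdb/queue/read_ahead_kb    # SLES-style 1MB

    # equivalent via blockdev, in 512-byte sectors (2048 * 512B = 1MB)
    blockdev --setra 2048 /dev/sdb
    blockdev --getra /dev/sdb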