On Tue 24-01-12 15:59:02, Jeff Moyer wrote:
> Jan Kara <jack@xxxxxxx> writes:
>
> > On Tue 24-01-12 15:13:40, Jeff Moyer wrote:
> >> Jan Kara <jack@xxxxxxx> writes:
> >>
> >> > On Tue 24-01-12 14:14:14, Jeff Moyer wrote:
> >> >> Chris Mason <chris.mason@xxxxxxxxxx> writes:
> >> >>
> >> >> >> All three filesystems use the generic mpages code for reads, so they
> >> >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up
> >> >> >> ASAP.
> >> >> >
> >> >> > Can you easily run btrfs through the same rig? We don't use mpages and
> >> >> > I'm curious.
> >> >>
> >> >> The readahead code was to blame, here. I wonder if we can change the
> >> >> logic there to not break larger I/Os down into smaller sized ones.
> >> >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os,
> >> >> when 128KB is the read_ahead_kb value. Is there any heuristic you could
> >> >> apply to not break larger I/Os up like this? Does that make sense?
> >> > Well, not breaking up I/Os would be fairly simple as ondemand_readahead()
> >> > already knows how much do we want to read. We just trim the submitted I/O to
> >> > read_ahead_kb artificially. And that is done so that you don't trash page
> >> > cache (possibly evicting pages you have not yet copied to userspace) when
> >> > there are several processes doing large reads.
> >>
> >> Do you really think applications issue large reads and then don't use
> >> the data? I mean, I've seen some bad programming, so I can believe that
> >> would be the case. Still, I'd like to think it doesn't happen. ;-)
> > No, I meant a cache thrashing problem. Suppose that we always readahead
> > as much as user asks and there are say 100 processes each wanting to read 4
> > MB. Then you need to find 400 MB in the page cache so that all reads can
> > fit. And if you don't have them, reads for process 50 may evict pages we
> > already preread for process 1, but process one didn't yet get to CPU to
> > copy the data to userspace buffer. So the read becomes wasted.
>
> Yeah, you're right, cache thrashing is an issue. In my tests, I didn't
> actually see the *initial* read come through as a full 1MB I/O, though.
> That seems odd to me.
At first sight yes. But buffered reading internally works page-by-page
so what it does is that it looks at the first page it wants, sees we don't
have that in memory, so we submit readahead (hence 128 KB request) and then
wait for that page to become uptodate. Then, when we are coming to the end
of preread window (trip over marked page), we submit another chunk of
readahead...

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
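
The page-by-page behaviour described in the reply above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not the kernel's ondemand_readahead() logic: the window has a fixed size equal to read_ahead_kb, the marked page is assumed to be the last page of the current window, the file is assumed to be exactly as large as the read, and the page size is assumed to be 4 KB. The names simulate_buffered_read, PAGE_KB and submit_readahead are hypothetical and exist only for this illustration.

```python
# Toy model of a sequential buffered read; NOT the kernel implementation.
# Assumptions (all hypothetical): 4 KB pages, a fixed readahead window,
# the marked page is the last page of the window, file size == read size.

PAGE_KB = 4


def simulate_buffered_read(read_kb, read_ahead_kb=128):
    """Return the sizes in KB of the I/Os submitted for one sequential read."""
    total_pages = read_kb // PAGE_KB
    window_pages = read_ahead_kb // PAGE_KB
    cached_until = 0     # pages [0, cached_until) have been submitted
    marked_page = None   # page whose access triggers the next readahead
    io_sizes_kb = []

    def submit_readahead():
        nonlocal cached_until, marked_page
        # Never read past the end of the (assumed) file.
        count = min(window_pages, total_pages - cached_until)
        if count <= 0:
            marked_page = None
            return
        io_sizes_kb.append(count * PAGE_KB)
        cached_until += count
        marked_page = cached_until - 1

    # The reader walks the request one page at a time.
    for page in range(total_pages):
        if page >= cached_until:
            # Page not in memory: submit readahead, then wait for the page.
            submit_readahead()
        elif page == marked_page:
            # Tripped over the marked page: submit the next chunk.
            submit_readahead()
        # The page would be copied to the userspace buffer here.

    return io_sizes_kb


if __name__ == "__main__":
    # A 1 MB read with a 128 KB window, as in the dd bs=1M case.
    print(simulate_buffered_read(1024))
```

In this model a 1 MB read with a 128 KB window is issued as eight 128 KB requests, because each submission is capped at the window size no matter how much the caller asked for. Raising the window to 1024 KB in the model gives a single 1 MB request.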