On Wed, May 27, 2009 at 12:06:12PM +0800, KOSAKI Motohiro wrote: > > > Ah. So it's likely to be some strange interaction with the RAID setup. > > > > The normal case is, if page N become uptodate at time T(N), then > > T(N) <= T(N+1) holds. But for RAID, the data arrival time depends on > > runtime status of individual disks, which breaks that formula. So > > in do_generic_file_read(), just after submitting the async readahead IO > > request, the current page may well be uptodate, so the page won't be locked, > > and the block device won't be implicitly unplugged: > > Hifumi-san, Can you get blktrace data and confirm Wu's assumption? To make the reasoning more obvious: Assume we just submitted readahead IO request for pages N ~ N+M, then T(N) <= T(N+1) T(N) <= T(N+2) T(N) <= T(N+3) ... T(N) <= T(N+M) (M = readahead size) So if the reader is going to block on any page in the above chunk, it is going to first block on page N. With RAID (and NFS to some degree), there is no strict ordering, so the reader is more likely to block on some random pages. In the first case, the effective async_size = M, in the second case, the effective async_size <= M. The more async_size, the more degree of readahead pipeline, hence the more low level IO latencies are hidden to the application. Thanks, Fengguang > > > > > if (PageReadahead(page)) > > page_cache_async_readahead() > > if (!PageUptodate(page)) > > goto page_not_up_to_date; > > //... > > page_not_up_to_date: > > lock_page_killable(page); > > > > > > Therefore explicit unplugging can help, so > > > > Acked-by: Wu Fengguang <fengguang.wu@xxxxxxxxx> > > > > The only question is, shall we avoid the double unplug by doing this? > > > > --- > > mm/readahead.c | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > --- linux.orig/mm/readahead.c > > +++ linux/mm/readahead.c > > @@ -490,5 +490,15 @@ page_cache_async_readahead(struct addres > > > > /* do read-ahead */ > > ondemand_readahead(mapping, ra, filp, true, offset, req_size); > > + > > + /* > > + * Normally the current page is !uptodate and lock_page() will be > > + * immediately called to implicitly unplug the device. However this > > + * is not always true for RAID conifgurations, where data arrives > > + * not strictly in their submission order. In this case we need to > > + * explicitly kick off the IO. > > + */ > > + if (PageUptodate(page)) > > + blk_run_backing_dev(mapping->backing_dev_info, NULL); > > } > > EXPORT_SYMBOL_GPL(page_cache_async_readahead); > > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html