Hi, Wu.

On Mon, Jan 4, 2010 at 1:50 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
>
> POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
> performance: a 16K read will be carried out in 4 _sync_ 1-page reads.
>
> In other places, ra_pages==0 means
> - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
> - some IO error happened
> where multi-page read IO won't help or should be avoided.
>
> POSIX_FADV_RANDOM actually want a different semantics: to disable the
> *heuristic* readahead algorithm, and to use a dumb one which faithfully
> submit read IO for whatever application requests.
>
> So introduce a flag O_RANDOM for POSIX_FADV_RANDOM.
> It will be visible to fcntl(F_GETFL).
>
> Note that the random hint is not likely to help random reads performance
> noticeably. And it may be too permissive on huge request size (its IO
> size is not limited by read_ahead_kb).
>
> In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
> (NFS read) performance of the application increased by 313%!
>
> v3: use O_RANDOM to indicate both read/write access pattern as in
>     posix_fadvise(), although it only takes effect for read() now
>     (proposed by Quentin)
> v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)
>
> CC: Nick Piggin <npiggin@xxxxxxx>
> CC: Andi Kleen <andi@xxxxxxxxxxxxxx>
> CC: Steven Whitehouse <swhiteho@xxxxxxxxxx>
> CC: David Howells <dhowells@xxxxxxxxxx>
> CC: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> CC: Jonathan Corbet <corbet@xxxxxxx>
> CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Tested-by: Quentin Barnes <qbarnes+nfs@xxxxxxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
>  include/asm-generic/fcntl.h |    4 ++++
>  mm/fadvise.c                |   10 +++++++++-
>  mm/readahead.c              |    6 ++++++
>  3 files changed, 19 insertions(+), 1 deletion(-)
>
> --- linux.orig/include/asm-generic/fcntl.h	2010-01-04 12:39:29.000000000 +0800
> +++ linux/include/asm-generic/fcntl.h	2010-01-04 12:40:11.000000000 +0800
> @@ -80,6 +80,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>
> +#ifndef O_RANDOM
> +#define O_RANDOM	010000000	/* random access pattern hint */
> +#endif
> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> --- linux.orig/mm/fadvise.c	2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/fadvise.c	2010-01-04 12:39:30.000000000 +0800
> @@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
>  	switch (advice) {
>  	case POSIX_FADV_NORMAL:
>  		file->f_ra.ra_pages = bdi->ra_pages;
> +		spin_lock(&file->f_lock);
> +		file->f_flags &= ~O_RANDOM;
> +		spin_unlock(&file->f_lock);
>  		break;
>  	case POSIX_FADV_RANDOM:
> -		file->f_ra.ra_pages = 0;
> +		spin_lock(&file->f_lock);
> +		file->f_flags |= O_RANDOM;
> +		spin_unlock(&file->f_lock);
>  		break;
>  	case POSIX_FADV_SEQUENTIAL:
>  		file->f_ra.ra_pages = bdi->ra_pages * 2;
> +		spin_lock(&file->f_lock);
> +		file->f_flags &= ~O_RANDOM;
> +		spin_unlock(&file->f_lock);
>  		break;
>  	case POSIX_FADV_WILLNEED:
>  		if (!mapping->a_ops->readpage) {
> --- linux.orig/mm/readahead.c	2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/readahead.c	2010-01-04 12:39:30.000000000 +0800
> @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
>  	if (!ra->ra_pages)
>  		return;
>
> +	/* be dumb */
> +	if (filp->f_flags & O_RANDOM) {
> +		force_page_cache_readahead(mapping, filp, offset, req_size);
> +		return;
> +	}
> +

Let me ask a dumb question. :)

How about testing O_RANDOM before the ra_pages check?
My point is that even when readahead is turned off, it would be better
to read the contiguous blocks all at once than to have the readpage()
callback do I/O one page at a time.

Would that break some semantics or cause a problem in ondemand readahead?

>  	/* do read-ahead */
>  	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }

-- 
Kind regards,
Minchan Kim
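
The ordering question raised above can be modelled in userspace. The sketch below is not kernel code: `struct model_file`, `enum ra_path`, `MODEL_O_RANDOM` and both function names are hypothetical stand-ins invented for illustration. It only mirrors the two checks in page_cache_sync_readahead() and reports which path a request would take under each ordering.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the O_RANDOM flag proposed in the patch. */
#define MODEL_O_RANDOM 010000000

/* Which path a synchronous readahead request takes in this model. */
enum ra_path {
	RA_NONE,	/* early return; readpage() does I/O one page at a time */
	RA_FORCED,	/* force_page_cache_readahead(): read what was requested */
	RA_ONDEMAND	/* ondemand_readahead(): heuristic readahead */
};

/* Only the two fields the checks look at (assumed simplification). */
struct model_file {
	unsigned int f_flags;
	unsigned long ra_pages;
};

/* Order as posted in the patch: ra_pages is tested first. */
static enum ra_path sync_readahead_as_posted(const struct model_file *f)
{
	assert(f != NULL);
	if (!f->ra_pages)
		return RA_NONE;
	if (f->f_flags & MODEL_O_RANDOM)
		return RA_FORCED;
	return RA_ONDEMAND;
}

/* Order asked about in the reply: O_RANDOM is tested first. */
static enum ra_path sync_readahead_flag_first(const struct model_file *f)
{
	assert(f != NULL);
	if (f->f_flags & MODEL_O_RANDOM)
		return RA_FORCED;
	if (!f->ra_pages)
		return RA_NONE;
	return RA_ONDEMAND;
}
```

In this model the two orderings differ only when O_RANDOM is set and ra_pages is 0. Whether forcing a multi-page read is safe in that case (for example after an IO error has cleared ra_pages) is exactly the open question, and the model does not answer it.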