On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote:
> On Wed, Jan 25, 2012 at 12:37:48PM -0600, James Bottomley wrote:
> > On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote:
> > > > So there are two separate problems mentioned here.  The first is to
> > > > ensure that readahead (RA) pages are treated as more disposable than
> > > > accessed pages under memory pressure and then to derive a statistic for
> > > > futile RA (those pages that were read in but never accessed).
> > > >
> > > > The first sounds really like it's an LRU thing rather than adding yet
> > > > another page flag.  We need a position in the LRU list for never
> > > > accessed ... that way they're first to be evicted as memory pressure
> > > > rises.
> > > >
> > > > The second is you can derive this futile readahead statistic from the
> > > > LRU position of unaccessed pages ... you could keep this globally.
> > > >
> > > > Now the problem: if you trash all unaccessed RA pages first, you end up
> > > > with the situation of, say, playing a movie under moderate memory
> > > > pressure where we do RA, then trash the RA page, then have to re-read it
> > > > to display to the user, resulting in an undesirable uptick in read I/O.
> > > >
> > > > Based on the above, it sounds like a better heuristic would be to evict
> > > > accessed clean pages at the top of the LRU list before unaccessed clean
> > > > pages because the expectation is that the unaccessed clean pages will
> > > > be accessed (that's, after all, why we did the readahead).  As RA pages age
> > > >
> > > Well, the movie example is one case where evicting unaccessed pages may not be the right thing to do. But what about a workload that performs a random one-shot search?
> > > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another.
> >
> > Well not really: RA is always wrong for random reads.  The whole purpose
> > of RA is the assumption of sequential access patterns.
>
> Just to jump back, Jeff's benchmark that started this (on xfs and ext4):
>
> - buffered 1MB reads get down to the scheduler in 128KB chunks
>
> The really hard part about readahead is that you don't know what
> userland wants.  In Jeff's test, he's telling the kernel he wants 1MB
> ios and our RA engine is doing 128KB ios.
>
> We can talk about scaling up how big the RA windows get on their own,
> but if userland asks for 1MB, we don't have to worry about futile RA, we
> just have to make sure we don't oom the box trying to honor 1MB reads
> from 5000 different procs.

Right - if we know the read request is larger than the RA window,
then we should ignore the RA window and just service the request in
a single bio. Well, at least, in chunks as large as the underlying
device will allow us to build....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
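
To make the policy above concrete, here is a rough userspace model in
Python. It is only a sketch, not kernel code: plan_read(), ra_window,
max_io and honor_request are hypothetical names made up for
illustration, and max_io simply stands in for "the largest I/O the
underlying device will let us build". It splits a read into chunks
either by the RA window (what the benchmark observed) or, when the
request is larger than the RA window, by the device limit alone.

```python
KB = 1024


def plan_read(offset, length, ra_window, max_io, honor_request=True):
    """Return a list of (offset, size) chunks for one buffered read.

    All arguments are byte counts. The names are hypothetical and do not
    correspond to any kernel interface.
    """
    if ra_window <= 0 or max_io <= 0:
        raise ValueError("ra_window and max_io must be positive")
    if length <= 0:
        return []

    if honor_request and length > ra_window:
        # Request is larger than the RA window: ignore the window and
        # build chunks as large as the device limit allows.
        chunk = max_io
    else:
        # Otherwise I/O is issued in RA-window-sized pieces, still
        # capped by the device limit.
        chunk = min(ra_window, max_io)

    chunks = []
    pos = offset
    end = offset + length
    while pos < end:
        size = min(chunk, end - pos)
        chunks.append((pos, size))
        pos += size
    return chunks


if __name__ == "__main__":
    # 1MB read, 128KB RA window, assumed 512KB device limit.
    print(len(plan_read(0, 1024 * KB, 128 * KB, 512 * KB, honor_request=False)))
    print(len(plan_read(0, 1024 * KB, 128 * KB, 512 * KB)))
```

With the window-driven behaviour a 1MB read turns into eight 128KB
chunks; with the request honoured and an assumed 512KB device limit it
turns into two. The model deliberately says nothing about memory
pressure, which is the "5000 different procs" concern quoted above.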