On Wed 11-12-13 15:05:22, Andrew Morton wrote:
> On Wed, 11 Dec 2013 23:49:17 +0100 Jan Kara <jack@xxxxxxx> wrote:
> > >  /*
> > > - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > > - * sensible upper limit.
> > > + * max_sane_readahead() is disabled. It can later be removed altogether, but
> > > + * let's keep a skeleton in place for now, in case disabling was the wrong call.
> > >   */
> > >  unsigned long max_sane_readahead(unsigned long nr)
> > >  {
> > > -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > -			+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > +	return nr;
> > >  }
> > >
> > >  /*
> > >
> > > Can anyone see a problem with this?
> >   Well, the downside seems to be that if userspace previously issued
> > MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible
> > size. Now we try to read the whole huge file, which is pretty much
> > guaranteed to be useless (as we'll be pushing out of cache data we just
> > read a while ago). And guessing the right readahead size from userspace
> > isn't trivial, so it would make WILLNEED advice less useful. What do you
> > think?
>
> OK, yes, there is conceivably a back-compatibility issue there. There
> indeed might be applications which decide to chuck the whole thing at
> the kernel and let the kernel work out what is a sensible readahead
> size to perform.
>
> But I'm really struggling to think up an implementation! The current
> code looks only at the caller's node and doesn't seem to make much
> sense. Should we look at all nodes? Hard to say without prior
> knowledge of where those pages will be coming from.
  Well, I believe we might have compatibility issues only on non-NUMA
machines - there the current logic makes sense. On NUMA machines I believe
we are free to do basically anything, because the results of the current
logic are pretty much random.

  Thinking about a proper implementation for NUMA: max_sane_readahead()
really only matters for madvise() and fadvise() calls (standard on-demand
readahead is bounded by bdi->ra_pages, which tends to be pretty low anyway,
like 512K or so). For these calls we will do the reads from the process
issuing the [fm]advise() call, and thus we will allocate pages according to
its NUMA policy. So based on that policy we should be able to come up with
some estimate of the number of available pages, shouldn't we?

BTW, the fact that [fm]advise() calls submit all reads synchronously is
another reason why we should bound the readahead requests to a sensible
size.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
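
To make the idea above concrete, a minimal sketch of such a NUMA-aware bound
might look like the code below. This is illustration only, not a patch from
this thread: the helper name numa_sane_readahead() is made up, and
cpuset_current_mems_allowed is used merely as a stand-in for "the set of
nodes the caller's NUMA policy allows"; a real implementation would also
have to consult the task's mempolicy.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/cpuset.h>

/*
 * Hypothetical sketch: cap a readahead request at half of the
 * inactive-file plus free pages summed over the nodes the calling task
 * is currently allowed to allocate from, rather than looking at the
 * local node alone as the removed code did.
 */
static unsigned long numa_sane_readahead(unsigned long nr)
{
	unsigned long pages = 0;
	int nid;

	for_each_node_mask(nid, cpuset_current_mems_allowed)
		pages += node_page_state(nid, NR_INACTIVE_FILE) +
			 node_page_state(nid, NR_FREE_PAGES);

	return min(nr, pages / 2);
}

The halving mirrors the heuristic the quoted patch removes, just applied
across all nodes the caller may allocate from instead of numa_node_id()
only.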