On Thu, Jul 03, 2014 at 02:58:54PM -0700, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 12:43 PM, John Stoffel <john@xxxxxxxxxxx> wrote:
> >
> > This is one of those perennial questions of how to tune this. I agree
> > we should increase the number, but shouldn't it be based on both the
> > amount of memory in the machine, number of devices (or is it all just
> > one big pool?) and the speed of the actual device doing readahead?
>
> Sure. But I don't trust the throughput data for the backing device at
> all, especially early at boot. We're supposed to work it out for
> writeback over time (based on device contention etc), but I never saw
> that working, and for reading I don't think we have even any code to
> do so.
>
> And trying to be clever and basing the read-ahead size on the node
> memory size was what caused problems to begin with (with memory-less
> nodes) that then made us just hardcode the maximum.
>
> So there are certainly better options - in theory. In practice, I
> think we don't really care enough, and the better options are
> questionably implementable.
>
> I _suspect_ the right number is in that 2-8MB range, and I would
> prefer to keep it at the low end at least until somebody really has
> numbers (and preferably from different real-life situations).
>
> I also suspect that read-ahead is less of an issue with non-rotational
> storage in general, since the only real reason for it tends to be
> latency reduction (particularly the "readahead()" kind of big-hammer
> thing that is really just useful for priming caches). So there's some
> argument to say that it's getting less and less important.

I believe you're right, but we did break the expectation for
deliberately issued readaheads, and that is what I believe the
reporter was (poorly) complaining about in the bugzilla ticket.

We recently got the following report:
https://bugzilla.redhat.com/show_bug.cgi?id=1103240
which is pretty much the same thing reported at the kernel's BZ.

I ran some empirical tests with iozone (forcing MADV_WILLNEED
behaviour) and double-checked the numbers our performance team got
from their regression runs, and I honestly could not see any change,
for better or worse, that could be directly related to the change in
question. Other than the hard 2MB ceiling now imposed on any issued
readahead, which certain users might see as trouble, there seems to
be no other measurable loss here. (A minimal sketch of the
"deliberate readahead" case is below, after my signature.)

OTOH, the tangible gain after the change is that readahead now works
on NUMA layouts where some CPUs sit within a memoryless node.

I believe we could take David's (Rientjes) early suggestion and,
instead of keeping a hard limit in max_sane_readahead(), replace the
numa_node_id() calls with numa_mem_id() and follow up on the
CONFIG_HAVE_MEMORYLESS_NODES requirements on PPC to have it working
properly (which seems to be the reason that approach was left aside).
A rough sketch of that change is also below.

Best regards,
-- Rafael
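
P.S.: for reference, a minimal sketch of the "deliberately issued
readahead" case the reporter is hitting (untested, error handling
omitted, and the file path is just a placeholder). If I read the code
right, both calls below funnel into force_page_cache_readahead() and
are therefore clamped by max_sane_readahead(), so an oversized
request gets silently truncated rather than failed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	int fd = open("/path/to/some/big/file", O_RDONLY);	/* placeholder */

	fstat(fd, &st);

	/* the "big-hammer" readahead() cache priming Linus mentions */
	readahead(fd, 0, st.st_size);

	/* the MADV_WILLNEED flavour I forced with iozone */
	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	madvise(map, st.st_size, MADV_WILLNEED);

	munmap(map, st.st_size);
	close(fd);
	return 0;
}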
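
P.P.S.: and a rough, untested sketch of what David's suggestion would
amount to, based on the pre-3.15 max_sane_readahead() that the
hardcoded limit replaced (an illustration of the idea, not a patch):

/*
 * Replace numa_node_id() with numa_mem_id() so that a CPU sitting in
 * a memoryless node uses its nearest node that actually has memory,
 * instead of computing a zero-sized readahead window.  Depends on
 * the CONFIG_HAVE_MEMORYLESS_NODES plumbing mentioned above for PPC.
 */
unsigned long max_sane_readahead(unsigned long nr)
{
	int nid = numa_mem_id();	/* was: numa_node_id() */

	return min(nr, (node_page_state(nid, NR_INACTIVE_FILE)
			+ node_page_state(nid, NR_FREE_PAGES)) / 2);
}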