I don't think we will ever get the default value for this tunable right. I would also worry that avoiding the reclaim_mode for file-backed cache will hurt HPC applications that are dumping their data to disk and depending on the existing default for zone_reclaim_mode to not pollute other nodes. The ideal would be if distribution packages for mail, web servers and others that are heavily IO orientated would prompt for a change to the default value of zone_reclaim_mode in sysctl.
I would argue that there's a lot more mail/web/file servers out there than HPC machines. And HPC machines tend to have a team of people to monitor/tweak them. I think it would be much more sane to default this to 0 which works best for most people, and get the HPC people to change it.
However there's still another question, why is this problem happening at all for us? I know almost nothing about NUMA, but from other posts, it sounds like the problem is the memory allocations are all happening on one node? But I don't understand why that would be happening. The machine runs the cyrus IMAP server, which is a classic unix forking server with 1000's of processes. Each process will mmap lots of different files to access them. Why would that all be happening on one node, not spread around?
One thing is that the machine is vastly more IO loaded than CPU loaded, in fact it uses very little CPU at all (a few % usually). Does the kernel prefer to run processes on one particular node if it's available? So if a machine has very little CPU load, every process will generally end up running on the same node?
Rob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>