On Tue, Apr 8, 2014 at 10:17 AM, Christoph Lameter <cl@xxxxxxxxx> wrote:
> Another solution here would be to increase the threshold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.

Well, as Josh quite rightly said, the hit from accessing remote memory is never going to be as large as the hit from disk. If and when there is a machine where remote memory is more expensive to access than disk, that's a good argument for zone_reclaim_mode. But I don't believe that's anywhere close to being true today, even on an 8-socket machine with an SSD.

Now, perhaps the fear is that if we access that remote memory *repeatedly* the aggregate cost will exceed what it would have cost to fault that page into the local node just once. But it takes a lot of accesses for that to be true, and most of the time you won't get them. Even if you do, I bet many workloads will prefer even performance across all the accesses over a very slow first access followed by slightly faster subsequent accesses.

In an ideal world, the kernel would put the hottest pages on the local node and the less-hot pages on remote nodes, moving pages around as the workload shifts. In practice, that's probably pretty hard. Fortunately, it's not nearly as important as making sure we don't unnecessarily hit the disk, which is infinitely slower than any memory bank.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company