On Tue, Apr 08, 2014 at 05:58:21PM -0500, Christoph Lameter wrote:
> On Tue, 8 Apr 2014, Robert Haas wrote:
>
> > Well, as Josh quite rightly said, the hit from accessing remote memory
> > is never going to be as large as the hit from disk.  If and when there
> > is a machine where remote memory is more expensive to access than
> > disk, that's a good argument for zone_reclaim_mode.  But I don't
> > believe that's anywhere close to being true today, even on an 8-socket
> > machine with an SSD.
>
> I am not sure how disk figures into this?
>

It's a matter of perspective. Those running file servers, databases and
the like don't see the remote accesses. They see their page cache getting
reclaimed, but not all of those users understand why because they are not
NUMA aware. This is why they see the cost of zone_reclaim_mode as
IO-related.

I think pretty much 100% of the bug reports I've seen related to
zone_reclaim_mode were due to IO-intensive workloads and the user not
recognising why page cache was getting reclaimed aggressively.

> The tradeoff is zone reclaim vs. the aggregate performance
> degradation of the remote memory accesses. That depends on the
> cacheability of the app and the scale of memory accesses.
>

For HPC, yes.

> The reason that zone reclaim is on by default is that off node accesses
> are a big performance hit on large scale NUMA systems (like ScaleMP and
> SGI). Zone reclaim was written *because* those systems experienced severe
> performance degradation.
>

Yes, this is understood. However, those same people already know how to
use cpusets and NUMA bindings, and how to tune their workload to partition
it into the nodes. From a NUMA perspective they are relatively
sophisticated and know how and when to set zone_reclaim_mode. At least on
any bug report I've seen related to these really large machines, they were
already using cpusets.

This is why I think the default for zone_reclaim should now be off:
it helps the common case.

> On the tightly coupled 4 and 8 node systems there does not seem to
> be a benefit from what I hear.
>
> > Now, perhaps the fear is that if we access that remote memory
> > *repeatedly* the aggregate cost will exceed what it would have cost to
> > fault that page into the local node just once.  But it takes a lot of
> > accesses for that to be true, and most of the time you won't get them.
> > Even if you do, I bet many workloads will prefer even performance
> > across all the accesses over a very slow first access followed by
> > slightly faster subsequent accesses.
>
> Many HPC workloads prefer the opposite.
>

And they know how to tune accordingly.

> > In an ideal world, the kernel would put the hottest pages on the local
> > node and the less-hot pages on remote nodes, moving pages around as
> > the workload shifts.  In practice, that's probably pretty hard.
> > Fortunately, it's not nearly as important as making sure we don't
> > unnecessarily hit the disk, which is infinitely slower than any memory
> > bank.
>
> Shifting pages involves similar tradeoffs as zone reclaim vs. remote
> allocations.

In practice it really is hard for the kernel to do this automatically.
Automatic NUMA balancing will help if the data is mapped, but not if it's
accessed through buffered reads/writes, because there is no hinting
information available for those right now. At some point we may need to
tackle IO locality, but it'll take time for users to get experience with
automatic balancing as it is before taking further steps. That's an aside
to the current discussion.

-- 
Mel Gorman
SUSE Labs
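
For readers who want to see which setting their own machine is running
with, here is a minimal sketch in Python. It assumes the sysctl is exposed
at /proc/sys/vm/zone_reclaim_mode and that the value is a bitmask with the
bit meanings given in the kernel's vm sysctl documentation. The function
names are made up for illustration and are not part of any kernel tooling.

```python
from pathlib import Path
from typing import List, Optional

# Bit meanings as described in the kernel's vm sysctl documentation
# for zone_reclaim_mode (assumed here; check your kernel's docs).
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value: int) -> List[str]:
    """Return the descriptions of the bits set in `value` (empty list = off)."""
    return [desc for bit, desc in ZONE_RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(
    path: str = "/proc/sys/vm/zone_reclaim_mode",
) -> Optional[int]:
    """Read the current setting, or return None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no NUMA support in this kernel, or unexpected content.
        return None


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("zone_reclaim_mode is not readable on this system")
    else:
        flags = decode_zone_reclaim_mode(mode)
        print(mode, flags if flags else ["zone reclaim off"])
```

A value of 0 means zone reclaim is off, which is the default being argued
for above. The script only reads the setting; changing it is left to
sysctl and the administrator's judgement about the workload.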