On Thu, Jul 04, 2013 at 03:36:38PM -0400, Johannes Weiner wrote:
> I was going for the opposite conclusion: that it does not matter
> whether memory is accessed privately or in a shared fashion, because
> there is no obvious connection to its access frequency, not to me at
> least.

There is a relation to access freq; however due to the low sample rate
(once every 100ms or so) we obviously miss all high freq data there.

> > I acknowledge it's a problem and basically I'm making a big assumption
> > that private-dominated workloads are going to be the common case. Threaded
> > application on UMA with heavy amounts of shared data (within cache lines)
> > already suck in terms of performance so I'm expecting programmers already
> > try and avoid this sort of sharing. Obviously we are at a page granularity
> > here so the assumption will depend entirely on alignments and buffer sizes
> > so it might still fall apart.
>
> Don't basically all VM-based multithreaded programs have this usage
> pattern? The whole runtime (text, heap) is shared between threads.
> If some thread-local memory spills over to another node, should the
> scheduler move this thread off node from a memory standpoint? I don't
> think so at all. I would expect it to always gravitate back towards
> this node with the VM on it, only get moved off for CPU load reasons,
> and get moved back as soon as the load situation permits.

All data being allocated on the same heap and being shared in the access
sense doesn't imply all threads will indeed use all data; even if TLS is
not used.

For a concurrent program to reach any useful level of concurrency gain
you need data partitioning. Threads must work on different data sets
otherwise they'd constantly be waiting on serialization -- which makes
your concurrency gain tank.

There are two main issues here:

Firstly; the question is if there's much false sharing on page
granularity.
Typically you want the compute time per data fragment to be
significantly higher than the demux + mux overhead which favours larger
data units.

Secondly; you want your scan period to be at most half the compute time
per data fragment, i.e. at least two scans per fragment. Otherwise
you'll run the risk of not seeing the data being local to that thread.

So for optimal benefit you want to minimize sharing pages between data
fragments and have your data fragment compute time as long as possible.
Luckily both are also goals for maximizing concurrency gain so we should
be good there.

This should cover all 'traditional' concurrent stuff; most of the 'new'
concurrency stuff can be different though -- some of it was simply never
thought out/designed for concurrency and just hopes it works. Others,
most notably the multi-core concurrency stuff, assume the demux+mux
costs are _very_ low and therefore the data fragment and associated
compute time shrink to useless levels :/
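
To put rough numbers on the two conditions above, here is a minimal Python sketch. It is not kernel code. The function names, the 4096-byte page size and the layout (one fragment per thread, packed back to back from offset 0) are all assumptions made for illustration; only the ~100ms scan interval comes from the discussion above.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; real systems may differ


def shared_page_fraction(fragment_bytes, n_fragments, page_size=PAGE_SIZE):
    """Fraction of touched pages holding data from more than one fragment.

    Assumes one fragment per thread, laid out back to back from offset 0.
    """
    owners = {}  # page index -> set of fragment indices touching that page
    for frag in range(n_fragments):
        start = frag * fragment_bytes
        end = start + fragment_bytes - 1  # last byte of this fragment
        for page in range(start // page_size, end // page_size + 1):
            owners.setdefault(page, set()).add(frag)
    shared = sum(1 for frags in owners.values() if len(frags) > 1)
    return shared / len(owners)


def scan_sees_locality(scan_period_ms, compute_ms_per_fragment):
    """True if at least two scan periods fit in one fragment's compute time."""
    return 2 * scan_period_ms <= compute_ms_per_fragment


if __name__ == "__main__":
    # Small fragments: every page is shared between threads.
    print(shared_page_fraction(1024, 8))
    # Page-sized, page-aligned fragments: no page is shared.
    print(shared_page_fraction(4096, 4))
    # 100ms scan period against 500ms and 150ms of compute per fragment.
    print(scan_sees_locality(100, 500), scan_sees_locality(100, 150))
```

Under these assumptions, eight 1 KiB fragments land entirely on shared pages, while page-sized and page-aligned fragments share none. A 100ms scan period needs at least 200ms of compute per fragment to pass the second check.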