Hi everyone,

On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
>
> > Nice!
>
> Let's not get overenthused. There has been no conclusive proof that the
> overhead introduced by automatic migration schemes is consistently less
> than the benefit obtained by moving the data. Quite to the contrary. We
> have over a decade's worth of research and attempts on this issue and
> there was no general improvement to be had that way.
>
> The reason that the manual placement interfaces exist is because there
> was no generally beneficial migration scheme available. The manual
> interfaces allow the writing of various automatic migration schemes in
> user space.
>
> If we can come up with something that is an improvement then let's go
> this way, but I am skeptical.

I generally find the patchset very interesting, but I think like
Christoph. It's good to give the patchset more visibility as it's quite
unique in this area, but when talking with Lee I also thought the
synchronous migrate-on-fault was probably too aggressive, and I like an
algorithm where memory follows CPUs and CPUs follow memory in a totally
dynamic way.

I suggested to Lee during our chat (and also to others during KS +
Plumbers) that we need a more dynamic algorithm that works in the
background asynchronously. Specifically, I want the CPU to follow
memory closely whenever idle status allows it (changing CPU at context
switch is cheap; I don't like pinning or a "single" home node concept),
and then have memory slowly follow the CPU in tandem in the background
with a kernel thread. With the CPU following memory fast and memory
following the CPU slowly, things should converge on an optimal
placement over time (two rough userspace sketches of the two halves,
built only on the existing manual interfaces, are appended at the end
of this mail).

I like the migration being done from a kthread like khugepaged/ksmd,
not synchronously adding latency to the page fault (or having to take
down ptes to trigger the migrate-on-fault; migration should never
require the app to take a fault just to migrate, it should happen
transparently as far as userland is concerned, well, of course unless
it trips on the migration pte at just the wrong time :).

So the patchset looks very interesting, and it may actually be optimal
for some slower hardware, but these days I have the perception that
memory being remote isn't as big a deal as not keeping both memory
controllers in action simultaneously (using just one controller is
worse than using both simultaneously from the wrong end; locality is
not as important as not stepping on each other's toes). So in general,
synchronous migrate-on-fault seems a bit too aggressive to me and not
ideal for newer hardware. Still, this is one of the most interesting
patchsets in this area I've seen so far.

The home node logic ironically may be optimal in the most important
benchmark, because the way that benchmark is set up all VMs are fairly
small and there are plenty of them, so it will never happen that a VM
has more memory than what can fit in the RAM of a single node. But I
like a dynamic approach that works best in all environments, even if
it's clearly not as simple and maybe not as optimal in the one relevant
benchmark we care about. I'm unsure what the home node logic is
supposed to decide when the task has two, three, or four times the RAM
that fits in a single node (and that may not be such an uncommon
scenario after all). I admit I haven't read enough about this home node
logic, but it never got any traction with me personally, as in my view
there should never be any single "home" for any task.
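
To make the "memory follows CPU slowly" half concrete, here is a
minimal, purely hypothetical userspace sketch built only on the manual
placement interfaces Christoph mentions (move_pages(2) plus libnuma).
It is not code from the patchset, and the in-kernel version would of
course be a kthread migrating another task's pages rather than the task
nudging its own buffer. Once per second the loop checks which node the
task is running on and moves one small batch of pages toward it:

/*
 * Hypothetical sketch: "memory follows CPU slowly" from userspace.
 * Build: gcc -O2 memfollow.c -lnuma -o memfollow
 */
#define _GNU_SOURCE
#include <numaif.h>		/* move_pages(), MPOL_MF_MOVE */
#include <numa.h>		/* numa_node_of_cpu() */
#include <sched.h>		/* sched_getcpu() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPAGES	4096
#define BATCH	256	/* small batches: background drift, not a stall */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *buf;
	void *pages[BATCH];
	int nodes[BATCH], status[BATCH];
	size_t next = 0;

	if (posix_memalign((void **)&buf, psz, NPAGES * psz))
		return 1;

	/* Touch every page so there is something to migrate. */
	for (size_t i = 0; i < NPAGES; i++)
		buf[i * psz] = 1;

	for (;;) {
		/* Which node does the CPU we're on right now belong to? */
		int node = numa_node_of_cpu(sched_getcpu());

		for (size_t i = 0; i < BATCH; i++) {
			pages[i] = buf + ((next + i) % NPAGES) * psz;
			nodes[i] = node;
		}
		next = (next + BATCH) % NPAGES;

		/*
		 * pid 0 == self.  Pages that are busy or already on
		 * 'node' fail softly with a negative errno in
		 * status[], which is exactly the point: no
		 * synchronous latency anywhere, just slow drift
		 * toward wherever the task happens to run.
		 */
		if (move_pages(0, BATCH, pages, nodes, status,
			       MPOL_MF_MOVE) < 0)
			perror("move_pages");

		sleep(1);	/* memory follows the CPU *slowly* */
	}
}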
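
The other half, "CPU follows memory fast", really belongs in the
scheduler (picking a CPU at context switch is cheap, and it should be a
soft preference, not pinning), so userspace can only approximate the
idea. Still, equally hypothetically: move_pages(2) in query mode
(nodes == NULL) reports the node each page sits on without moving
anything, and the task can then be steered toward the CPUs of the
majority node. The hard affinity below is a crude stand-in for what
would be a soft hint in the real thing:

/*
 * Hypothetical sketch: "CPU follows memory" approximated from
 * userspace.  Build: gcc -O2 cpufollow.c -lnuma -o cpufollow
 */
#define _GNU_SOURCE
#include <numaif.h>		/* move_pages() */
#include <numa.h>		/* numa_node_to_cpus(), bitmask helpers */
#include <sched.h>		/* sched_setaffinity() */
#include <stdlib.h>
#include <unistd.h>

#define MAXNODES 64		/* plenty for a sketch */

/* Steer the calling task toward the node holding most of the
 * sampled pages. */
static void follow_memory(void **pages, unsigned long count)
{
	int status[count], per_node[MAXNODES] = { 0 }, best = 0;

	/* nodes == NULL: query only, report each page's node. */
	if (move_pages(0, count, pages, NULL, status, 0) < 0)
		return;

	for (unsigned long i = 0; i < count; i++) {
		if (status[i] < 0 || status[i] >= MAXNODES)
			continue;	/* not present, or out of range */
		per_node[status[i]]++;
		if (per_node[status[i]] > per_node[best])
			best = status[i];
	}

	/* Crude stand-in for a scheduler preference: bind to the
	 * majority node's CPUs. */
	struct bitmask *bm = numa_allocate_cpumask();
	cpu_set_t set;
	CPU_ZERO(&set);
	if (numa_node_to_cpus(best, bm) == 0) {
		for (unsigned int c = 0; c < bm->size; c++)
			if (numa_bitmask_isbitset(bm, c))
				CPU_SET(c, &set);
		sched_setaffinity(0, sizeof(set), &set);
	}
	numa_free_cpumask(bm);
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	enum { N = 256 };
	char *buf;
	void *pages[N];

	if (posix_memalign((void **)&buf, psz, N * psz))
		return 1;
	for (int i = 0; i < N; i++) {
		buf[i * psz] = 1;	/* fault the page in */
		pages[i] = buf + (size_t)i * psz;
	}
	follow_memory(pages, N);
	return 0;
}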