Hi everyone,

On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
>
> > Nice!
>
> Let's not get overenthused. There has been no conclusive proof that the
> overhead introduced by automatic migration schemes is consistently less
> than the benefit obtained by moving the data. Quite to the contrary. We
> have over a decade's worth of research and attempts on this issue and
> there was no general improvement to be had that way.
>
> The reason that the manual placement interfaces exist is because there
> was no generally beneficial migration scheme available. The manual
> interfaces allow the writing of various automatic migration schemes in
> user space.
>
> If we can come up with something that is an improvement then let's go
> this way, but I am skeptical.

I generally find the patchset very interesting, but I think like
Christoph. It's good to give the patchset more visibility as it's quite
unique in this area, but when talking with Lee I also thought the
synchronous migrate-on-fault was probably too aggressive, and I like an
algorithm where memory follows CPUs and CPUs follow memory in a totally
dynamic way.

I suggested to Lee during our chat (and also to others during KS +
Plumbers) that we need a more dynamic algorithm that works in the
background asynchronously. Specifically, I want the CPU to follow
memory closely whenever idle status allows it (changing CPU at context
switch is cheap; I don't like pinning or a "single" home node concept),
and then have memory slowly follow the CPU in tandem in the background
with a kernel thread. With the CPU following memory fast and memory
following the CPU slowly, things should converge on an optimal
placement over time (two rough userspace sketches of the two halves,
built only on the existing manual interfaces, are appended at the end
of this mail).

I like the migration being done from a kthread like khugepaged/ksmd,
not synchronously adding latency to the page fault (or having to take
down ptes to trigger the migrate-on-fault; migration should never
require the app to take a fault just to migrate, it should happen
transparently as far as userland is concerned, well, of course unless
it trips on the migration pte at just the wrong time :).

So the patchset looks very interesting, and it may actually be optimal
for some slower hardware, but these days I have the perception that
memory being remote isn't as big a deal as not keeping both memory
controllers in action simultaneously (using just one controller is
worse than using both simultaneously from the wrong end; locality is
not as important as not stepping on each other's toes). So in general,
synchronous migrate-on-fault seems a bit too aggressive to me and not
ideal for newer hardware. Still, this is one of the most interesting
patchsets in this area I've seen so far.

The home node logic ironically may be optimal in the most important
benchmark, because the way that benchmark is set up all VMs are fairly
small and there are plenty of them, so it will never happen that a VM
has more memory than what can fit in the RAM of a single node. But I
like a dynamic approach that works best in all environments, even if
it's clearly not as simple and maybe not as optimal in the one relevant
benchmark we care about. I'm unsure what the home node logic is
supposed to decide when the task has two, three, or four times the RAM
that fits in a single node (and that may not be such an uncommon
scenario after all). I admit I haven't read enough about this home node
logic, but it never got any traction with me personally, as in my view
there should never be any single "home" for any task.
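
To make the "memory follows CPU slowly" half concrete, here is a
minimal, purely hypothetical userspace sketch built only on the manual
placement interfaces Christoph mentions (move_pages(2) plus libnuma).
It is not code from the patchset, and the in-kernel version would of
course be a kthread migrating another task's pages rather than the task
nudging its own buffer. Once per second the loop checks which node the
task is running on and moves one small batch of pages toward it:

/*
 * Hypothetical sketch: "memory follows CPU slowly" from userspace.
 * Build: gcc -O2 memfollow.c -lnuma -o memfollow
 */
#define _GNU_SOURCE
#include <numaif.h>		/* move_pages(), MPOL_MF_MOVE */
#include <numa.h>		/* numa_node_of_cpu() */
#include <sched.h>		/* sched_getcpu() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPAGES	4096
#define BATCH	256	/* small batches: background drift, not a stall */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *buf;
	void *pages[BATCH];
	int nodes[BATCH], status[BATCH];
	size_t next = 0;

	if (posix_memalign((void **)&buf, psz, NPAGES * psz))
		return 1;

	/* Touch every page so there is something to migrate. */
	for (size_t i = 0; i < NPAGES; i++)
		buf[i * psz] = 1;

	for (;;) {
		/* Which node does the CPU we're on right now belong to? */
		int node = numa_node_of_cpu(sched_getcpu());

		for (size_t i = 0; i < BATCH; i++) {
			pages[i] = buf + ((next + i) % NPAGES) * psz;
			nodes[i] = node;
		}
		next = (next + BATCH) % NPAGES;

		/*
		 * pid 0 == self.  Pages that are busy or already on
		 * 'node' fail softly with a negative errno in
		 * status[], which is exactly the point: no
		 * synchronous latency anywhere, just slow drift
		 * toward wherever the task happens to run.
		 */
		if (move_pages(0, BATCH, pages, nodes, status,
			       MPOL_MF_MOVE) < 0)
			perror("move_pages");

		sleep(1);	/* memory follows the CPU *slowly* */
	}
}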
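
The other half, "CPU follows memory fast", really belongs in the
scheduler (picking a CPU at context switch is cheap, and it should be a
soft preference, not pinning), so userspace can only approximate the
idea. Still, equally hypothetically: move_pages(2) in query mode
(nodes == NULL) reports the node each page sits on without moving
anything, and the task can then be steered toward the CPUs of the
majority node. The hard affinity below is a crude stand-in for what
would be a soft hint in the real thing:

/*
 * Hypothetical sketch: "CPU follows memory" approximated from
 * userspace.  Build: gcc -O2 cpufollow.c -lnuma -o cpufollow
 */
#define _GNU_SOURCE
#include <numaif.h>		/* move_pages() */
#include <numa.h>		/* numa_node_to_cpus(), bitmask helpers */
#include <sched.h>		/* sched_setaffinity() */
#include <stdlib.h>
#include <unistd.h>

#define MAXNODES 64		/* plenty for a sketch */

/* Steer the calling task toward the node holding most of the
 * sampled pages. */
static void follow_memory(void **pages, unsigned long count)
{
	int status[count], per_node[MAXNODES] = { 0 }, best = 0;

	/* nodes == NULL: query only, report each page's node. */
	if (move_pages(0, count, pages, NULL, status, 0) < 0)
		return;

	for (unsigned long i = 0; i < count; i++) {
		if (status[i] < 0 || status[i] >= MAXNODES)
			continue;	/* not present, or out of range */
		per_node[status[i]]++;
		if (per_node[status[i]] > per_node[best])
			best = status[i];
	}

	/* Crude stand-in for a scheduler preference: bind to the
	 * majority node's CPUs. */
	struct bitmask *bm = numa_allocate_cpumask();
	cpu_set_t set;
	CPU_ZERO(&set);
	if (numa_node_to_cpus(best, bm) == 0) {
		for (unsigned int c = 0; c < bm->size; c++)
			if (numa_bitmask_isbitset(bm, c))
				CPU_SET(c, &set);
		sched_setaffinity(0, sizeof(set), &set);
	}
	numa_free_cpumask(bm);
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	enum { N = 256 };
	char *buf;
	void *pages[N];

	if (posix_memalign((void **)&buf, psz, N * psz))
		return 1;
	for (int i = 0; i < N; i++) {
		buf[i * psz] = 1;	/* fault the page in */
		pages[i] = buf + (size_t)i * psz;
	}
	follow_memory(pages, N);
	return 0;
}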