Re: [PATCH -mm] do_migrate_pages() calls migrate_to_node() even if task is already on a correct node

KOSAKI Motohiro <mkosaki@xxxxxxxxxx> · Thu, 22 Mar 2012 15:07:39 -0400

(3/22/12 2:51 PM), Christoph Lameter wrote:
On Thu, 22 Mar 2012, KOSAKI Motohiro wrote:

CC to Christoph.

While moving tasks between cpusets I noticed some strange behavior.
Specifically if the nodes of the destination
cpuset are a subset of the nodes of the source cpuset do_migrate_pages()
will move pages that are already on a node
in the destination cpuset. The reason for this is do_migrate_pages() does
not check whether each node in the source
nodemask is in the destination nodemask before calling migrate_to_node(). If
we simply do this check and skip them
when the source is in the destination moving we wont move nodes that dont
need to be moved.

Adding a little debug printk to migrate_to_node():

Without this change migrating tasks from a cpuset containing nodes 0-7 to a
cpuset containing nodes 3-4, we migrate
from ALL the nodes even if they are in the both the source and destination
nodesets:

Migrating 7 to 4
Migrating 6 to 3
Migrating 5 to 4
Migrating 4 to 3
Migrating 1 to 4
Migrating 3 to 4
Migrating 0 to 3
Migrating 2 to 3

Wait.

This may be non-optimal for cpusets, but maybe optimal migrate_pages,
especially
the usecase is HPC. I guess this is intended behavior. I think we need to hear
Christoph's intention.

But, I'm not against this if he has no objection.

The use case for this is if you have an app running on nodes 3,4,5 on your
machine and now you want to shift it to 4,5,6. The expectation is that the
location of the pages relative to the first node stay the same.
Application may manage their locality given a range of nodes and each of
the x .. x+n nodes has their particular purpose.

If you justd copy 3 to 6 then the app may get confused when doing
additional allocations since different types of information is now stored
on the "first" node (which is now 4).

MPOL_INTERLEAVE is more simple situaltion. applications naturally assume the
memory is mapped intealeaving and application threads optimize for it. if we
broke intereaving, the applications may slow down.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>