RFC Automatic Page Migration

At the Linux Plumbers Conference, Andi Kleen again encouraged me to resubmit my automatic page migration patches because he thinks they will be useful for virtualization. Later, in the Virtualization mini-conf, the subject came up during a presentation about adding NUMA awareness to qemu/kvm. After the presentation, I discussed these series with Andrea Arcangeli, and he also encouraged me to post them.

My position within HP has changed such that I'm not sure how much time I'll have to spend on this area, nor whether I'll have access to the larger NUMA platforms on which to test the patches thoroughly. However, here is the third of four series that comprise my shared policy enhancements and lazy/auto-migration enhancements. I have rebased the patches against a recent mmotm tree. The rebase built cleanly, booted, and passed a few ad hoc tests on x86_64. I've made a pass over the patch descriptions to update them. If there is sufficient interest in merging this, I'll do what I can to assist in the completion and testing of the series.

Based atop the previously posted:

1) Shared policy cleanup, fixes, mapped file policy
2) Migrate-on-fault, a.k.a. Lazy Page Migration, facility

To follow:

4) a Migration Cache -- originally written by Marcelo Tosatti

I'll announce this series and the automatic/lazy migration series to follow on lkml, linux-mm, ... However, I'll limit the actual posting to linux-numa to avoid spamming the other lists.

---

This series of patches hooks up linux page migration to the task load balancing mechanism. The effect is that, when load balancing moves a task to a cpu on a different node from the one where the task last executed, the task is notified of the change using a variant of the mechanism used to notify a task of pending signals. When the task returns to user state, it attempts to migrate to the new node any pages not already on that node in those of its vm areas under control of default policy.
By default, the task will use lazy migration to migrate "misplaced" pages. When notified of an inter-node migration, the task will walk its address space, attempting to unmap [remove all ptes for] any anonymous pages in the task's page table. When the task subsequently touches any of these unmapped pages, it will incur a swap page fault. The swap fault handler will restore the pte if the cached page's location matches its mempolicy; otherwise, the "migrate-on-fault" mechanism will attempt to migrate the page to the correct node.

Lazy migration may be disabled by writing zero to the per-cpuset auto_migrate_lazy file. In that case, automigration will use direct, synchronous migration to pull all anonymous pages mapped by the task to the new node. Why lazy migration by default? Think of the effect of direct, synchronous migration, in this context, on large multi-threaded programs.

Automatic page migration is disabled by default, but can be enabled by writing a non-zero value to the per-cpuset auto_migrate_enable file. Furthermore, to prevent thrashing, this series provides a second, experimental per-cpuset control, auto_migrate_interval. The load balancer will not move a task to a different node if the task has moved to a new node within the last auto_migrate_interval seconds. [The user interface is in seconds; internally the value is kept in HZ.] The idea is to let the task amortize the cost of the migration by giving it time to benefit from local references to the migrated pages. Some experimentation and tuning will be necessary to determine the appropriate default value for this parameter on various platforms.

An additional per-cpuset control -- migrate_max_mapcount -- adjusts the threshold page mapcount at which non-privileged users can migrate shared pages. This control allows experimentation with more aggressive auto-migration.

Why "per-cpuset controls"? Originally, cpusets were the only convenient "soft partitioning" or "task grouping" mechanism available.
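As a usage sketch, the per-cpuset control files named above might be driven as follows. The file names come straight from this description, but the cpuset mount point and group name here are hypothetical, and on a kernel without these patches the files do not exist, so the script falls back to a scratch directory purely to illustrate the writes:

```shell
#!/bin/sh
# Sketch: tuning the auto-migration controls for one cpuset.
# Assumed (hypothetical) location: cpusets mounted at /dev/cpuset,
# with a group named "mygroup".  Without the patches applied, fall
# back to a scratch directory so the writes can still be exercised.
CPUSET="${CPUSET:-/dev/cpuset/mygroup}"
[ -d "$CPUSET" ] || CPUSET=$(mktemp -d)

echo 1  > "$CPUSET/auto_migrate_enable"    # enable automatic migration
echo 1  > "$CPUSET/auto_migrate_lazy"      # lazy mode: unmap now, migrate on fault
echo 30 > "$CPUSET/auto_migrate_interval"  # no re-move within 30s of a node change
echo 4  > "$CPUSET/migrate_max_mapcount"   # unprivileged migration of pages mapped <= 4x
```

Writing 0 to auto_migrate_lazy instead would select direct, synchronous migration, with the caveat for large multi-threaded programs noted above.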
Now that "containers" or "control groups" are available, one might consider a "NUMA behavior" control group, orthogonal to cpusets, to control this sort of behavior. However, because cpusets are closely tied to NUMA resource partitioning and locality management, they still seem like a good place to contain the migration and mempolicy behavior controls.

Finally, the series adds a per-process control file -- /proc/<pid>/migrate. Writing to this file causes the task to simulate an inter-node migration by walking its address space and unmapping anonymous pages so that they will be checked for [mis]placement on next touch; or by directly migrating them if lazy migration is disabled for the task's cpuset. This can be used to test the automigration facility, or to force a task to reestablish its anonymous page NUMA footprint at any time.

---

Lee Schermerhorn

--
To unsubscribe from this list: send the line "unsubscribe linux-numa" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html
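[P.S. A minimal sketch of exercising the per-process file described above. Only the path /proc/<pid>/migrate is taken from the series description; the probe-before-write and the messages are illustrative, since the file exists only with these patches applied:]

```shell
#!/bin/sh
# Sketch: ask a task (here, this shell itself) to re-evaluate its
# anonymous page placement via the per-process file this series adds.
PID=$$
FILE="/proc/$PID/migrate"

if [ -w "$FILE" ]; then
    # Unmaps the task's anonymous pages for check-on-next-touch, or
    # migrates them directly if lazy migration is off in its cpuset.
    echo 1 > "$FILE" && echo "migration pass triggered for pid $PID"
else
    echo "no $FILE: this kernel lacks the automigration series"
fi
```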