RFC Automatic Page Migration

At the Linux Plumbers Conference, Andi Kleen again encouraged me to resubmit my automatic page migration patches because he thinks they will be useful for virtualization. Later, in the Virtualization mini-conf, the subject came up during a presentation about adding NUMA awareness to qemu/kvm. After the presentation, I discussed these series with Andrea Arcangeli, and he also encouraged me to post them.

My position within HP has changed such that I'm not sure how much time I'll have to spend on this area, nor whether I'll have access to the larger NUMA platforms on which to test the patches thoroughly. However, here is the third of four series that comprise my shared policy enhancements and lazy/auto-migration enhancements. I have rebased the patches against a recent mmotm tree. The rebase built cleanly, booted, and passed a few ad hoc tests on x86_64. I've made a pass over the patch descriptions to update them. If there is sufficient interest in merging this, I'll do what I can to assist in the completion and testing of the series.

Based atop the previously posted:

1) Shared policy cleanup, fixes, mapped file policy
2) Migrate-on-fault, a.k.a. Lazy Page Migration, facility

To follow:

4) a Migration Cache -- originally written by Marcelo Tosatti

I'll announce this series and the automatic/lazy migration series to follow on lkml, linux-mm, ... However, I'll limit the actual posting to linux-numa to avoid spamming the other lists.

---

This series of patches hooks up linux page migration to the task load balancing mechanism. The effect is that, when load balancing moves a task to a cpu on a different node from the one where the task last executed, the task is notified of the change using a variant of the mechanism used to notify a task of pending signals. When the task returns to user state, it attempts to migrate to the new node any pages not already on that node in those of its vm areas under control of default policy.
By default, the task will use lazy migration to migrate "misplaced" pages. When notified of an inter-node migration, the task will walk its address space, attempting to unmap [remove all ptes for] any anonymous pages in the task's page table. When the task subsequently touches any of these unmapped pages, it will incur a swap page fault. The swap fault handler will restore the pte if the cached page's location matches its mempolicy; otherwise, the "migrate-on-fault" mechanism will attempt to migrate the page to the correct node.

Lazy migration may be disabled by writing zero to the per-cpuset auto_migrate_lazy file. In that case, automigration will use direct, synchronous migration to pull all anonymous pages mapped by the task to the new node. Why lazy migration by default? Think of the effect of direct, synchronous migration, in this context, on large multi-threaded programs.

Automatic page migration is disabled by default, but can be enabled by writing a non-zero value to the per-cpuset auto_migrate_enable file. Furthermore, to prevent thrashing, this series provides a second, experimental per-cpuset control, auto_migrate_interval. The load balancer will not move a task to a different node if the task has moved to a new node within the last auto_migrate_interval seconds. [The user interface is in seconds; internally the value is kept in HZ.] The idea is to let the task amortize the cost of the migration by giving it time to benefit from local references to the migrated pages. Some experimentation and tuning will be necessary to determine the appropriate default value for this parameter on various platforms.

An additional per-cpuset control -- migrate_max_mapcount -- adjusts the threshold page mapcount at which non-privileged users can migrate shared pages. This control allows experimentation with more aggressive auto-migration.

Why "per-cpuset controls"? Originally, cpusets were the only convenient "soft partitioning" or "task grouping" mechanism available.
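As a usage sketch, the per-cpuset control files named above might be driven as follows. The file names come straight from this description, but the cpuset mount point and group name here are hypothetical, and on a kernel without these patches the files do not exist, so the script falls back to a scratch directory purely to illustrate the writes:

```shell
#!/bin/sh
# Sketch: tuning the auto-migration controls for one cpuset.
# Assumed (hypothetical) location: cpusets mounted at /dev/cpuset,
# with a group named "mygroup".  Without the patches applied, fall
# back to a scratch directory so the writes can still be exercised.
CPUSET="${CPUSET:-/dev/cpuset/mygroup}"
[ -d "$CPUSET" ] || CPUSET=$(mktemp -d)

echo 1  > "$CPUSET/auto_migrate_enable"    # enable automatic migration
echo 1  > "$CPUSET/auto_migrate_lazy"      # lazy mode: unmap now, migrate on fault
echo 30 > "$CPUSET/auto_migrate_interval"  # no re-move within 30s of a node change
echo 4  > "$CPUSET/migrate_max_mapcount"   # unprivileged migration of pages mapped <= 4x
```

Writing 0 to auto_migrate_lazy instead would select direct, synchronous migration, with the caveat for large multi-threaded programs noted above.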
Now that "containers" or "control groups" are available, one might consider a "NUMA behavior" control group, orthogonal to cpusets, to control this sort of behavior. However, because cpusets are closely tied to NUMA resource partitioning and locality management, they still seem like a good place to contain the migration and mempolicy behavior controls.

Finally, the series adds a per-process control file -- /proc/<pid>/migrate. Writing to this file causes the task to simulate an inter-node migration by walking its address space and unmapping anonymous pages so that they will be checked for [mis]placement on next touch; or by directly migrating them if lazy migration is disabled for the task's cpuset. This can be used to test the automigration facility, or to force a task to reestablish its anonymous page NUMA footprint at any time.

---

Lee Schermerhorn

--
To unsubscribe from this list: send the line "unsubscribe linux-numa" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html
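[P.S. A minimal sketch of exercising the per-process file described above. Only the path /proc/<pid>/migrate is taken from the series description; the probe-before-write and the messages are illustrative, since the file exists only with these patches applied:]

```shell
#!/bin/sh
# Sketch: ask a task (here, this shell itself) to re-evaluate its
# anonymous page placement via the per-process file this series adds.
PID=$$
FILE="/proc/$PID/migrate"

if [ -w "$FILE" ]; then
    # Unmaps the task's anonymous pages for check-on-next-touch, or
    # migrates them directly if lazy migration is off in its cpuset.
    echo 1 > "$FILE" && echo "migration pass triggered for pid $PID"
else
    echo "no $FILE: this kernel lacks the automigration series"
fi
```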