Hi, Dave,

Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> writes:

> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold. If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> The implementation here is pretty simple and entirely unoptimized.
> On any memory hotplug events, assume that a node was added or
> removed and recalculate all migration targets. This ensures that
> the node_demotion[] array is always ready to be used in case the
> new reclaim mode is enabled. This recalculation is far from
> optimal, most glaringly in that it does not even attempt to figure
> out if nodes are actually coming or going.
>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Huang Ying <ying.huang@xxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>  b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>  b/mm/vmscan.c                             |    7 +--
>  3 files changed, 73 insertions(+), 4 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
> @@ -941,6 +941,7 @@ This is value OR'ed together of
>  1	(bit currently ignored)
>  2	Zone reclaim writes dirty pages out
>  4	Zone reclaim swaps pages
> +8	Zone reclaim migrates pages
>  =	===================================
>
>  zone_reclaim_mode is disabled by default. For file servers or workloads
> @@ -965,3 +966,11 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations. These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances. Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure. This migration is
> +performed before swap.
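As I read the documentation above, enabling the new mode from userspace is
then a plain write of the OR'ed bit values to the existing sysctl file.
Just for illustration, a minimal userspace sketch of my reading of the
interface (the bit value 8 comes from the table above;
/proc/sys/vm/zone_reclaim_mode is the usual sysctl path, nothing added by
this patch):

#include <stdio.h>

int main(void)
{
	/*
	 * Enable "zone reclaim migrates pages" (bit 8); other bits from
	 * the table above could be OR'ed in as well. Writing the sysctl
	 * needs root.
	 */
	FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

	if (!f) {
		perror("zone_reclaim_mode");
		return 1;
	}
	fprintf(f, "%d\n", 8);
	return fclose(f) ? 1 : 0;
}
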
> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
> @@ -49,6 +49,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/ptrace.h>
>  #include <linux/oom.h>
> +#include <linux/memory.h>
>
>  #include <asm/tlbflush.h>
>
> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>  	 * Avoid any oddities like cycles that could occur
>  	 * from changes in the topology. This will leave
>  	 * a momentary gap when migration is disabled.
> +	 *
> +	 * This is superfluous for memory offlining since
> +	 * MEM_GOING_OFFLINE does it independently, but it
> +	 * does not hurt to do it a second time.
>  	 */
>  	disable_all_migrate_targets();
>
> @@ -3211,6 +3216,60 @@ again:
>  	/* Is another pass necessary? */
>  	if (!nodes_empty(next_pass))
>  		goto again;
> +}
>
> -	put_online_mems();
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE on
> + * without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *arg)
> +{
> +	switch (action) {
> +	case MEM_GOING_OFFLINE:
> +		/*
> +		 * Make sure there are not transient states where
> +		 * an offline node is a migration target. This
> +		 * will leave migration disabled until the offline
> +		 * completes and the MEM_OFFLINE case below runs.
> +		 */
> +		disable_all_migrate_targets();
> +		break;
> +	case MEM_OFFLINE:
> +	case MEM_ONLINE:
> +		/*
> +		 * Recalculate the target nodes once the node
> +		 * reaches its final state (online or offline).
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_CANCEL_OFFLINE:
> +		/*
> +		 * MEM_GOING_OFFLINE disabled all the migration
> +		 * targets. Reenable them.
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_GOING_ONLINE:
> +	case MEM_CANCEL_ONLINE:
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
>  }
> +
> +static int __init migrate_on_reclaim_init(void)
> +{
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> +	return 0;
> +}
> +late_initcall(migrate_on_reclaim_init);
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>   * ABI. New bits are OK, but existing bits can never change.
>   */
> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_RSVD    (1<<0)	/* (currently ignored/unused) */
> +#define RECLAIM_WRITE   (1<<1)	/* Writeout pages during reclaim */
> +#define RECLAIM_UNMAP   (1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE (1<<3)	/* Migrate pages during reclaim */
>
>  /*
>   * Priority for NODE_RECLAIM. This determines the fraction of pages

I found that RECLAIM_MIGRATE is defined but never referenced in the patch.
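Just to illustrate what I mean, I would have expected something along the
lines of the sketch below to be consulted from the demotion path in
shrink_page_list(). The helper name is made up for this mail, not
something in the patch or in the tree:

/*
 * Hypothetical sketch, not part of the patch: only demote during
 * reclaim when the administrator has opted in by setting the new
 * RECLAIM_MIGRATE bit in vm.zone_reclaim_mode (node_reclaim_mode
 * inside the kernel).
 */
static inline bool reclaim_migrate_enabled(void)
{
	return node_reclaim_mode & RECLAIM_MIGRATE;
}
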
If my understanding of the code is correct, shrink_do_demote_mapping() is
called by shrink_page_list(), which is used by both kswapd and direct
reclaim. So as long as the persistent memory node is onlined,
reclaim-based migration will be enabled regardless of the node reclaim
mode.

Best Regards,
Huang, Ying