On 15 Jul 2021, at 1:51, Huang Ying wrote:

> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a node migration table. This allows creating a single migration
> target for each NUMA node to enable the kernel to do NUMA page
> migrations instead of simply discarding colder pages. A node with no
> target is a "terminal node", so reclaim acts normally there. The
> migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
>
> When memory fills up on a node, memory contents can be
> automatically migrated to another node. The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists. Those lists
> already tell us, if memory is full, where to look next. It would
> also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists. This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering. But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now. It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
>
> RCU is used to allow lock-less read of node_demotion[] and to
> prevent demotion cycles from being observed. If multiple reads of
> node_demotion[] are performed, a single rcu_read_lock() must be
> held over all reads to ensure no cycles are observed. Details are
> as follows.
>
> === What does RCU provide? ===
>
> Imaginge a simple loop which walks down the demotion path looking

s/Imaginge/Imagine

> for the last node:
>
>         terminal_node = start_node;
>         while (node_demotion[terminal_node] != NUMA_NO_NODE) {
>                 terminal_node = node_demotion[terminal_node];
>         }
>
> The initial values are:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = NUMA_NO_NODE;
>
> and are updated to:
>
>         node_demotion[0] = NUMA_NO_NODE;
>         node_demotion[1] = 0;
>
> What guarantees that the cycle is not observed:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = 0;
>
> and would loop forever?
>
> With RCU, a rcu_read_lock/unlock() can be placed around the
> loop. Since the write side does a synchronize_rcu(), the loop
> that observed the old contents is known to be complete before the
> synchronize_rcu() has completed.
>
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
>
> === What does READ_ONCE() provide? ===
>
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[]. This ensures that any
> updates are *eventually* observed.
>
> Consider the above loop again. The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers) and never go back to memory, and *permanently*
> observe bad values for node_demotion[].
>
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
>
>       https://lore.kernel.org/lkml/20150921204327.GH4029@xxxxxxxxxxxxxxxxxx/
>
> This code is unused for now. It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
> Reviewed-by: Yang Shi <shy828301@xxxxxxxxx>
> Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Wei Xu <weixugc@xxxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
>
> --
>
> Changes from 20210618:
>  * Merge patches for data structure definition and initialization
>  * Move RCU usage from the next patch in series per Zi's comments
>
> Changes from 20210302:
>  * Fix typo in node_demotion[] comment
>
> Changes since 20200122:
>  * Make node_demotion[] __read_mostly
>  * Add big node_demotion[] comment
>
> Changes in July 2020:
>  - Remove loop from next_demotion_node() and get_online_mems().
>    This means that the node returned by next_demotion_node()
>    might now be offline, but the worst case is that the
>    allocation fails. That's fine since it is transient.
> ---
>  mm/internal.h   |   5 ++
>  mm/migrate.c    | 216 ++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c |   2 +-
>  3 files changed, 222 insertions(+), 1 deletion(-)

LGTM. Reviewed-by: Zi Yan <ziy@xxxxxxxxxx>

--
Best Regards,
Yan, Zi