Re: [PATCH -V10 1/9] mm/numa: automatically generate node migration order

On 15 Jul 2021, at 1:51, Huang Ying wrote:

> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a node migration table. This allows creating a single migration
> target for each NUMA node to enable the kernel to do NUMA page
> migrations instead of simply discarding colder pages. A node with no
> target is a "terminal node", so reclaim acts normally there.  The
> migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
>
> When memory fills up on a node, memory contents can be
> automatically migrated to another node.  The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists.  Those lists
> already tell us, when memory is full, where to look next.  It would
> also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists.  This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering.  But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now.  It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
>
> RCU is used to allow lock-less reads of node_demotion[] and to
> prevent demotion cycles from being observed.  If multiple reads of node_demotion[]
> are performed, a single rcu_read_lock() must be held over all reads to
> ensure no cycles are observed.  Details are as follows.
>
> === What does RCU provide? ===
>
> Imaginge a simple loop which walks down the demotion path looking

s/Imaginge/Imagine

> for the last node:
>
>         terminal_node = start_node;
>         while (node_demotion[terminal_node] != NUMA_NO_NODE) {
>                 terminal_node = node_demotion[terminal_node];
>         }
>
> The initial values are:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = NUMA_NO_NODE;
>
> and are updated to:
>
>         node_demotion[0] = NUMA_NO_NODE;
>         node_demotion[1] = 0;
>
> What guarantees that the cycle is not observed:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = 0;
>
> and would loop forever?
>
> With RCU, a rcu_read_lock/unlock() can be placed around the
> loop.  Since the write side does a synchronize_rcu(), the loop
> that observed the old contents is known to be complete before the
> synchronize_rcu() has completed.
>
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
>
> === What does READ_ONCE() provide? ===
>
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[].  This ensures that any
> updates are *eventually* observed.
>
> Consider the above loop again.  The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers) and never go back to memory, and *permanently*
> observe bad values for node_demotion[].
>
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
>
> 	https://lore.kernel.org/lkml/20150921204327.GH4029@xxxxxxxxxxxxxxxxxx/
>
> This code is unused for now.  It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
> Reviewed-by: Yang Shi <shy828301@xxxxxxxxx>
> Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Wei Xu <weixugc@xxxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
>
> --
>
> Changes from 20210618:
>  * Merge patches for data structure definition and initialization
>  * Move RCU usage from the next patch in series per Zi's comments
>
> Changes from 20210302:
>  * Fix typo in node_demotion[] comment
>
> Changes since 20200122:
>  * Make node_demotion[] __read_mostly
>  * Add big node_demotion[] comment
>
> Changes in July 2020:
>  - Remove loop from next_demotion_node() and get_online_mems().
>    This means that the node returned by next_demotion_node()
>    might now be offline, but the worst case is that the
>    allocation fails.  That's fine since it is transient.
> ---
>  mm/internal.h   |   5 ++
>  mm/migrate.c    | 216 ++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c |   2 +-
>  3 files changed, 222 insertions(+), 1 deletion(-)

LGTM. Reviewed-by: Zi Yan <ziy@xxxxxxxxxx>


—
Best Regards,
Yan, Zi
