"Huang, Ying" <ying.huang@xxxxxxxxx> writes: > Donet Tom <donettom@xxxxxxxxxxxxx> writes: > >> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >> nodes") added support for migrate on protnone reference with MPOL_BIND >> memory policy. This allowed numa fault migration when the executing node >> is part of the policy mask for MPOL_BIND. This patch extends migration >> support to MPOL_PREFERRED_MANY policy. >> >> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >> the kernel should not allocate pages from the slower memory tier via >> allocation control zonelist fallback. Instead, we should move cold pages >> from the faster memory node via memory demotion. For a page allocation, >> kswapd is only woken up after we try to allocate pages from all nodes in >> the allocation zone list. This implies that, without using memory >> policies, we will end up allocating hot pages in the slower memory tier. >> >> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >> allocation control when we have memory tiers in the system. With >> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >> of faster memory nodes. When we fail to allocate pages from the faster >> memory node, kswapd would be woken up, allowing demotion of cold pages >> to slower memory nodes. >> >> With the current kernel, such usage of memory policies implies we can't >> do page promotion from a slower memory tier to a faster memory tier >> using numa fault. This patch fixes this issue. >> >> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >> mask, we allow numa migration to the executing nodes. 
>> If the executing
>> node is not in the policy node mask but the folio is already allocated
>> based on policy preference (the folio node is in the policy node mask),
>> we don't allow numa migration. If both the executing node and folio node
>> are outside the policy node mask, we allow numa migration to the
>> executing nodes.
>>
>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@xxxxxxxxxx>
>> Signed-off-by: Donet Tom <donettom@xxxxxxxxxxxxx>
>> ---
>>  mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 73d698e21dae..8c4c92b10371 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
>>  	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
>>  		return -EINVAL;
>>  	if (*flags & MPOL_F_NUMA_BALANCING) {
>> -		if (*mode != MPOL_BIND)
>> +		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
>> +			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
>> +		else
>>  			return -EINVAL;
>> -		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
>>  	}
>>  	return 0;
>>  }
>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n)
>>  	kmem_cache_free(sn_cache, n);
>>  }
>>
>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
>> +					struct mempolicy *pol)
>> +{
>> +	/* if the executing node is in the policy node mask, migrate */
>> +	if (node_isset(exec_node, pol->nodes))
>> +		return true;
>> +
>> +	/* If the folio node is in policy node mask, don't migrate */
>> +	if (node_isset(folio_node, pol->nodes))
>> +		return false;
>> +	/*
>> +	 * both the folio node and executing node are outside the policy nodemask,
>> +	 * migrate as normal numa fault migration.
>> +	 */
>> +	return true;
>
> Why?  This may cause some unexpected result.  For example, pages may be
> distributed among multiple sockets unexpectedly.
> So, I prefer the more
> conservative policy, that is, only migrate if this node is in
> pol->nodes.
>

This will only have an impact if the user specifies
MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting
that frequently accessed memory pages be migrated. Memory policy
MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of
policy->nodes. For the specific use case that I am interested in, it
should be okay to restrict it to policy->nodes. However, I am wondering
if this is too restrictive given the definition of MPOL_PREFERRED_MANY.

-aneesh
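
To make the difference between the two options concrete, here is a minimal
userspace sketch (not kernel code) of both decisions. It assumes a nodemask
can be modelled as a plain bitmask; the sketch_* names are hypothetical
stand-ins for nodemask_t, node_isset() and the helper in the patch. The two
variants only disagree when both the executing node and the folio node are
outside the policy mask.

```c
#include <stdbool.h>

/*
 * Userspace sketch only: a nodemask is modelled as a plain bitmask.
 * This is an assumption for illustration, not the kernel's nodemask_t.
 */
typedef unsigned long sketch_nodemask_t;

/* Hypothetical stand-in for node_isset(). */
static bool sketch_node_isset(int node, sketch_nodemask_t mask)
{
	return (mask >> node) & 1UL;
}

/* Decision as posted in the patch under review. */
static bool sketch_patch_should_migrate(int exec_node, int folio_node,
					sketch_nodemask_t pol_nodes)
{
	/* executing node is in the policy mask: migrate */
	if (sketch_node_isset(exec_node, pol_nodes))
		return true;

	/* folio already sits on a preferred node: don't migrate */
	if (sketch_node_isset(folio_node, pol_nodes))
		return false;

	/* both outside the mask: behave like a normal numa fault */
	return true;
}

/* Conservative variant: migrate only if the executing node is in the mask. */
static bool sketch_conservative_should_migrate(int exec_node,
					       sketch_nodemask_t pol_nodes)
{
	return sketch_node_isset(exec_node, pol_nodes);
}
```

With a policy mask covering nodes 0 and 1, a fault on node 2 for a folio on
node 3 is migrated by the first variant and left in place by the second. All
other combinations give the same answer in both.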