On 2/20/24 12:42 AM, Michal Hocko wrote:
> On Mon 19-02-24 20:37:17, Donet Tom wrote:
>>
>> On 2/19/24 19:50, Michal Hocko wrote:
>>> On Sat 17-02-24 01:31:35, Donet Tom wrote:
>>> [...]
>>>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
>>>> +						      struct mempolicy *pol)
>>>> +{
>>>> +	/* if the executing node is in the policy node mask, migrate */
>>>> +	if (node_isset(exec_node, pol->nodes))
>>>> +		return true;
>>>> +
>>>> +	/* If the folio node is in policy node mask, don't migrate */
>>>> +	if (node_isset(folio_node, pol->nodes))
>>>> +		return false;
>>>> +	/*
>>>> +	 * both the folio node and executing node are outside the policy nodemask,
>>>> +	 * migrate as normal numa fault migration.
>>>> +	 */
>>>> +	return true;
>>>> +}
>>>
>>> I have looked at this again and only now noticed that this doesn't
>>> really work as one would expect.
>>>
>>> 	case MPOL_PREFERRED_MANY:
>>> 		/*
>>> 		 * use current page if in policy nodemask,
>>> 		 * else select nearest allowed node, if any.
>>> 		 * If no allowed nodes, use current [!misplaced].
>>> 		 */
>>> 		if (node_isset(curnid, pol->nodes))
>>> 			goto out;
>>> 		z = first_zones_zonelist(
>>> 				node_zonelist(numa_node_id(), GFP_HIGHUSER),
>>> 				gfp_zone(GFP_HIGHUSER),
>>> 				&pol->nodes);
>>> 		polnid = zone_to_nid(z->zone);
>>> 		break;
>>>
>>> This will collapse the whole MPOL_PREFERRED_MANY nodemask into the first
>>> node in that mask. Is that really what we want here? Shouldn't we use
>>> the full nodemask as the migration target?
>>
>> With this patch it will take the full nodemask and find the correct
>> migration target. It will not collapse into the first node.
>
> Correct me if I am wrong, but mpol_misplaced will return the first node
> of the preferred node mask and then migrate_misplaced_folio would use
> it as a target node for alloc_misplaced_dst_folio which performs a
> __GFP_THISNODE allocation, so it won't fall back to a different node.
I think the confusion is between MPOL_F_MOF (migrate on fault) vs MPOL_F_MORON (PROT_NONE fault / NUMA fault). With MPOL_F_MOF alone, what we wanted to achieve was to have mbind() lazily migrate the pages based on the policy node mask. That change was introduced in commit b24f53a0bea3 ("mm: mempolicy: Add MPOL_MF_LAZY") and later dropped by commit 2cafb582173f ("mempolicy: remove confusing MPOL_MF_LAZY dead code"). We still have mpol_misplaced changes that handle node selection for the MPOL_F_MOF flag (this is dead code IIUC).

MPOL_F_MORON was added in commit 5606e3877ad8 ("mm: numa: Migrate on reference policy"), and currently upstream only MPOL_BIND supports that flag. With that flag specified, and with the changes in the patch, mpol_misplaced becomes:

	case MPOL_PREFERRED_MANY:
		if (pol->flags & MPOL_F_MORON) {
			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
				goto out;
			break;
		}

		/*
		 * use current page if in policy nodemask,
		 * else select nearest allowed node, if any.
		 * If no allowed nodes, use current [!misplaced].
		 */
		if (node_isset(curnid, pol->nodes))
			goto out;
		z = first_zones_zonelist(
				node_zonelist(thisnid, GFP_HIGHUSER),
				gfp_zone(GFP_HIGHUSER),
				&pol->nodes);
		polnid = zone_to_nid(z->zone);
		break;

	....
	...
	}

	/* Migrate the folio towards the node whose CPU is referencing it */
	if (pol->flags & MPOL_F_MORON) {
		polnid = thisnid;

		if (!should_numa_migrate_memory(current, folio, curnid, thiscpu))
			goto out;
	}

	if (curnid != polnid)
		ret = polnid;
out:
	mpol_cond_put(pol);

	return ret;
}

i.e., if we can do NUMA migration, we select the currently executing node as the target node; otherwise we end up returning from the function with ret = NUMA_NO_NODE.

-aneesh