On November 25, 2024 at 19:33, Michal Hocko wrote:
On Sun 24-11-24 03:09:35, Junjie Fu wrote:
When handling a page fault caused by NUMA balancing (do_numa_page), it is
necessary to decide whether to migrate the current page to another node or
keep it on its current node. For pages with the MPOL_PREFERRED memory
policy, it is sufficient to check whether the first node set in the
nodemask is the same as the node where the page is currently located. If
this is the case, the page should remain in its current state. Otherwise,
migration to another node should be attempted.
This follows from the definition of MPOL_PREFERRED: "This mode sets the
preferred node for allocation. The kernel will try to allocate pages from
this node first and fall back to nearby nodes if the preferred node is low
on free memory. If the nodemask specifies more than one node ID, the first
node in the mask will be selected as the preferred node."
Thus, if the node where the current page resides is not the first node in
the nodemask, it is not the PREFERRED node, and memory migration can be
attempted.
However, in the original code, the check only verifies whether the current
node exists in the nodemask (which may or may not be the first node in the
mask). This could lead to a scenario where, if the current node is not the
first node in the nodemask, the code incorrectly decides not to attempt
migration to other nodes.
This behavior is clearly incorrect. If the target node for migration and
the page's current NUMA node are both within the nodemask but neither is
the first node, they should be treated with the same priority, and
migration attempts should proceed.
The code is clearly confusing, but is there any actual problem to be
solved? IIRC, although we do keep a nodemask for the MPOL_PREFERRED
policy, we do not allow more than a single node to be set there.
Have a look at mpol_new_preferred().
I apologize for my oversight in reviewing how the nodemask is set for the
MPOL_PREFERRED memory policy. After reviewing the mpol_new_preferred
function, I realized that when the memory policy is set, only the first
node from the user's nodemask is copied into the corresponding memory
policy instance's nodemask, as shown in the following code:
static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
{
	if (nodes_empty(*nodes))
		return -EINVAL;

	nodes_clear(pol->nodes);
	node_set(first_node(*nodes), pol->nodes); /* only the first node is set */
	return 0;
}
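I had mistakenly assumed that multiple nodes could be set in pol->nodes,
which led to my incorrect analysis. Because pol->nodes holds exactly one
node, checking whether the current node is in the mask is equivalent to
checking whether it is the first node in the mask.
Therefore, the original code is correct. Thank you all for your responses.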