Re: [RFC PATCH] mm/mempolicy: add MPOL_PREFERRED_STRICT memory policy

"Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> · Wed, 13 Oct 2021 19:27:03 +0530

On 10/13/21 18:28, Aneesh Kumar K.V wrote:
On 10/13/21 18:20, Michal Hocko wrote:
On Wed 13-10-21 18:05:49, Aneesh Kumar K.V wrote:
On 10/13/21 16:18, Michal Hocko wrote:
On Wed 13-10-21 12:42:34, Michal Hocko wrote:
[Cc linux-api]

On Wed 13-10-21 15:15:39, Aneesh Kumar K.V wrote:
This mempolicy mode can be used with either the set_mempolicy(2)
or mbind(2) interfaces.  Like the MPOL_PREFERRED interface, it
allows an application to set a preference node from which the kernel
will fulfill memory allocation requests. Unlike the MPOL_PREFERRED mode,
it takes a set of nodes. The nodes in the nodemask are used as fallback
allocation nodes if memory is not available on the preferred node.
Unlike MPOL_PREFERRED_MANY, it will not fall back memory allocations
to all nodes in the system. Like the MPOL_BIND interface, it works over a
set of nodes and will cause a SIGSEGV or invoke the OOM killer if
memory is not available on those preferred nodes.

This patch helps applications to hint a memory allocation preference node
and fall back to _only_ a set of nodes if the memory is not available
on the preferred node.  Fallback allocation is attempted from the node
which is nearest to the preferred node.

This new memory policy gives applications explicit control over slow
memory allocation and avoids the default fallback to slow memory NUMA nodes.
The difference from MPOL_BIND is the ability to specify a preferred node,
which is the first node in the nodemask argument passed.

I am sorry but I do not understand the semantic difference from
MPOL_BIND. Could you be more specific please?

MPOL_BIND
    This mode specifies that memory must come from the set of
    nodes specified by the policy.  Memory will be allocated from
    the node in the set with sufficient free memory that is
    closest to the node where the allocation takes place.

MPOL_PREFERRED_STRICT
    This mode specifies that the allocation should be attempted
    from the first node specified in the nodemask of the policy.
    If that allocation fails, the kernel will search other nodes
    in the nodemask, in order of increasing distance from the
    preferred node based on information provided by the platform
    firmware.
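
To make the MPOL_PREFERRED_STRICT description above concrete, here is a
minimal userspace sketch of the fallback ordering it implies. It is not code
from the patch: the 4-node distance table, MAX_NODES, and the helper names
first_node_in() and strict_fallback_order() are all made up for illustration,
and a real kernel would take the distances from firmware.

```c
#include <assert.h>

#define MAX_NODES 4   /* assumption: a small 4-node machine */

/* Hypothetical SLIT-style distances; real values come from firmware. */
static const int node_distance[MAX_NODES][MAX_NODES] = {
    {10, 80, 20, 40},
    {80, 10, 40, 20},
    {20, 40, 10, 80},
    {40, 20, 80, 10},
};

/* Lowest-numbered node set in mask, or -1 if the mask is empty. */
static int first_node_in(unsigned int mask)
{
    for (int n = 0; n < MAX_NODES; n++)
        if (mask & (1u << n))
            return n;
    return -1;
}

/*
 * Fill order[] with the nodes of mask: the preferred (first) node, then the
 * remaining nodes by increasing distance from it. Ties go to the lower node
 * number. Returns the number of entries written.
 */
static int strict_fallback_order(unsigned int mask, int order[MAX_NODES])
{
    int preferred = first_node_in(mask);
    unsigned int remaining = mask & ((1u << MAX_NODES) - 1);
    int count = 0;

    assert(order != 0);
    if (preferred < 0)
        return 0;

    while (remaining) {
        int best = -1;

        for (int n = 0; n < MAX_NODES; n++) {
            if (!(remaining & (1u << n)))
                continue;
            if (best < 0 ||
                node_distance[preferred][n] < node_distance[preferred][best])
                best = n;
        }
        order[count++] = best;          /* nearest node not yet listed */
        remaining &= ~(1u << best);
    }
    return count;
}
```

With this table, a nodemask of {1,2,3} is tried in the order 1, 3, 2, because
node 3 is nearer to node 1 than node 2 is.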

The difference is the ability to specify the preferred node as the first
node in the nodemask, and all fallback allocations are based on the distance
from the preferred node. With MPOL_BIND they are based on the node where
the allocation takes place.
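
A small model of that difference, as I read the two descriptions: the only
thing that changes is the node the distances are measured from. This is a
sketch under assumptions, not kernel code. The distance table, the
free_pages[] array, and the names nearest_usable(), bind_pick() and
preferred_strict_pick() are hypothetical.

```c
#include <assert.h>

#define NODES 4   /* assumption: a small 4-node machine */

/* Hypothetical firmware distances between nodes. */
static const int slit[NODES][NODES] = {
    {10, 80, 20, 40},
    {80, 10, 40, 20},
    {20, 40, 10, 80},
    {40, 20, 80, 10},
};

/* Nearest node to origin that is in mask and still has free pages, or -1. */
static int nearest_usable(int origin, unsigned int mask,
                          const long free_pages[NODES])
{
    int best = -1;

    assert(origin >= 0 && origin < NODES);
    for (int n = 0; n < NODES; n++) {
        if (!(mask & (1u << n)) || free_pages[n] <= 0)
            continue;
        if (best < 0 || slit[origin][n] < slit[origin][best])
            best = n;
    }
    return best;
}

/* MPOL_BIND as described: measure from the node doing the allocation. */
static int bind_pick(int local_node, unsigned int mask,
                     const long free_pages[NODES])
{
    return nearest_usable(local_node, mask, free_pages);
}

/* MPOL_PREFERRED_STRICT as proposed: measure from the first node in mask. */
static int preferred_strict_pick(unsigned int mask,
                                 const long free_pages[NODES])
{
    for (int n = 0; n < NODES; n++)
        if (mask & (1u << n))
            return nearest_usable(n, mask, free_pages);
    return -1;   /* empty mask */
}
```

For a task running on node 1 with nodemask {0,2,3}, bind_pick() lands on
node 3 while preferred_strict_pick() lands on node 0, and falls back to node 2
once node 0 is exhausted. A return of -1 stands in for the SIGSEGV/OOM case.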

OK, this makes it more clear. Thanks!

I am still not sure the semantic makes sense though. Why should
the lowest node in the nodemask have any special meaning? What if it is
a node with a higher number that somebody prefers to start with?

That is true. I haven't been able to find an easy way to specify the 
preferred node other than expressing it as first node in the node mask. 
Yes, it limits the usage of the policy. Any alternate suggestion?

We could do
set_mempolicy(MPOL_PREFERRED, nodemask(nodeX))
set_mempolicy(MPOL_PREFERRED_EXTEND, nodemask(fallback nodemask for above PREFERRED policy))

But that really complicates the interface?

Another option is to keep this mbind(2) specific and overload flags to 
be the preferred nodeid.

mbind(va, len, MPOL_PREFERRED_STRICT, nodemask, max_node, preferred_node);
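
To show what an explicit preferred node would buy, here is a userspace sketch
of the ordering when the preferred node is passed separately instead of being
taken as the lowest bit in the nodemask. None of this is in the patch: the
distance table and order_with_explicit_preferred() are hypothetical, and
rejecting a preferred node that is outside the mask is my assumption about
how such an interface might behave.

```c
#include <assert.h>

#define NR_NODES 4   /* assumption: a small 4-node machine */

/* Hypothetical firmware distances between nodes. */
static const int distance_from[NR_NODES][NR_NODES] = {
    {10, 80, 20, 40},
    {80, 10, 40, 20},
    {20, 40, 10, 80},
    {40, 20, 80, 10},
};

/*
 * Write the allocation order for mask into order[], starting at the
 * explicitly given preferred node and continuing by increasing distance
 * from it. Returns the number of entries, or -1 if preferred is not a
 * member of mask (assumed to be an invalid request).
 */
static int order_with_explicit_preferred(unsigned int mask, int preferred,
                                         int order[NR_NODES])
{
    unsigned int remaining;
    int count = 0;

    assert(order != 0);
    if (preferred < 0 || preferred >= NR_NODES ||
        !(mask & (1u << preferred)))
        return -1;

    order[count++] = preferred;
    remaining = mask & ((1u << NR_NODES) - 1) & ~(1u << preferred);

    while (remaining) {
        int best = -1;

        for (int n = 0; n < NR_NODES; n++) {
            if (!(remaining & (1u << n)))
                continue;
            if (best < 0 ||
                distance_from[preferred][n] < distance_from[preferred][best])
                best = n;
        }
        order[count++] = best;
        remaining &= ~(1u << best);
    }
    return count;
}
```

With nodemask {0,2,3} and preferred node 3, the order becomes 3, 0, 2, which
the first-node-in-mask rule cannot express.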

 -aneesh