Re: [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave

"Huang, Ying" <ying.huang@xxxxxxxxx> · Mon, 11 Dec 2023 13:53:40 +0800

Hi, Gregory,

Thanks for updated version!

Gregory Price <gourry.memverge@xxxxxxxxx> writes:

> v2:
>   changes / adds:
> - flattened weight matrix to an array at requested of Ying Huang
> - Updated ABI docs per Davidlohr Bueso request
> - change uapi structure to use aligned/fixed-length members as
>   Suggested-by: Arnd Bergmann <arnd@xxxxxxxx>
> - Implemented weight fetch logic in get_mempolicy2
> - mbind2 was changed to take (iovec,len) as function arguments
>   rather than add them to the uapi structure, since they describe
>   where to apply the mempolicy - as opposed to being part of it.
>
>   fixes:
> - fixed bug on fork/pthread
>   Reported-by: Seungjun Ha <seungjun.ha@xxxxxxxxxxx>
>   Link: https://lore.kernel.org/linux-cxl/20231206080944epcms2p76ebb230b9f4595f5cfcd2531d67ab3ce@epcms2p7/
> - fixed bug in mbind2 where MPOL_F_GWEIGHTS was not set when il_weights
>   was omitted after local weights were added as an option
> - fixed bug in interleave logic where an OOB access was made if
>   next_node_in returned MAX_NUMNODES
> - fixed bug in bulk weighted interleave allocator where over-allocation
>   could occur.
>
>   tests:
> - LTP: validated existing get_mempolicy, set_mempolicy, and mbind testss
> - LTP: validated existing get_mempolicy, set_mempolicy, and mbind with
>        MPOL_WEIGHTED_INTERLEAVE added.
> - basic set_mempolicy2 tests and numactl -w --interleave tests
>
>   numactl:
> - Sample numactl extension for set_mempolicy available here:
>   Link: https://github.com/gmprice/numactl/tree/weighted_interleave_master
>
> (summary of LTP tests and manual tests added to end of cover letter)
>
> =====================================================================
>
> This patch set extends the mempolicy interface to enable new
> mempolicies which may require extended data to operate. One
> such policy is included with this set as an example.
>
> There are 3 major "phases" in the patch set:
> 1) Implement a "global weight" mechanism via sysfs, which allows
>    set_mempolicy to implement MPOL_WEIGHTED_INTERLEAVE utilizing
>    weights set by the administrator (or system daemon).
>
> 2) A refactor of the mempolicy creation mechanism to accept an
>    extensible argument structure `struct mempolicy_args` to promote
>    code re-use between the original mempolicy/mbind interfaces and
>    the new extended mempolicy/mbind interfaces.
>
> 3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
>    along with the addition of task-local weights so that per-task
>    weights can be registered for MPOL_WEIGHTED_INTERLEAVE.
>
> =====================================================================
> (Patch 1) : sysfs addition - /sys/kernel/mm/mempolicy/
>
> This feature  provides a way to set interleave weight information under
> sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM/weight
>
>     The sysfs structure is designed as follows.
>
>       $ tree /sys/kernel/mm/mempolicy/
>       /sys/kernel/mm/mempolicy/
>       ├── possible_nodes
>       └── weighted_interleave
>           ├── nodeN
>           │   └── weight
>           └── nodeN+X
>               └── weight
>
> 'mempolicy' is added to '/sys/kernel/mm/' as a control group for
> the mempolicy subsystem.

Is it good to add 'mempolicy' in '/sys/kernel/mm/numa'?  The advantage
is that 'mempolicy' here is in fact "NUMA mempolicy".  The disadvantage
is one more directory nesting.  I have no strong opinion here.

> 'possible_nodes' is added to 'mm/mempolicy' to help describe the
> expected structures under mempolicy directorys. For example,
> possible_nodes describes what nodeN directories wille exist under
> the weighted_interleave directory.

We have '/sys/devices/system/node/possible' already.  Is this just a
duplication?  If so, why?  And, the possible nodes can be gotten via
contents of 'weighted_interleave' too.

And it appears not necessary to make 'weighted_interleave/nodeN'
directory.  Why not just make it a file.

And, can we add a way to reset weight to the default value?  For example
`echo > nodeN/weight` or `echo > nodeN`.

> Internally, weights are represented as an array of unsigned char
>
> static unsigned char iw_table[MAX_NUMNODES];
>
> char was chosen as most reasonable distributions can be represented
> as factors <100, and to minimize memory usage (1KB)
>
> We present possible nodes, instead of online nodes, to simplify the
> management interface, considering that a) the table is of size
> MAX_NUMNODES anyway to simplify fetching of weights (no need to track
> sizes, and MAX_NUMNODES is typically at most 1kb), and b) it simplifies
> management of hotplug events, allowing for weights to be set prior to
> a node coming online, which may be beneficial for immediate use.
>
> the 'weight' of a node (an unsigned char of value 1-255) is the number
> of pages that are allocated during a "weighted interleave" round.
> (See 'weighted interleave' for more details').
>
> =====================================================================
> (Patch 2) set_mempolicy: MPOL_WEIGHTED_INTERLEAVE
>
> Weighted interleave is a new memory policy that interleaves memory
> across numa nodes in the provided nodemask based on the weights
> described in patch 1 (sysfs global weights).
>
> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as having local DRAM and CXL memory together, the current round-robin
> based interleaving policy doesn't maximize the overall bandwidth
> because of their different bandwidth characteristics.
>
> Instead, the interleaving can be more efficient when the allocation
> policy follows each NUMA nodes' bandwidth weight rather than having 1:1
> round-robin allocation.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> which enables weighted interleaving between NUMA nodes.  Weighted
> interleave allows for a proportional distribution of memory across
> multiple numa nodes, preferablly apportioned to match the bandwidth
> capacity of each node from the perspective of the accessing node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
> appropriate weight distribution is (2:1).
>
> Weights will be acquired from the global weight array exposed by the
> sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/
>
> The policy will then allocate the number of pages according to the
> set weights.  For example, if the weights are (2,1), then 2 pages
> will be allocated on node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> =====================================================================
> (Patches 3-6) Refactoring mempolicy for code-reuse
>
> To avoid multiple paths of mempolicy creation, we should refactor the
> existing code to enable the designed extensibility, and refactor
> existing users to utilize the new interface (while retaining the
> existing userland interface).
>
> This set of patches introduces a new mempolicy_args structure, which
> is used to more fully describe a requested mempolicy - to include
> existing and future extensions.
>
> /*
>  * Describes settings of a mempolicy during set/get syscalls and
>  * kernel internal calls to do_set_mempolicy()
>  */
> struct mempolicy_args {
>     unsigned short mode;            /* policy mode */
>     unsigned short mode_flags;      /* policy mode flags */
>     nodemask_t *policy_nodes;       /* get/set/mbind */
>     int policy_node;                /* get: policy node information */
>     unsigned long addr;             /* get: vma address */
>     int addr_node;                  /* get: node the address belongs to */
>     int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
>     unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
> };
>
> This arg structure will eventually be utilized by the following
> interfaces:
>     mpol_new() - new mempolicy creation
>     do_get_mempolicy() - acquiring information about mempolicy
>     do_set_mempolicy() - setting the task mempolicy
>     do_mbind()         - setting a vma mempolicy
>
> do_get_mempolicy() is completely refactored to break it out into
> separate functionality based on the flags provided by get_mempolicy(2)
>     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
>     MPOL_F_ADDR: acquires information on vma policies
>     MPOL_F_NODE: changes the output for the policy arg to node info
>
> We refactor the get_mempolicy syscall flatten the logic based on these
> flags, and aloow for set_mempolicy2() to re-use the underlying logic.
>
> The result of this refactor, and the new mempolicy_args structure, is
> that extensions like 'sys_set_mempolicy_home_node' can now be directly
> integrated into the initial call to 'set_mempolicy2', and that more
> complete information about a mempolicy can be returned with a single
> call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
>
>
> =====================================================================
> (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>
> These interfaces are the 'extended' counterpart to their relatives.
> They use the userland 'struct mpol_args' structure to communicate a
> complete mempolicy configuration to the kernel.  This structure
> looks very much like the kernel-internal 'struct mempolicy_args':
>
> struct mpol_args {
>         /* Basic mempolicy settings */
>         __u16 mode;
>         __u16 mode_flags;
>         __s32 home_node;
>         __aligned_u64 pol_nodes;
>         __u64 pol_maxnodes;
>         __u64 addr;
>         __s32 policy_node;
>         __s32 addr_node;
>         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
> };

This looks unnecessarily complex.  I don't think that it's a good idea
to use exact same parameter for all 3 syscalls.

For example, can we use something as below?

  long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights,
                          unsigned long maxnode, unsigned long home_node,
                          unsigned long flags);

  long mbind2(unsigned long start, unsigned long len,
                          int mode, const unsigned long *nodemask, unsigned int *il_weights,
                          unsigned long maxnode, unsigned long home_node,
                          unsigned long flags);

A struct may be defined to hold mempolicy iteself.

struct mpol {
        int mode;
        unsigned int home_node;
        const unsigned long *nodemask;
        unsigned int *il_weights;
        unsigned int maxnode;
};

> The basic mempolicy settings which are shared across all interfaces
> are captured at the top of the structure, while extensions such as
> 'policy_node' and 'addr' are collected beneath.
>
> The syscalls are uniform and defined as follows:
>
> long sys_mbind2(struct iovec *vec, size_t vlen,
>                 struct mpol_args *args, size_t usize,
>                 unsigned long flags);
>
> long sys_get_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long flags);
>
> long sys_set_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long flags);
>
> The 'flags' argument for mbind2 is the same as 'mbind', except with
> the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
> field should be utilized.
>
> The 'flags' argument for get_mempolicy2 is the same as get_mempolicy.
>
> The 'flags' argument is not used by 'set_mempolicy' at this time, but
> may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
> is desired.
>
> The extensions can be summed up as follows:
>
> get_mempolicy2 extensions:
>     'mode', 'policy_node', and 'addr_node' can now be fetched with
>     a single call, rather than multiple with a combination of flags.
>     - 'mode' will always return the policy mode
>     - 'policy_node' will replace the functionality of MPOL_F_NODE
>     - 'addr_node' will return the node for 'addr' w/ MPOL_F_ADDR
>
> set_mempolicy2:
>     - task-local interleave weights can be set via 'il_weights'
>       (see next patch)
>
> mbind2:
>     - 'vec' and 'vlen' are sed to operate on multiple memory
>        ranges, rather than a single memory range per syscall.
>     - 'home_node' field sets policy home node w/ MPOL_MF_HOME_NODE
>     - task-local interleave weights can be set via 'il_weights'
>       (see next patch)
>

--
Best Regards,
Huang, Ying

[snip]