[RFC] new mempolicy syscalls (mempolicy2 and pidfd)

Gregory Price <gregory.price@xxxxxxxxxxxx> · Wed, 29 Nov 2023 16:56:01 -0500

Hi all,

I am sending this to pull together notes from a few months
of different proposals on mempolicy extensions, and some of
the feedback that has been given.

There has been some consensus on the implementation of new
mempolicy syscalls, which also gives us an opportunity to 
clean up some warts in the old ones.

This summarizes the current plan.

Hoping to get some feedback on the interfaces before we get
too far into building out the whole thing.  Ultimately the
extensions proposed here are not too complicated, but locking
in syscall interfaces is always contentious.

==========================================================
Background:

Presently there exist 4 syscall interfaces with which a user
can interact with the current task mempolicy, or the current
task's vma mempolicies:

set_mempolicy :
        set a task's mempolicy

get_mempolicy :
        4 possible operations w/ some variants
        * fetch task->mems_allowed
        * fetch task->policy (mode, flag, nodemask)
        * fetch vma->policy for address
                * possibly the node for that address as well
        * fetch next interleave node for mempolicy

set_mempolicy_home_node :
        set VMA's home node and possibly migrate

mbind :
        set VMA mempolicy and possibly migrate

There have been a few proposals for extending this:

1) make mempolicies modifiable from external tasks (pidfd)
        - both task and vma policies

2) make mempolicy more extensible (weighted interleave)
        - requires new fields, and therefore new syscall

3) Clean up set_mempolicy flag warts

This RFC proposes the following interfaces
        set_mempolicy2
        get_mempolicy2
        get_vma_mempolicy
        process_set_mempolicy
        process_get_mempolicy
        process_get_vma_mempolicy

And implies the eventual creation of the following:
        set_vma_mempolicy
        process_set_vma_mempolicy

This RFC captures 4 different, but related proposals.

Proposal 1: new interfaces should reduce flags
Proposal 2: making the mempolicies externally modifiable
Proposal 3: new syscall interfaces
Proposal 4: weighted interleave

==========================================================
Proposal 1: Reducing flags in favor of explicit interfaces

Right now, set_mempolicy and get_mempolicy are two very different
interfaces with very different behaviors based on mode flags and
syscall flags. Depending on the flags, they may not even access
the mempolicy at all!

For example, the behavior of get_mempolicy can be changed to
retrieve task->mems_allowed, or to replace the policy output
with a node.

Both of these are awful warts.  One expects a get_mempolicy
syscall to... get the mempolicy.  mems_allowed, however, is more
of a cgroups/cpuset policy: while it is related to and affects
the mempolicy, it is not actually part of (task->mempolicy) or
a vma mempolicy.

The defines for these flags are also confusing and not intuitive
at a glance:

Mode flags: or'd into the *mode* in set_mempolicy(mode, ...)
#define MPOL_F_STATIC_NODES     (1 << 15)
#define MPOL_F_RELATIVE_NODES   (1 << 14)
#define MPOL_F_NUMA_BALANCING   (1 << 13)

get_mempolicy Syscall flags: passed into get_mempolicy as an arg
#define MPOL_F_NODE     (1<<0)
#define MPOL_F_ADDR     (1<<1)
#define MPOL_F_MEMS_ALLOWED (1<<2)
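
To make the overloading concrete, here is a minimal userspace sketch
of how a legacy `mode` value carrying or'd-in mode flags could be
split into the two separate values.  The flag values are the ones
listed above and MPOL_INTERLEAVE uses its existing uapi value; the
SKETCH_MODE_FLAGS mask, struct split_mode, and split_legacy_mode()
are hypothetical names made up for this illustration, not part of
any proposed interface.

```c
#include <assert.h>

/* Mode-flag values copied from the defines above. */
#define MPOL_F_STATIC_NODES     (1 << 15)
#define MPOL_F_RELATIVE_NODES   (1 << 14)
#define MPOL_F_NUMA_BALANCING   (1 << 13)

/* MPOL_INTERLEAVE as numbered in the existing uapi header. */
#define MPOL_INTERLEAVE         3

/* Hypothetical mask for this sketch: every bit that is a mode flag. */
#define SKETCH_MODE_FLAGS \
        (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

/* Hypothetical holder for the two values once they are separated. */
struct split_mode {
        int mode;
        unsigned short mode_flags;
};

/* Split a legacy set_mempolicy() mode argument into mode and flags. */
static struct split_mode split_legacy_mode(int legacy_mode)
{
        struct split_mode out;

        assert(legacy_mode >= 0);
        out.mode_flags = (unsigned short)(legacy_mode & SKETCH_MODE_FLAGS);
        out.mode = legacy_mode & ~SKETCH_MODE_FLAGS;
        return out;
}
```

Passing (MPOL_INTERLEAVE | MPOL_F_STATIC_NODES) yields mode
MPOL_INTERLEAVE and mode_flags MPOL_F_STATIC_NODES.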

I'd like to do away with as many of the operation flags as possible,
if not all of them, and define new interfaces with explicit extensions
in an argument structure and/or separate syscalls altogether to
separate out the functionality.

Example:

struct mpol_args {
        int mode;
        unsigned short mode_flags;   /* STATIC/RELATIVE_NODES, etc */
        unsigned long *policy_nodemask;
        unsigned long policy_maxnode;
        unsigned long *mems_allowed; /* replace MPOL_F_MEMS_ALLOWED */
        unsigned long mems_maxnode;
        unsigned long policy_node;   /* replace MPOL_F_NODE */
        unsigned long vma_node;      /* replace MPOL_F_NODE */
        /* Replace MPOL_F_ADDR with a new syscall */
};

In this setup, the following is true:

a) mode and mode_flags operate the same as the original, except
   that the space is no longer shared.  Internally, these fields
   are stored separately in `struct mempolicy` anyway... so why
   is the user interface overloading mode? Fix it.

b) MPOL_F_ADDR (op_flag) is deprecated, instead implement
   get_vma_mempolicy(addr, mpol_args, size) (see next section)

c) MPOL_F_MEMS_ALLOWED is deprecated. Instead, add the field
   (args->mems_allowed).  If the pointer is set, fetch the
   mems allowed. If not, don't!

   (This also lets us refactor the old interface in terms of the
   new one, which is a bonus for maintainability).

d) MPOL_F_NODE is deprecated in favor of adding a policy_node field.
   policy_node is filled based on the policy.  For example, if the
   active policy is MPOL_INTERLEAVE, we set policy_node to the next node.

   if (pol->mode == MPOL_INTERLEAVE)
     policy_node = next_numa_node(task->il_prev, pol->nodes);

e) vma_node is fetched if a vma policy is accessed.  vma_node is
   otherwise ignored by get_mempolicy2, and is only filled by
   get_vma_mempolicy.

   Again, the inclusion in mpol_args allows us to refactor the
   old interface in terms of the new one to reduce maintenance
   issues - but that may come over time.
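
As a rough illustration of points (a) through (e), the following
userspace mock shows how a get_mempolicy2-style handler might fill
mpol_args based on which pointers the caller set.  This is only a
sketch under stated assumptions: struct mock_task and
mock_get_mempolicy2() are hypothetical stand-ins for kernel state,
masks are a single word, and the maxnode fields are ignored for
brevity.

```c
#include <assert.h>
#include <stddef.h>

#define MPOL_INTERLEAVE 3       /* existing uapi value */

/* Argument struct as proposed above. */
struct mpol_args {
        int mode;
        unsigned short mode_flags;
        unsigned long *policy_nodemask;
        unsigned long policy_maxnode;
        unsigned long *mems_allowed;
        unsigned long mems_maxnode;
        unsigned long policy_node;
        unsigned long vma_node;
};

/* Hypothetical stand-in for the task state the kernel would consult. */
struct mock_task {
        int mode;
        unsigned short mode_flags;
        unsigned long nodes;            /* policy nodemask, one word */
        unsigned long mems_allowed;     /* one word */
        unsigned long next_il_node;     /* next interleave node */
};

/* Fill args from the mock task; optional outputs only if requested. */
static int mock_get_mempolicy2(const struct mock_task *t,
                               struct mpol_args *args)
{
        assert(t && args);

        /* (a) mode and mode_flags no longer share one field. */
        args->mode = t->mode;
        args->mode_flags = t->mode_flags;

        if (args->policy_nodemask)
                args->policy_nodemask[0] = t->nodes;

        /* (c) mems_allowed is fetched only if the pointer is set. */
        if (args->mems_allowed)
                args->mems_allowed[0] = t->mems_allowed;

        /* (d) policy_node is filled based on the active policy. */
        if (t->mode == MPOL_INTERLEAVE)
                args->policy_node = t->next_il_node;

        /* (e) vma_node is left untouched by the task-level call. */
        return 0;
}
```

A caller that does not care about mems_allowed simply leaves the
pointer NULL, and no operation flag is needed to express that.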

Ultimately this gives us a single extensible struct with flexibility
on how to develop new interfaces around it.

Questions:
1) Do we use 1 struct (mpol_args), or multiple?  Specifically, do
   we define different structures for get/set, and do we define
   different structures for task vs vma mempolicies?

   Here I'm proposing 1 structure, because the core output (mode,
   flags, nodemask) is the same, but the location from where
   the data is fetched is different (task vs vma).

   Additionally, get/set can take the same arg structure and
   simply ignore the non-relevant fields. For example, the
   set_* interfaces would ignore (args->mems_allowed), because
   that's not mutable from this interface.

2) Having handled F_NODE, F_ADDR, and F_MEMS_ALLOWED like above,
   I don't see the need for (args->op_flags), or a flags argument
   as part of the syscall interface -  but folks love their flag
   fields - should it stay or be omitted?

   Personally I think this should be omitted, as it can always
   be appended later if really needed.

==========================================================
Proposal 2: Making mempolicy externally modifiable

This has been proposed and suggested by a variety of sources:

pidfd_set_mempolicy
https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@xxxxxxxxxxxxx/

process_mbind
https://lore.kernel.org/linux-mm/ZV50MX4STKRCohiB@xxxxxxxxxxxxxxxxxx/

*_task_mempolicy syscalls
https://lore.kernel.org/linux-mm/20231122211200.31620-1-gregory.price@xxxxxxxxxxxx/
https://lore.kernel.org/linux-mm/20231122211200.31620-8-gregory.price@xxxxxxxxxxxx/

Unfortunately the mbind/home_node interfaces have to be deferred to
a later patch set, due to the complexities of plumbing the task
reference through multiple sub-systems, but this does not majorly
affect the interface itself or its arguments, just the internal
plumbing.

Context on process_set_mempolicy_home_node and process_mbind:
https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@xxxxxxxxxxxx/

For this reason, I would forego the implementation of set_vma_mempolicy
for the time-being - especially since mbind and set_mempolicy_home_node
already exist.  The real value here is the process variant, which we
can take time to implement separately.

So the initial proposal is for the following:

/* Set the target task mempolicy */
process_set_mempolicy(int pidfd, struct mpol_args *args, size_t size);

/* Get the target task mempolicy */
process_get_mempolicy(int pidfd, struct mpol_args *args, size_t size);

/* Get the target task's vma mempolicy for address */
process_get_vma_mempolicy(int pidfd, unsigned long addr,
                          struct mpol_args *args, size_t size);

==========================================================
Proposal 3: New Syscalls

We are proposing the addition of 6 total new syscall interfaces
in the initial RFC, with some implications that we may add 2 more
in the future.

Current task interfaces:

set_mempolicy2(struct mpol_args *args, size_t usize);
get_mempolicy2(struct mpol_args *args, size_t usize);
get_vma_mempolicy(unsigned long addr, struct mpol_args *args, size_t usize);

Remote task interfaces:

process_set_mempolicy(int pidfd, struct mpol_args *args, size_t usize);
process_get_mempolicy(int pidfd, struct mpol_args *args, size_t usize);
process_get_vma_mempolicy(int pidfd, unsigned long addr,
                          struct mpol_args *args, size_t usize);

These interfaces allow for the retrieval of task or vma policies, but
only allow for the modification of *task* policies.

mbind/set_mempolicy_home_node allow for the mempolicy of VMAs to be
set for the current task, and the plumbing of remote-task vma policy
modification requires much deeper consideration (see proposal 2).
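
The text above does not spell out how the usize argument would be
interpreted.  One plausible reading is the extensible-struct
convention implemented by copy_struct_from_user() and used by
openat2 and clone3: a shorter (older) user struct is zero-extended,
and a longer (newer) one is accepted only if the bytes the kernel
does not know about are zero.  The userspace sketch below mimics
that behavior under this assumption; sketch_copy_struct() and the
args_v1/args_v2 structs are hypothetical names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical old and new layouts of an extensible arg struct. */
struct args_v1 {
        int mode;
};

struct args_v2 {
        int mode;
        int extra;      /* field appended in a later revision */
};

/*
 * Copy a caller-supplied struct of usize bytes into a ksize-byte
 * struct known to the callee, tolerating size mismatches.
 */
static int sketch_copy_struct(void *dst, size_t ksize,
                              const void *src, size_t usize)
{
        size_t size = usize < ksize ? usize : ksize;
        size_t i;

        assert(dst && src);

        if (usize > ksize) {
                /* Newer caller: unknown trailing bytes must be zero. */
                const unsigned char *tail = (const unsigned char *)src + ksize;

                for (i = 0; i < usize - ksize; i++)
                        if (tail[i])
                                return -E2BIG;
        } else if (usize < ksize) {
                /* Older caller: fields it does not know default to 0. */
                memset((unsigned char *)dst + size, 0, ksize - size);
        }

        memcpy(dst, src, size);
        return 0;
}
```

With this convention, adding a field such as interleave_weights to
mpol_args later would not require yet another syscall revision.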

==========================================================
Proposal 4: Weighted Interleave

This proposal is to create a weighted interleave policy, either as an
extension of MPOL_INTERLEAVE, or separately (MPOL_WEIGHTED_INTERLEAVE).

The default weight would be 1 for all possible nodes, so the default
behavior of MPOL_WEIGHTED_INTERLEAVE would be MPOL_INTERLEAVE, which
is why we eventually chose to implement this as just an extension of
MPOL_INTERLEAVE in some of the later RFCs below.

Basically this is summarized as two basic additions:

struct mempolicy {
        ...
        unsigned char cur_weight; /* weight of current il node */
        unsigned char *interleave_weights; /* size: MAX_NUMNODES */
};

and

struct mpol_args {
        ...
        unsigned char *interleave_weights;
};

By setting the weights of each node in nodemask, it's possible to
distribute allocations across those nodes based on the available
bandwidth of those nodes.

For example node 0 may provide 100GB/s of bandwidth, while node 1
may only provide 50GB/s of bandwidth.  In this case, traditional
1:1 interleave is a sub-optimal distribution of memory.  Setting
the weights to 2:1 would match the bandwidth distribution between
nodes (100:50) and therefore be closer to the optimal distribution.
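
The 2:1 example can be sketched in userspace as a weighted
round-robin over the nodemask.  This is a simplified illustration,
not the kernel implementation: struct il_state, SKETCH_MAX_NODES,
and both helper functions are hypothetical, the nodemask is a single
word, and a weight of 0 is treated as 1 so the walk always makes
progress.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 4      /* hypothetical small node count */

/* Hypothetical per-policy interleave state. */
struct il_state {
        unsigned int cur_node;          /* node currently being filled */
        unsigned char cur_weight;       /* allocations left on cur_node */
        unsigned char weights[SKETCH_MAX_NODES];
        unsigned long nodemask;         /* bit n set: node n is allowed */
};

/* Next allowed node after 'node', wrapping around the mask. */
static unsigned int sketch_next_node(unsigned int node, unsigned long mask)
{
        do {
                node = (node + 1) % SKETCH_MAX_NODES;
        } while (!(mask & (1UL << node)));
        return node;
}

/* Pick the node for the next allocation under weighted interleave. */
static unsigned int weighted_interleave_next(struct il_state *st)
{
        /* At least one valid node must be set or the walk never ends. */
        assert(st->nodemask & ((1UL << SKETCH_MAX_NODES) - 1));

        if (st->cur_weight == 0) {
                /* Current node is used up: advance and reload weight. */
                st->cur_node = sketch_next_node(st->cur_node, st->nodemask);
                st->cur_weight = st->weights[st->cur_node];
                if (st->cur_weight == 0)
                        st->cur_weight = 1;
        }
        st->cur_weight--;
        return st->cur_node;
}
```

Starting with cur_node set to the last allowed node and cur_weight
at 0, weights of {2, 1} on nodes 0 and 1 produce the allocation
sequence 0, 0, 1, 0, 0, 1, and so on, which places two thirds of
the allocations on node 0.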

Original N:M implementation:
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

original mempolicy RFC:
https://lore.kernel.org/linux-mm/20231003002156.740595-1-gregory.price@xxxxxxxxxxxx/

memtier implementation:
https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@xxxxxxxxxxxx/

node-weight implementation:
https://lore.kernel.org/linux-mm/20231031003810.4532-1-gregory.price@xxxxxxxxxxxx/

cgroups/memcg implementation:
https://lore.kernel.org/linux-mm/20231109002517.106829-1-gregory.price@xxxxxxxxxxxx/

Summary of some of this work on LWN:
https://lwn.net/Articles/948037/

The general feedback across these RFCs has been that there is no
consensus on where/whether "global weights" should exist, but consistent
consensus that mempolicy->weights seems like a reasonable idea.

So we are proposing this extension first, before exploring a global
setting more generally.

Open Questions:
1) MPOL_INTERLEAVE extension or new MPOL_WEIGHTED_INTERLEAVE ?
2) Global weight location (not relevant for this set of proposals)
        cgroup/memcg   - "doesn't belong"
        cgroup/cpusets - locking contention issue
        node - "too broad"
        memtier - "doesn't make sense / too broad"
        new sysfs entry entirely separate from the above?

==========================================================

Summarizing Open questions:

1) single arg struct (struct mpol_args) or multiple?

2) do away with operation flags (F_NODE, F_ADDR, F_MEMS_ALLOWED), or keep
   them?  (retained for old interface, just deprecated on new one).

3) Should these syscalls take a flags argument (outside mode_flags)?

4) split task and vma mempolicy operations, or single interface for both?

5) Are you grossly offended by any of this? Do you have specific
   recommendations?

6) Any specific testing requirements you would like to see to make
   this quick and painless? (ktest, ltp somewhat implied).

==========================================================

Capturing all the suggested-by tags for folks who chimed in on
prior RFCs and Patches.

Suggested-by: Gregory Price <gregory.price@xxxxxxxxxxxx>
Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Suggested-by: Hasan Al Maruf <hasanalmaruf@xxxxxx>
Suggested-by: Hao Wang <haowang3@xxxxxx>
Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
Suggested-by: Dan Williams <dan.j.williams@xxxxxxxxx>
Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
Suggested-by: tj <tj@xxxxxxxxxx>
Suggested-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
Suggested-by: Frank van der Linden <fvdl@xxxxxxxxxx>
Suggested-by: John Groves <john@xxxxxxxxxxxxxx>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@xxxxxxxxxx>
Suggested-by: Srinivasulu Thanneeru <sthanneeru@xxxxxxxxxx>
Suggested-by: Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>

Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>

Kind Regards,
Gregory Price