On Wed, Apr 14, 2021 at 02:50:53PM +0200, Michal Hocko wrote: > On Wed 17-03-21 11:40:00, Feng Tang wrote: > > From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> > > > > MPOL_PREFERRED honors only a single node set in the nodemask. Add the > > bare define for a new mode which will allow more than one. > > > > The patch does all the plumbing without actually adding the new policy > > type. > > > > v2: > > Plumb most MPOL_PREFERRED_MANY without exposing UAPI (Ben) > > Fixes for checkpatch (Ben) > > > > Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@xxxxxxxxx > > Co-developed-by: Ben Widawsky <ben.widawsky@xxxxxxxxx> > > Signed-off-by: Ben Widawsky <ben.widawsky@xxxxxxxxx> > > Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> > > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx> > > --- > > mm/mempolicy.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 40 insertions(+), 6 deletions(-) > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 2b1e0e4..1228d8e 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -31,6 +31,9 @@ > > * but useful to set in a VMA when you have a non default > > * process policy. > > * > > + * preferred many Try a set of nodes first before normal fallback. This is > > + * similar to preferred without the special case. > > + * > > * default Allocate on the local node first, or when on a VMA > > * use the process policy. This is what Linux always did > > * in a NUMA aware kernel and still does by, ahem, default. > > @@ -105,6 +108,8 @@ > > > > #include "internal.h" > > > > +#define MPOL_PREFERRED_MANY MPOL_MAX > > + > > /* Internal flags */ > > #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */ > > #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */ > > @@ -175,7 +180,7 @@ struct mempolicy *get_task_policy(struct task_struct *p) > > static const struct mempolicy_operations { > > int (*create)(struct mempolicy *pol, const nodemask_t *nodes); > > void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes); > > -} mpol_ops[MPOL_MAX]; > > +} mpol_ops[MPOL_MAX + 1]; > > > > static inline int mpol_store_user_nodemask(const struct mempolicy *pol) > > { > > @@ -415,7 +420,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new) > > mmap_write_unlock(mm); > > } > > > > -static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { > > +static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = { > > [MPOL_DEFAULT] = { > > .rebind = mpol_rebind_default, > > }, > > @@ -432,6 +437,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { > > .rebind = mpol_rebind_nodemask, > > }, > > /* [MPOL_LOCAL] - see mpol_new() */ > > + [MPOL_PREFERRED_MANY] = { > > + .create = NULL, > > + .rebind = NULL, > > + }, > > }; > > I do get that you wanted to keep MPOL_PREFERRED_MANY unaccessible for > the userspace but wouldn't it be much easier to simply check in two > syscall entries rather than playing thise MAX+1 games which make the > review more complicated than necessary? I will check this way, and currently the user input paramter handling are quite complex. Also the sanity check in kernel_mbind() and kernel_set_mempolicy() are almost identical, which can be unified. > > > > static int migrate_page_add(struct page *page, struct list_head *pagelist, > > @@ -924,6 +933,9 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes) > > case MPOL_INTERLEAVE: > > *nodes = p->v.nodes; > > break; > > + case MPOL_PREFERRED_MANY: > > + *nodes = p->v.preferred_nodes; > > + break; > > case MPOL_PREFERRED: > > if (!(p->flags & MPOL_F_LOCAL)) > > *nodes = p->v.preferred_nodes; > > Why those two do a slightly different thing? Is this because unlike > MPOL_PREFERRED it can never have MPOL_F_LOCAL cleared? If that is the > case I would still stick the two together and use the same code for > both to make the code easier to follow. Now that both use the same > nodemask it should really be just about syscall inputs sanitization and > to keep the original behavior for MPOL_PREFERRED. > > [...] Our intention is to make MPOL_PREFERRED_MANY be similar to MPOL_PREFERRED, except it perfers multiple nodes. So will try to achieve this in following version. Also for MPOL_LOCAL and MPOL_PREFERRED, current code logic is turning 'MPOL_LOCAL' to 'MPOL_PREFERRED' with MPOL_F_LOCAL set. I don't understand why not use the other way around, that turning MPOL_PREFERRED with empty nodemask to MPOL_LOCAL, which looks more logical. Thanks, Feng > > @@ -2072,6 +2087,9 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) > > task_lock(current); > > mempolicy = current->mempolicy; > > switch (mempolicy->mode) { > > + case MPOL_PREFERRED_MANY: > > + *mask = mempolicy->v.preferred_nodes; > > + break; > > case MPOL_PREFERRED: > > if (mempolicy->flags & MPOL_F_LOCAL) > > nid = numa_node_id(); > > Same here