Gregory Price <gourry.memverge@xxxxxxxxx> writes:

> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be a wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA node's bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes. Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> For now, the default value of all nodes will be `1`, which matches
> the behavior of standard 1:1 round-robin interleave. An extension
> will be added in the future to allow default values to be registered
> at kernel and device bringup time.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> There are 3 integration points:
>
> weighted_interleave_nodes:
>     Counts the number of allocations as they occur, and applies the
>     weight for the current node. When the weight reaches 0, switch
>     to the next node.
>
> weighted_interleave_nid:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the node based on the given index.
>
> bulk_array_weighted_interleave:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the number of "interleave rounds" as
>     well as any delta ("partial round"). Calculates the number of
>     pages for each node and allocates them.
>
>     If a node was scheduled for interleave via interleave_nodes, the
>     current weight (pol->cur_weight) will be allocated first, before
>     the remaining bulk calculation is done.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actual application of the interleave. The
> calculation of the `interleave index` is done by `get_vma_policy()`,
> while the actual selection of the node will be later applied by the
> relevant weighted_interleave function.
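
For readers following along, the index-to-node mapping and the
"rounds plus delta" split described above can be sketched in userspace
roughly as follows. This is only an illustration under stated
assumptions, not the code from the patch: the weights are hard-coded to
the (2:1) example, both nodes are assumed to be in the nodemask, the
split starts at node 0 with no carried-over cur_weight, and the names
sketch_nid() and sketch_bulk_split() are made up for this example.

```c
#define NR_NODES 2	/* assumption: two nodes, both set in the nodemask */

/* Assumed weights for node 0 and node 1, matching the (2:1) example. */
static const unsigned char weights[NR_NODES] = { 2, 1 };

/* Sum of the weights of all nodes in the (assumed full) nodemask. */
static unsigned int total_weight(void)
{
	unsigned int nid, total = 0;

	for (nid = 0; nid < NR_NODES; nid++)
		total += weights[nid];
	return total;
}

/*
 * Hypothetical stand-in for the weighted_interleave_nid() idea:
 * reduce the interleave index modulo the total weight, then walk the
 * nodes, subtracting each node's weight until the target falls inside
 * one of them.
 */
static unsigned int sketch_nid(unsigned long ilx)
{
	unsigned long target = ilx % total_weight();
	unsigned int nid;

	for (nid = 0; nid < NR_NODES; nid++) {
		if (target < weights[nid])
			return nid;
		target -= weights[nid];
	}
	return 0;	/* not reached while every weight is >= 1 */
}

/*
 * Hypothetical stand-in for the bulk calculation: every full round
 * gives each node "weight" pages, and the partial round (delta) is
 * handed out in node order until it is used up.
 */
static void sketch_bulk_split(unsigned long nr_pages,
			      unsigned long pages[NR_NODES])
{
	unsigned long rounds = nr_pages / total_weight();
	unsigned long delta = nr_pages % total_weight();
	unsigned int nid;

	for (nid = 0; nid < NR_NODES; nid++) {
		unsigned long extra = delta < weights[nid] ? delta : weights[nid];

		pages[nid] = rounds * weights[nid] + extra;
		delta -= extra;
	}
}
```

With these weights, indexes 0 and 1 map to node 0 and index 2 maps to
node 1, and a 7-page bulk request splits into 5 pages on node 0 and 2
pages on node 1.
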
>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@xxxxxxx>
> Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>
> Co-developed-by: Rakie Kim <rakie.kim@xxxxxx>
> Signed-off-by: Rakie Kim <rakie.kim@xxxxxx>
> Co-developed-by: Honggyu Kim <honggyu.kim@xxxxxx>
> Signed-off-by: Honggyu Kim <honggyu.kim@xxxxxx>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@xxxxxxxxxx>
> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@xxxxxxxxxx>
> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx>
> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     |   9 +
>  include/linux/mempolicy.h                     |   5 +
>  include/uapi/linux/mempolicy.h                |   1 +
>  mm/mempolicy.c                                | 234 +++++++++++++++++-
>  4 files changed, 246 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
>  	can fall back to all existing numa nodes. This is effectively
>  	MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> +	This mode operates the same as MPOL_INTERLEAVE, except that
> +	interleaving behavior is executed based on weights set in
> +	/sys/kernel/mm/mempolicy/weighted_interleave/
> +
> +	Weighted interleave allocates pages on nodes according to a
> +	weight. For example if nodes [0,1] are weighted [5,2], 5 pages
> +	will be allocated on node0 for every 2 pages allocated on node1.
> +
>  NUMA memory policy supports the following optional mode flags:
>
>  MPOL_F_STATIC_NODES
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..c1a083eb0dd5 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,11 @@ struct mempolicy {
>  		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
>  		nodemask_t user_nodemask;	/* nodemask passed by user */
>  	} w;
> +
> +	/* Weighted interleave settings */
> +	struct {
> +		u8 cur_weight;
> +	} wil;
>  };
>
>  /*
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
>  	MPOL_PREFERRED_MANY,
> +	MPOL_WEIGHTED_INTERLEAVE,
>  	MPOL_MAX,	/* always last member of enum */
>  };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 427bddf115df..aa3b2389d3e0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
>   *                for anonymous memory. For process policy an process counter
>   *                is used.
>   *
> + * weighted interleave
> + *                Allocate memory interleaved over a set of nodes based on
> + *                a set of weights (per-node), with normal fallback if it
> + *                fails. Otherwise operates the same as interleave.
> + *                Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + *                on node 0 for every 1 page allocated on node 1.
> + *
>   * bind           Only allocate memory on a specific set of nodes,
>   *                no fallback.
>   *                FIXME: memory is allocated starting with the first node
> @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
>  	policy->mode = mode;
>  	policy->flags = flags;
>  	policy->home_node = NUMA_NO_NODE;
> +	policy->wil.cur_weight = 0;
>
>  	return policy;
>  }
> @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_preferred,
>  	},
> +	[MPOL_WEIGHTED_INTERLEAVE] = {
> +		.create = mpol_new_nodemask,
> +		.rebind = mpol_rebind_nodemask,
> +	},
>  };
>
>  static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
>  	old = current->mempolicy;
>  	current->mempolicy = new;
> -	if (new && new->mode == MPOL_INTERLEAVE)
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
>  		current->il_prev = MAX_NUMNODES-1;
>  	task_unlock(current);
>  	mpol_put(old);
> @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*nodes = pol->nodes;
>  		break;
>  	case MPOL_LOCAL:
> @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		} else if (pol == current->mempolicy &&
>  				pol->mode == MPOL_INTERLEAVE) {
>  			*policy = next_node_in(current->il_prev, pol->nodes);
> +		} else if (pol == current->mempolicy &&
> +				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> +			if (pol->wil.cur_weight)
> +				*policy = current->il_prev;
> +			else
> +				*policy = next_node_in(current->il_prev,
> +						       pol->nodes);

Per my understanding, we should always use "*policy = next_node_in()"
here, as in weighted_interleave_nodes().

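
To make the concern concrete, here is a small userspace model of the
state machine in weighted_interleave_nodes() as quoted further down.
It is a sketch under assumptions, not kernel code: the nodemask is
assumed to contain nodes 0 and 1 only, the weights are hard-coded to
(2,1) instead of being read from iw_table, next_node_in() is replaced
by a simple wrap-around helper, and all names (struct sim, sim_alloc(),
report_quoted(), report_always_next()) are invented for this example.

```c
#define NR_NODES 2	/* assumption: nodes 0 and 1, both in pol->nodes */

/* Assumed weights; the patch reads them from iw_table. */
static const unsigned char weights[NR_NODES] = { 2, 1 };

/* Models the two pieces of state used by the quoted code. */
struct sim {
	unsigned int il_prev;		/* current->il_prev */
	unsigned char cur_weight;	/* pol->wil.cur_weight */
};

/* Simplified next_node_in() for a full nodemask: wrap around. */
static unsigned int sim_next_node(unsigned int prev)
{
	return (prev + 1) % NR_NODES;
}

/* Mirrors the quoted weighted_interleave_nodes() logic step by step. */
static unsigned int sim_alloc(struct sim *s)
{
	unsigned int next = sim_next_node(s->il_prev);

	if (!s->cur_weight)
		s->cur_weight = weights[next];
	s->cur_weight--;
	if (!s->cur_weight)
		s->il_prev = next;	/* only advance once the weight is used up */
	return next;
}

/* What the quoted do_get_mempolicy() hunk would report. */
static unsigned int report_quoted(const struct sim *s)
{
	return s->cur_weight ? s->il_prev : sim_next_node(s->il_prev);
}

/* What is suggested above: always report next_node_in(il_prev). */
static unsigned int report_always_next(const struct sim *s)
{
	return sim_next_node(s->il_prev);
}
```

Starting from il_prev = NR_NODES - 1 and cur_weight = 0, the first
sim_alloc() returns node 0 and leaves cur_weight at 1 with il_prev
unchanged. The next allocation will again land on node 0, which is what
report_always_next() returns, while report_quoted() returns il_prev,
that is, node 1.
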
>  		} else {
>  			err = -EINVAL;
>  			goto out;
> @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  	pol = __get_vma_policy(vma, addr, ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -	if (pol->mode == MPOL_INTERLEAVE) {
> +	if (pol->mode == MPOL_INTERLEAVE ||
> +	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  		*ilx += vma->vm_pgoff >> order;
>  		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>  	}
> @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  	return zone >= dynamic_policy_zone;
>  }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> +	unsigned int next;
> +	struct task_struct *me = current;
> +	u8 __rcu *table;
> +
> +	next = next_node_in(me->il_prev, policy->nodes);
> +	if (next == MAX_NUMNODES)
> +		return next;
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(iw_table);
> +	if (!policy->wil.cur_weight)
> +		policy->wil.cur_weight = table ? table[next] : 1;
> +	rcu_read_unlock();
> +
> +	policy->wil.cur_weight--;
> +	if (!policy->wil.cur_weight)
> +		me->il_prev = next;
> +	return next;
> +}
> +

[snip]

--
Best Regards,
Huang, Ying