On Fri, Jan 26, 2024 at 03:10:49PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@xxxxxxxxx> writes:
>
> > +	} else if (pol == current->mempolicy &&
> > +		   (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> > +		if (pol->cur_il_weight)
> > +			*policy = current->il_prev;
> > +		else
> > +			*policy = next_node_in(current->il_prev,
> > +					       pol->nodes);
>
> It appears that my previous comments about this is ignored.
>
> https://lore.kernel.org/linux-mm/875xzkv3x2.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> Please correct me if I am wrong.
>

The fix is in the following patch. I'd originally planned to squash the
atomic patch into this one, but decided against it because it probably
warranted isolated scrutiny.

@@ -973,8 +974,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			*policy = next_node_in(current->il_prev, pol->nodes);
 		} else if (pol == current->mempolicy &&
 			   (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
-			if (pol->cur_il_weight)
-				*policy = current->il_prev;
+			int cweight = atomic_read(&pol->cur_il_weight);
+
+			if (cweight & 0xFF)
+				*policy = cweight >> 8;

In this version we return the node the weight applies to; otherwise we
return whatever comes after il_prev. I can pull this fix ahead.

> > +	/* if now at 0, move to next node and set up that node's weight */
> > +	if (unlikely(!policy->cur_il_weight)) {
> > +		me->il_prev = node;
> > +		next = next_node_in(node, policy->nodes);
> > +		rcu_read_lock();
> > +		table = rcu_dereference(iw_table);
> > +		/* detect system-default values */
> > +		weight = table ? table[next] : 1;
> > +		policy->cur_il_weight = weight ? weight : 1;
> > +		rcu_read_unlock();
> > +	}
>
> It appears that the code could be more concise if we allow
> policy->cur_il_weight == 0.  Duplicated code are in
> alloc_pages_bulk_array_weighted_interleave() too.  Anyway, can we define
> some function to reduce duplicated code.
>

This is complicated by the next patch, which places the node and the
weight into the same field to resolve the stale-weight issue. In that
patch, (cur_il_weight == 0) means "cur_il_weight invalid", because the
weight part can only be 0 when:

a) an error occurs during bulk allocation
b) a rebind event occurs

I'll take some time to think about whether we can do away with
task->il_prev (as your next patch's notes mentioned).

> > +	/* Otherwise we adjust nr_pages down, and continue from there */
> > +	rem_pages -= pol->cur_il_weight;
> > +	pol->cur_il_weight = 0;
>
> This break the rule to keep pol->cur_il_weight != 0 except after initial
> setup.  Is it OK?
>

The only way cur_il_weight can leave this function as 0 at this point is
if an error occurs (specifically, the kmalloc failure immediately after
this). If we don't clear cur_il_weight here, then we have a stale
weight, and the next allocation pass will over-allocate on the current
node.

This semantic also changes a bit in the next patch, but is basically the
same: if il_weight is 0, then either an error or a rebind event
occurred.

> > +		/* resume from this node w/ remaining weight */
> > +		resume_node = prev_node;
> > +		resume_weight = weight - (node_pages % weight);
>
> resume_weight = weight - delta; ?
>

ack

~Gregory
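
P.S. To make the encoding in the do_get_mempolicy() hunk above
explicit: the follow-on atomic patch packs the node into the upper bits
of cur_il_weight and the remaining weight into the low byte. A minimal
sketch of that encoding (the helper names are mine, not from the
patch):

/* pack a node id and its remaining weight into one atomic-sized int */
static inline int il_pack(int node, unsigned char weight)
{
	return (node << 8) | weight;
}

/* node the carried weight applies to */
static inline int il_node(int packed)
{
	return packed >> 8;
}

/* remaining weight on that node; 0 means "invalid, recalculate" */
static inline unsigned char il_weight(int packed)
{
	return packed & 0xFF;
}

So the hunk reads: if the low byte (weight) is nonzero, report the node
carried in the upper bits; otherwise fall through and report
next_node_in(current->il_prev, pol->nodes).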
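
P.P.S. On the duplication point: the "advance to the next node and
fetch its weight" step could plausibly be factored out and shared with
alloc_pages_bulk_array_weighted_interleave(). A rough, untested sketch
against the current (pre-atomic) field, just to show the shape:

/*
 * Return the node after @node in @pol->nodes via @next, and that
 * node's effective weight, treating a missing table or a zero entry
 * as the system default of 1.
 */
static unsigned char weighted_interleave_next(struct mempolicy *pol,
					      int node, int *next)
{
	u8 *table;
	u8 weight;

	*next = next_node_in(node, pol->nodes);
	rcu_read_lock();
	table = rcu_dereference(iw_table);
	/* detect system-default values */
	weight = table ? table[*next] : 1;
	rcu_read_unlock();
	return weight ? weight : 1;
}

The quoted block then collapses to:

	me->il_prev = node;
	policy->cur_il_weight = weighted_interleave_next(policy, node, &next);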