On Wed, Jan 24, 2024 at 09:51:20AM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@xxxxxxxxxxxx> writes: > > + if (new && (new->mode == MPOL_INTERLEAVE || > + new->mode == MPOL_WEIGHTED_INTERLEAVE)) > current->il_prev = MAX_NUMNODES-1; > task_unlock(current); > mpol_put(old); > > I don't think we need to change this. > Ah you're right it's set to MAX_NUMNODES-1 here, but NUMA_NO_NODE can be passed in as an argument to alloc_pages_bulk_array_mempolicy, like here: vm_area_alloc_pages() if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE) nr = alloc_pages_bulk_array_mempolicy(bulk_gfp, nr_pages_request, pages + nr_allocated); > > (cur_weight = 0) can happen in two scenarios: > > - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0) > > - weighted_interleave_nodes decrements it down to 0 > > > > Now that i'm looking at it - the second condition should not exist, and > > we can eliminate it. The logic in weighted_interleave_nodes is actually > > annoyingly unclear at the moment, so I'm going to re-factor it a bit to > > be more explicit. > > I am OK with either way. Just a reminder, the first condition may be > true in alloc_pages_bulk_array_weighted_interleave() and perhaps some > other places. > Yeah, the bulk allocator handles it correctly, it's just a matter of clarity for weighted_interleave_nodes. What isn't necessarily handled correctly is the rebind code. Rebind due to a cgroup/mems_allowed change can cause a stale weight to be carried. Basically cur_weight is not cleared, but the node it applied to may no longer be the next node when next_node_in() is called. The race condition is 1) exceedingly rare, and 2) not necessarily harmful, just inaccurate. The worst case scenario is that a node receives up to 255 additional allocations once after a rebind (but more likely 10-20). I was considering forcing the interleave forward like this: @@ -356,6 +361,10 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes) tmp = *nodes; pol->nodes = tmp; + + /* Weighted interleave policies are forced forward to the next node */ + if (pol->mode & MPOL_WEIGHTED_INTERLEAVE) + pol->wil.cur_weight = 0; } But this creates 2 race conditions when we read cur_weight and nodemask in the allocator path. Example 1: 1) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight) 2) rebind changes nodemask and { cur_weight = 0; } 3) bulk allocator sets pol->wil.cur_weight In this scenario, resume_weight is stale coming out of bulk allocations if the resume_node has been removed from the node mask. Example 2: 1) rebind changes nodemask 2) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight) 3) rebind sets { cur_weight = 0; } In this scenario, cur_weight is stale going into bulk allocations. Neither of these can force a violation of mems_allowed, just a mis-application of a weight. I'll need to think on this a bit. We can either leave this as-is, meaning the first allocation after a rebind may apply the wrong weight to a node, or we can try to track the current-interleave-node and validate next_node_in(mask) == current-interleave-node before leaving the allocator path (this may also be just as racey). turns out concurrent counting is still hard :] ~Gregory