On Thu, 9 Jan 2025 09:18:18 -0800 Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:

> On Thu, 9 Jan 2025 10:56:20 -0500 Gregory Price <gourry@xxxxxxxxxx> wrote:
>
> > On Wed, Jan 08, 2025 at 10:19:19AM +0900, Hyeonggon Yoo wrote:
> > > Hi, hope you all had a nice year-end holiday :)
> > >
> > ... snip ...
> > > Please let me know if there's any point we discussed that I am missing.
> > >
> > > Additionally, I would like to mention that in an internal discussion,
> > > my colleague Honggyu suggested introducing a 'mode' parameter, which
> > > can be either 'manual' or 'auto', instead of 'use_defaults', to
> > > provide a more intuitive interface.
> > >
> > > With Honggyu's suggestion and the points we've discussed,
> > > I think the interface could be:
> > >
> > > # At boot, the mode is 'auto', where the kernel can automatically
> > > # update any weights.
> > >
> > > mode auto               # User hasn't specified any weight yet.
> > > effective [2, 1, -, -]  # Using system defaults for nodes 0-1;
> > >                         # nodes 2-3 are not populated yet.
> > >
> > > # When a new NUMA node is added (e.g. via hotplug) in 'auto' mode,
> > > # all weights are re-calculated based on the ACPI HMAT table,
> > > # including the weight of the new node.
> > >
> > > mode auto               # User hasn't specified any weight yet.
> > > effective [2, 1, 1, -]  # Using system defaults for nodes 0-2;
> > >                         # node 3 is not populated yet.
> > >
> > > # When the user sets at least one weight value, the mode changes to
> > > # 'manual', where the kernel does not update any weights
> > > # automatically without the user's consent.
> > >
> > > mode manual             # User changed the weight of node 0 to 4,
> > >                         # switching the mode to manual configuration.
> > > effective [4, 1, 1, -]
> > >
> > > # When a new NUMA node is added (e.g. via hotplug) in manual mode,
> > > # the new node's weight is zero, because the mode is manual and the
> > > # user has not specified a weight for the new node yet.
> > >
> > > mode manual
> > > effective [4, 1, 1, 0]
> > >
> >
> > 0's cannot show up in the effective list - the allocators can never
> > perceive a 0, as there are (race) conditions where that may cause a
> > div0.
> >
> > The actual content of the list may be 0, but the allocator will see '1'.
> >
> > IIRC this was due to lock/sleep limitations in the allocator paths and
> > accessing this RCU-protected memory. If someone wants to take another
> > look at the allocator paths and characterize the risk more explicitly,
> > this would be helpful.
>
> Hi Gregory and Hyeonggon,
>
> Based on a quick look, I see that there can be a problematic scenario
> in alloc_pages_bulk_array_weighted_interleave, where we sum up all the
> weights from iw_table and divide by this sum. This _can_ be problematic
> for two reasons, one of them being the div0 mentioned above.
>
> Currently, you can access the weights in one of two ways.
>
> The first way is to call get_il_weight, which retrieves a specified
> node's weight under an RCU read lock. The function first checks whether
> the value at iw_table[nid] is 0 and, if it is, returns 1. Although this
> prevents a div0 scenario by ensuring that all weights are nonzero,
> there is a coherency problem, since each call to get_il_weight takes
> its own RCU read lock. Retrieving node weights in a loop therefore
> creates a race condition in which iw_table may be replaced between
> iterations of the loop.
>
> The second way is to directly dereference iw_table under an RCU read
> lock, copy its contents locally, and then drop the lock.
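>
> To make the difference concrete, here is a minimal sketch of the two
> access patterns. It is simplified and paraphrased from mm/mempolicy.c
> rather than being the literal code, and snapshot_weights is a made-up
> helper name (the allocator open-codes this copy loop):
>
> 	/* RCU-protected global weight table, as declared in mm/mempolicy.c */
> 	static u8 __rcu *iw_table;
>
> 	/*
> 	 * Pattern 1: per-node lookup. Each call takes its own RCU read
> 	 * lock, so iw_table may be replaced between two calls, and a
> 	 * stored 0 is reported as the system default of 1.
> 	 */
> 	static u8 get_il_weight(int node)
> 	{
> 		u8 *table;
> 		u8 weight;
>
> 		rcu_read_lock();
> 		table = rcu_dereference(iw_table);
> 		/* No table, or a 0 entry, means "system default" (1) */
> 		weight = table ? table[node] : 1;
> 		weight = weight ? weight : 1;
> 		rcu_read_unlock();
> 		return weight;
> 	}
>
> 	/*
> 	 * Pattern 2: snapshot every weight under a single RCU read lock,
> 	 * so all values come from the same version of iw_table. Note
> 	 * that the raw values are copied: a stored 0 stays 0 here.
> 	 */
> 	static void snapshot_weights(u8 *weights, const nodemask_t *nodes)
> 	{
> 		u8 *table;
> 		int node;
>
> 		rcu_read_lock();
> 		table = rcu_dereference(iw_table);
> 		for_each_node_mask(node, *nodes)
> 			weights[node] = table ? table[node] : 1;
> 		rcu_read_unlock();
> 	}
>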
> This snapshot approach is how alloc_pages_bulk_array_weighted_interleave
> currently calculates the sum. The problem here is that even though we
> solve the coherency issue, there is no check to ensure that the sum is
> nonzero. Thus, while an array of weights [0,0,0,0] is reported as
> [1,1,1,1] when each node is inspected individually via get_il_weight,
> the raw zeros are kept in the local copy, so the total can be 0 and
> lead to a div0 here.
>
> There are a few workarounds:
> - Check that weight_total != 0 before performing the division.
> - During the weight-sum iteration, add weights[node] ? weights[node] : 1,
>   as is done within get_il_weight.
> - Prevent users from ever storing 0 into a node.
>
> Of course, we could implement all three of these changes to make sure
> that there are no unfortunate div0s. However, there are realistic
> scenarios where we may want a node to actually have a weight of 0, so
> perhaps it makes sense to do just the first two checks. I can write up
> a quick patch to perform these checks, if that looks good to everyone.
>
> Please let me know if I missed anything as well.

On second thought, the second bullet point doesn't make much sense if we
expect 0 to be a valid value for a node. Here is something that could
work for the first bullet point, though. I can send this as a separate
patch, since it is not explicitly related to this thread.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cb355bdcdd12..afb0f2a7bd4f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2552,10 +2552,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	 * if (rounds > 0) and (delta == 0), resume_node will always be
 	 * the node following prev_node and its weight.
 	 */
-	rounds = rem_pages / weight_total;
-	delta = rem_pages % weight_total;
 	resume_node = next_node_in(prev_node, nodes);
 	resume_weight = weights[resume_node];
+	if (weight_total == 0)
+		goto out;
+
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
 	for (i = 0; i < nnodes; i++) {
 		node = next_node_in(prev_node, nodes);
 		weight = weights[node];
@@ -2582,6 +2585,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 			break;
 		prev_node = node;
 	}
+
+out:
 	me->il_prev = resume_node;
 	me->il_weight = resume_weight;
 	kfree(weights);

Of course, the only way this can happen is if a user purposefully sets
all of the node weights to 0, so I don't think this is something that
should ever happen naturally. Even with the new reduce_interleave_weights
function, there is an explicit check ensuring that the lowest possible
value is 1.

Again, please let me know if I am missing anything!
Joshua