Hi, Joshua,

Sorry for the late reply.

Joshua Hahn <joshua.hahnjy@xxxxxxxxx> writes:

> On Thu, 9 Jan 2025 09:18:18 -0800 Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
>
>> On Thu, 9 Jan 2025 10:56:20 -0500 Gregory Price <gourry@xxxxxxxxxx> wrote:
>>
>> > On Wed, Jan 08, 2025 at 10:19:19AM +0900, Hyeonggon Yoo wrote:
>> > > Hi, hope you all had a nice year-end holiday :)
>> > >
>> > ... snip ...
>> > > Please let me know if there's any point we discussed that I am missing.
>> > >
>> > > Additionally, I would like to mention that in an internal discussion
>> > > my colleague Honggyu suggested introducing a 'mode' parameter, which
>> > > can be either 'manual' or 'auto', instead of 'use_defaults', to
>> > > provide a more intuitive interface.
>> > >
>> > > With Honggyu's suggestion and the points we've discussed,
>> > > I think the interface could be:
>> > >
>> > > # At boot, the mode is 'auto', where the kernel can automatically
>> > > # update any weights.
>> > >
>> > > mode auto               # User hasn't specified any weight yet.
>> > > effective [2, 1, -, -]  # Using system defaults for nodes 0-1;
>> > >                         # nodes 2-3 are not populated yet.
>> > >
>> > > # When a new NUMA node is added (e.g. via hotplug) in 'auto' mode,
>> > > # all weights are re-calculated based on the ACPI HMAT table,
>> > > # including the weight of the new node.
>> > >
>> > > mode auto               # User hasn't specified weights yet.
>> > > effective [2, 1, 1, -]  # Using system defaults for nodes 0-2;
>> > >                         # node 3 is not populated yet.
>> > >
>> > > # When the user sets at least one weight value, switch to 'manual'
>> > > # mode, where the kernel does not update any weights automatically
>> > > # without the user's consent.
>> > >
>> > > mode manual             # User changed the weight of node 0 to 4,
>> > >                         # switching to manual config mode.
>> > > effective [4, 1, 1, -]
>> > >
>> > > # When a new NUMA node is added (e.g. via hotplug) in manual mode,
>> > > # the new node's weight is zero, because we are in manual mode and
>> > > # the user has not specified a weight for the new node yet.
>> > >
>> > > mode manual
>> > > effective [4, 1, 1, 0]
>> > >
>> >
>> > 0's cannot show up in the effective list - the allocators can never
>> > perceive a 0, as there are (race) conditions where that may cause a
>> > div0.
>> >
>> > The actual content of the list may be 0, but the allocator will see
>> > '1'.
>> >
>> > IIRC this was due to lock/sleep limitations in the allocator paths
>> > and accessing this RCU-protected memory. If someone wants to take
>> > another look at the allocator paths and characterize the risk more
>> > explicitly, this would be helpful.
>>
>> Hi Gregory and Hyeonggon,
>>
>> Based on a quick look, I see that there can be a problematic scenario
>> in alloc_pages_bulk_array_weighted_interleave, where we sum up all
>> the weights from iw_table and divide by this sum. This _can_ be
>> problematic for two reasons, one of them being the div0 mentioned.
>>
>> Currently, you can access the weights in one of two ways.
>>
>> The first way is to call get_il_weight, which retrieves a specified
>> node's weight under an RCU read lock. Within this function, it first
>> checks whether the value at iw_table[nid] is 0, and if it is, returns
>> 1. Although this prevents a div0 scenario by ensuring that all
>> weights are nonzero, there is a coherency problem, since each call to
>> get_il_weight opens a new RCU read-side critical section. Therefore,
>> retrieving node weights within a loop creates a race condition in
>> which the state of iw_table may change between iterations of the
>> loop.
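(As a side note for readers following along: get_il_weight looks roughly
like the sketch below. This is my paraphrase of mm/mempolicy.c, not the
verbatim upstream code, with details elided.)

    static u8 get_il_weight(int node)
    {
            u8 *table;
            u8 weight;

            rcu_read_lock();
            table = rcu_dereference(iw_table);
            /* no table yet, or a stored 0: fall back to the default of 1 */
            weight = table ? table[node] : 1;
            rcu_read_unlock();

            /*
             * Each call opens and closes its own RCU read-side critical
             * section, so two calls in a loop may observe two different
             * versions of iw_table; this is the coherency problem Joshua
             * describes above.
             */
            return weight ? weight : 1;
    }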
>> The second way is to directly dereference iw_table under an RCU lock,
>> copy its contents locally, then release the lock. This is how
>> alloc_pages_bulk_array_weighted_interleave currently calculates the
>> sum. The problem here is that even though we solve the coherency
>> issue, there is no check to ensure that this sum is nonzero. Thus,
>> while an array of weights [0,0,0,0] is translated into [1,1,1,1] when
>> each node is inspected individually using get_il_weight, the weights
>> are still stored internally as 0 and can lead to a div0 here.
>>
>> There are a few workarounds:
>> - Check that weight_total != 0 before performing the division.
>> - During the weight-sum iteration, add weights[node] ? weights[node] : 1,
>>   as is done within get_il_weight.
>> - Prevent users from ever storing 0 for a node.
>>
>> Of course, we can implement all three of these changes to make sure
>> that there are no unfortunate div0s. However, there are realistic
>> scenarios where we may want a node to actually have a weight of 0, so
>> perhaps it makes sense to implement just the first two checks. I can
>> write up a quick patch to perform these checks, if that looks good to
>> everyone.
>>
>> Please let me know if I missed anything as well.
>
> On second thought, the second bullet point doesn't make much sense if
> we expect nodes to have 0 as a valid value. Here is something that
> could work for the first bullet point, though. I can send this as a
> separate patch, since it is not explicitly related to this thread.
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index cb355bdcdd12..afb0f2a7bd4f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2552,10 +2552,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  	 * if (rounds > 0) and (delta == 0), resume_node will always be
>  	 * the node following prev_node and its weight.
>  	 */
> -	rounds = rem_pages / weight_total;
> -	delta = rem_pages % weight_total;
>  	resume_node = next_node_in(prev_node, nodes);
>  	resume_weight = weights[resume_node];
> +	if (weight_total == 0)
> +		goto out;
> +
> +	rounds = rem_pages / weight_total;
> +	delta = rem_pages % weight_total;
>  	for (i = 0; i < nnodes; i++) {
>  		node = next_node_in(prev_node, nodes);
>  		weight = weights[node];
> @@ -2582,6 +2585,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  			break;
>  		prev_node = node;
>  	}
> +
> +out:
>  	me->il_prev = resume_node;
>  	me->il_weight = resume_weight;
>  	kfree(weights);
>
> Of course, the only way this can happen is if a user purposefully sets
> all of the node weights to 0, so I don't think this is something that
> should ever happen naturally. Even with the new
> reduce_interleave_weights function, it manually checks and makes sure
> that the lowest possible value is 1.
>
> Again, please let me know if I am missing anything!

I don't think that "0" is a valid weight value. If you don't want to
allocate pages on some nodes, just don't specify them in the node mask
parameter when you call mbind() or set_mempolicy().

---

Best Regards,
Huang, Ying
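P.S. To make the node mask suggestion concrete, here is a minimal
userspace sketch (my own illustration, untested; it assumes libnuma's
<numaif.h>, and it defines MPOL_WEIGHTED_INTERLEAVE itself for older
headers, using the value from uapi/linux/mempolicy.h in v6.9+):

    #include <numaif.h>         /* set_mempolicy(); link with -lnuma */
    #include <stdio.h>

    #ifndef MPOL_WEIGHTED_INTERLEAVE
    #define MPOL_WEIGHTED_INTERLEAVE 6
    #endif

    int main(void)
    {
            /*
             * Interleave across nodes 0 and 1 only. A node we never want
             * to allocate from (e.g. node 2) is simply left out of the
             * mask, instead of being given a weight of 0.
             */
            unsigned long nodemask = (1UL << 0) | (1UL << 1);

            if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
                              8 * sizeof(nodemask))) {
                    perror("set_mempolicy");
                    return 1;
            }

            /*
             * Anonymous memory touched from here on is distributed across
             * nodes 0 and 1 in proportion to the sysfs weights under
             * /sys/kernel/mm/mempolicy/weighted_interleave/.
             */
            return 0;
    }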