Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
>> On Wed 01-11-23 12:58:55, Gregory Price wrote:
>> > Basically consider: `numactl --interleave=all ...`
>> >
>> > If `--weights=...`: when a node hotplug event occurs, there is no
>> > recourse for adding a weight for the new node (it will default to 1).
>>
>> Correct, and this is what I was asking about in an earlier email. How
>> much do we really need to consider this setup? Is this something nice
>> to have, or does the nature of the technology require it to be fully
>> dynamic and expect new nodes coming up at any moment?
>>
>
> Dynamic Capacity is expected to cause a numa node to change size (in
> number of memory blocks) rather than cause numa nodes to come and go, so
> maybe handling the full node hotplug is a bit of an overreach.

Will node max bandwidth change with the number of memory blocks?

> Good call, I'll stop considering this problem for now.
>
>> > If the node is removed from the system, I believe (need to validate
>> > this, but IIRC) the node will be removed from any registered cpusets.
>> > As a result, that falls down to mempolicy, and the node is removed.
>>
>> I do not think we do anything like that. Userspace might decide to
>> change the numa mask when a node is offlined, but I do not think we do
>> anything like that automagically.
>>
>
> mpol_rebind_policy is called by update_tasks_nodemask:
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
>
> which falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771
>
> /*
>  * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
>  * Call this routine anytime after node_states[N_MEMORY] changes.
>  * See cpuset_update_active_cpus() for CPU hotplug handling.
>  */
> static int cpuset_track_online_nodes(struct notifier_block *self,
>                                      unsigned long action, void *arg)
> {
>         schedule_work(&cpuset_hotplug_work);
>         return NOTIFY_OK;
> }
>
> void __init cpuset_init_smp(void)
> {
>         ...
>         hotplug_memory_notifier(cpuset_track_online_nodes,
>                                 CPUSET_CALLBACK_PRI);
> }
>
> This causes 1 of 3 situations:
> MPOL_F_STATIC_NODES: overwrite with (old & new)
> MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold+onto?)
> Default: either does a remap or replaces old with new.
>
> My assumption based on this is that a hot-unplugged node would be
> completely removed. It doesn't look like hot-add is handled at all, so
> I can just drop that entirely for now (except adding a default weight
> of 1 in case it is ever added in the future).
>
> I've been pushing against the weights being in memory-tiers.c for this
> reason, as a weight set per-tier is meaningless if a node disappears.
>
> Example: A tier has 2 nodes with some weight N split between them, such
> that interleave gives each node N/2 pages. If 1 node is removed, the
> remaining node gets N pages, which is twice the allocation. Presumably
> a node is an abstraction of 1 or more devices, therefore if the node is
> removed, the weight should change.

The per-tier weight can be defined as the interleave weight of each node
of the tier. A tier just groups NUMA nodes with similar performance; the
performance (including bandwidth) is still per-node in the context of a
tier. If we have multiple nodes in one tier, this makes weight
definition easier.
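To illustrate, here is a minimal userspace sketch of per-node weights
(illustrative only; struct node_weight and pick_node() are made-up names
for this example, not a kernel interface). With each node carrying its
own weight, hot-unplugging one node of a tier removes only that node's
share instead of doubling the share of the surviving node:

#include <stdio.h>

struct node_weight {
        int nid;        /* NUMA node id */
        int weight;     /* per-node interleave weight */
        int online;     /* cleared on hot-unplug */
};

/* Pick the node for the n-th page by walking the online weights. */
static int pick_node(const struct node_weight *nodes, int nr, long page)
{
        long total = 0, slot;
        int i;

        for (i = 0; i < nr; i++)
                if (nodes[i].online)
                        total += nodes[i].weight;
        if (total == 0)
                return -1;

        slot = page % total;
        for (i = 0; i < nr; i++) {
                if (!nodes[i].online)
                        continue;
                if (slot < nodes[i].weight)
                        return nodes[i].nid;
                slot -= nodes[i].weight;
        }
        return -1;
}

int main(void)
{
        /*
         * Node 0 is DRAM; nodes 2 and 3 form a tier whose weight
         * N = 4 is expressed per node as 2 + 2.
         */
        struct node_weight nodes[] = {
                { .nid = 0, .weight = 4, .online = 1 },
                { .nid = 2, .weight = 2, .online = 1 },
                { .nid = 3, .weight = 2, .online = 1 },
        };
        long p;

        for (p = 0; p < 8; p++)         /* 4:2:2 split */
                printf("page %ld -> node %d\n", p, pick_node(nodes, 3, p));

        /*
         * Unplug node 3: node 2 keeps weight 2, so the ratio becomes
         * 4:2.  A single per-tier weight of 4 would instead hand node
         * 2 all 4 slots, doubling its share.
         */
        nodes[2].online = 0;
        for (p = 0; p < 6; p++)
                printf("page %ld -> node %d\n", p, pick_node(nodes, 3, p));
        return 0;
}

With the weight stored per node, node removal degrades gracefully and no
tier-level recalculation is needed.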
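For reference, the "1 of 3 situations" quoted above can be modeled in
userspace roughly as below. This is a simplified stand-in, not the
kernel implementation (see mpol_rebind_policy() for the real logic);
the default case here only keeps the surviving intersection rather than
doing a true remap, and the MPOL_F_RELATIVE_NODES fold+onto case is
omitted:

#include <stdio.h>

/* A plain unsigned long stands in for nodemask_t. */

/* MPOL_F_STATIC_NODES: overwrite with (old & new). */
static unsigned long rebind_static(unsigned long user, unsigned long newmask)
{
        return user & newmask;
}

/*
 * Default: remap onto the new mask; if nothing of the old mask
 * survives, replace old with new.
 */
static unsigned long rebind_default(unsigned long cur, unsigned long newmask)
{
        unsigned long keep = cur & newmask;

        return keep ? keep : newmask;
}

int main(void)
{
        unsigned long policy = 0x6;     /* nodes 1 and 2 */
        unsigned long online = 0x2;     /* node 2 hot-unplugged */

        printf("static : %#lx\n", rebind_static(policy, online));  /* 0x2 */
        printf("default: %#lx\n", rebind_default(policy, online)); /* 0x2 */
        return 0;
}

Either way the unplugged node drops out of the policy entirely, which is
consistent with the conclusion above.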
> You could handle hotplug in tiers, but if a node being hotplugged forcibly
> removes the node from cpusets and mempolicy nodemasks, then it's
> irrelevant since the node can never get selected for allocation anyway.
>
> It's looking more like cgroups is the right place to put this.

Having a cgroup/task level interface doesn't prevent us from having a
system level interface to provide defaults for cgroups/tasks, where
performance information (e.g., from HMAT) can help define a reasonable
default automatically.

>> Moving the global policy to cgroups would make the main concern of
>> different workloads looking for different policies less problematic.
>> I didn't have much time to think that through, but the main question is
>> how to sanely define hierarchical properties of those weights? This is
>> more of a resource distribution than enforcement, so maybe a simple
>> inherit or overwrite (if you have more specific needs) semantic makes
>> sense and is sufficient.
>>
>
> As a user I would assume it would operate much the same way as other
> nested cgroups, which is inherit by default (with subsets) or an
> explicit overwrite that can't exceed the higher level settings.
>
> Weights could arguably allow different settings than capacity controls,
> but that could be an extension.
>
>> This is not as much about the code as it is about the proper interface,
>> because that will get cast in stone once introduced. It would be really
>> bad to realize that we have a global policy that doesn't fit well and
>> have a hard time working around it without breaking anybody.
>
> o7 I concur now. I'll take some time to rework this into a
> cgroups+mempolicy proposal based on my earlier RFCs.

--
Best Regards,
Huang, Ying