On Fri, Nov 10, 2023 at 10:05:57AM +0100, Michal Hocko wrote:
> On Thu 09-11-23 11:34:01, Gregory Price wrote:
> [...]
> > Anyway, summarizing: After a bit of reading, this does seem to map
> > better to the "accounting consumption" subsystem than the "constrain"
> > subsystem. However, if you think it's better suited for cpuset, I'm
> > happy to push in that direction.
>
> Maybe others see it differently but I stick with my previous position.
> Memcg is not a great fit for reasons already mentioned - most notably
> that the controller doesn't control the allocation but accounting what
> has been already allocated. Cpusets on the other hand constrains the
> allocations and that is exactly what you want to achieve.
> --
> Michal Hocko
> SUSE Labs

Digging in a bit, placing it in cpusets has locking requirements that
concern me.  Maybe I'm being a bit over-cautious, so if none of this
matters, I'll go ahead and swap the code over to cpusets.  Otherwise,
this is just more food for thought on cpusets vs memcg.

cpuset.c states that acquiring read-only access to cpuset structures
requires taking the (global) callback lock:

https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L391

 * There are two global locks guarding cpuset structures - cpuset_mutex and
 * callback_lock. We also require taking task_lock() when dereferencing a
 * task's cpuset pointer. See "The task_lock() exception", at the end of this
 * comment.

Examples:

cpuset_node_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4780

        spin_lock_irqsave(&callback_lock, flags);
        rcu_read_lock();
        cs = nearest_hardwall_ancestor(task_cs(current));  <-- walks parents
        allowed = node_isset(node, cs->mems_allowed);
        rcu_read_unlock();
        spin_unlock_irqrestore(&callback_lock, flags);

cpuset_mems_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4679

        spin_lock_irqsave(&callback_lock, flags);
        rcu_read_lock();
        guarantee_online_mems(task_cs(tsk), &mask);  <-- walks parents
        rcu_read_unlock();
        spin_unlock_irqrestore(&callback_lock, flags);

It seems apparent that any form of parent walk in cpusets requires
acquiring &callback_lock.  This does not appear to be true of memcg.
Implementing an inheritance structure like the one described in this
patch set would therefore acquire the callback lock during node
selection.  So if we want this in cpuset, we're going to eat that lock
acquisition despite not really needing it.

I was not intending to do checks against cpusets.mems_allowed when
acquiring weights, as this is already enforced between cpusets and
mempolicy on hotplug and mask changes, as well as in the allocators
via read_mems_allowed_begin/retry.  This is why I said this was *not*
a constraining feature.

Additionally, if the node selected by mpol is exhausted, the allocator
will simply acquire memory from another (allowed) node, disregarding
the weights entirely - which is the correct / expected behavior.
Another example of "this is more of a suggestion" rather than a
constraint.

So I'm contending here that putting it in cpusets is overkill.  But if
it likewise doesn't fit in memcg, is it insane to suggest that maybe
we should consider adding cgroup.mpol, and consider migrating features
from mempolicy.c into cgroups (while keeping mpol the way it is)?

~Gregory
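
P.S. To make the locking concern concrete, here is a rough sketch of
what a cpuset-side weight lookup might look like if it followed the
same read-side pattern as cpuset_node_allowed() above.  This is not
code from the patch set: cpuset_node_weight(), cs->weighted_nodes and
cs->node_weights are made-up names purely for illustration; only the
locking pattern is taken from the existing helpers.

        /*
         * Hypothetical sketch: weight lookup with a parent walk, using
         * the same read-side locking as cpuset_node_allowed().  The
         * helper and the weight fields do not exist today.
         */
        static unsigned char cpuset_node_weight(int node)
        {
                struct cpuset *cs;
                unsigned char weight = 1;
                unsigned long flags;

                /* global lock, taken on every node selection */
                spin_lock_irqsave(&callback_lock, flags);
                rcu_read_lock();

                /* walk parents until one defines a weight for this node */
                for (cs = task_cs(current); cs; cs = parent_cs(cs)) {
                        if (node_isset(node, cs->weighted_nodes)) {
                                weight = cs->node_weights[node];
                                break;
                        }
                }

                rcu_read_unlock();
                spin_unlock_irqrestore(&callback_lock, flags);
                return weight;
        }

The point being that the spin_lock_irqsave() on the global
callback_lock would land on the allocation path for every weighted
node selection, whereas an RCU-only hierarchy walk (as memcg does for
its read side) would not need it.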