On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
> On Tue 14-11-23 10:50:51, Gregory Price wrote:
> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
> [...]
> > > That being said, I still believe that a cgroup based interface is a much
> > > better choice over a global one. Cpusets seem to be a good fit as the
> > > controller does control memory placement wrt NUMA interfaces.
> > 
> > I think cpusets is a non-starter due to the global spinlock required when
> > reading information from it:
> > 
> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
> 
> Right, our current cpuset implementation indeed requires callback lock
> from the page allocator. But that is an implementation detail. I do not
> remember bug reports about the lock being a bottleneck though. If
> anything, cpusets lock optimizations would be a win also for users who do
> not want to use the weighted interleave interface.

Definitely agree, but that's a rather large increase of scope :[

We could consider a push-model similar to how cpuset nodemasks are pushed
down to mempolicies, rather than a pull-model of having mempolicy read
directly from cpusets, at least until the cpusets lock optimization is
undertaken.

This pattern looks like a wart to me, which is why I avoided it, but the
locking implications of the pull-model make me sad.

I would like to point out that Tejun pushed back on implementing weights
in cgroups (regardless of subcomponent), so I think we need to come to a
consensus on where this data should live in a "more global" context
(cpusets, memcg, nodes, etc.) before I go mucking around further.

So far we have:

* mempolicy: updating weights is a very complicated undertaking, and
  there is no (good) way to do this from outside the task. It would be
  better to have a coarser-grained control. A new syscall is likely
  needed to add/set weights in the per-task mempolicy, or we bite the
  bullet on set_mempolicy2 and make the syscall extensible for the
  future.
* memtiers: tier=node when devices are already interleaved or when all
  devices are different, so why add yet another layer of complexity if
  other constructs already exist? Additionally, you lose task-placement
  relative weighting (or it becomes very complex to implement).

* cgroups: "this doesn't involve dynamic resource accounting /
  enforcement at all" and "these aren't resource allocations, it's
  unclear what the hierarchical relationship mean".

* node: too global; explore a smaller scope first, then expand.

For now I think there is consensus that mempolicy should have weights
per-task regardless of how the more-global mechanism is defined, so I'll
go ahead and put up another RFC with some options on that in the next
week or so.

The limitation of the first pass will be that only the task itself is
capable of re-weighting should cpusets.mems or the nodemask change.

~Gregory